
Perceptual restoration of degraded speech: The auditory system successfully integrates speech fragments into a meaningful stream, but not always


Recent Studies at Faculty of Design

Department of Acoustic Design, Faculty of Design
 Kazuo UEDA, Associate Professor

We often have to pick out target speech that is partly masked by other sounds. Visual examples help to explain such a case. Reading sentences becomes extremely difficult when about half of each character is erased (Fig. 1). By contrast, if the same sentences are masked by black bars, reading is less troublesome (Fig. 2). A situation in which part of a target is covered by another object is familiar to us, and the brain automatically restores the covered part where possible. A similar effect has been found in audition (interrupted speech with masking noise) and was named the picket fence effect (Miller and Licklider, 1950). A caveat is in order here, however: the horizontal axis in the visual examples corresponds to time in the auditory examples. Thus, the nature of the horizontal axis differs between vision and audition.

Figure 1. An example of interrupted English sentences. About half of the characters are erased. The interruption severely hampers reading. The original sentences were taken from Sagan (1996).

Figure 2. The same sentences as in Figure 1 are masked by black bars of the same width and interval as the white bars in Figure 1. The masking also hampers reading, but the difficulty is somewhat milder compared to the case in Figure 1. The original sentences were taken from Sagan (1996).

Returning to the visual examples, what happens if the interruption or masking occurs not only in the horizontal direction but also in the vertical direction? Examples of erasing and masking with a checkerboard pattern are shown in Figures 3 and 4. You may find these difficult to read as well. At the same time, you can easily imagine that the readability of such examples depends largely on the size of the checkerboard pattern. In audition, what corresponds to the horizontal axis in the visual examples is time, and the vertical axis corresponds to frequency (Fig. 5). However, because the nature of the axes in audition differs from that in vision, it is not easy to predict how we perceive auditory stimuli erased or masked by a checkerboard pattern. Therefore, we conducted the first systematic study on the intelligibility of checkerboard speech stimuli, using meaningful sentences and varying the segment duration and the number of frequency bands (Fig. 6; Ueda et al., 2021). The results were surprising (Fig. 7). When the frequency axis was divided into 20 frequency bands, the intelligibility of the checkerboard speech stimuli was perfect, irrespective of the segment duration. However, when the frequency axis was divided into 2 or 4 frequency bands (Ueda and Nakajima, 2017), the intelligibility was generally lower than that of the interrupted speech stimuli with the corresponding segment duration and reached its minimum at the 160-ms segment duration. This pattern of results is closely related to our ability to integrate fragments of speech into a coherent stream.
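
The time-frequency segmentation described above can be pictured as a grid of patches that are alternately kept and discarded. The following is a minimal sketch of such a grid only, not the stimulus-generation procedure of Ueda et al. (2021). The function names `checkerboard_mask` and `expand_to_frames` are hypothetical, and the choice of which patch is kept first is an assumption made for illustration.

```python
import numpy as np


def checkerboard_mask(n_bands: int, n_segments: int) -> np.ndarray:
    """Boolean grid of shape (n_bands, n_segments); True marks a retained patch.

    Assumption: the patch in the first band and first segment is retained,
    and retention alternates along both the frequency and the time axis.
    """
    band_idx = np.arange(n_bands)[:, None]       # column vector of band indices
    seg_idx = np.arange(n_segments)[None, :]     # row vector of segment indices
    return (band_idx + seg_idx) % 2 == 0


def expand_to_frames(mask: np.ndarray, frames_per_segment: int) -> np.ndarray:
    """Repeat each segment column so the grid lines up with analysis frames."""
    return np.repeat(mask, frames_per_segment, axis=1)


# Four frequency bands and six temporal segments (illustrative numbers).
mask = checkerboard_mask(n_bands=4, n_segments=6)
print(mask.astype(int))

# With a single band the grid reduces to on/off alternation in time,
# which corresponds to interrupted speech.
interrupted = checkerboard_mask(n_bands=1, n_segments=6)
print(interrupted.astype(int))
```

With an even number of bands, every segment retains half of the bands, so some speech energy is always present. With a single band, every other segment is silent across the whole spectrum.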

Figure 3. This time the sentences are interrupted by a checkerboard pattern, which erases about half of the area. It can be easily imagined that the effect of erasing largely depends on the size of the checkerboard pattern. The original sentences were taken from Sagan (1996).

Figure 4. A checkerboard pattern masks the sentences. The original sentences were taken from Sagan (1996).

Figure 5. Spectrogram examples of the stimuli used by Ueda, Kawakami, and Takeichi (2021). The darkness shows energy concentration. (a) Original speech stimuli, (b) interrupted speech stimuli with the 80-ms segment duration, (c) checkerboard speech stimuli with the four frequency bands and 80-ms segment duration, and (d) mosaicked checkerboard speech stimuli with the same time-frequency segmentation ("mosaicked" means that the power within each patch was averaged and the patch was filled with the noise of the averaged power). From Ueda et al. (2021).

Figure 6. The inside of the sound-attenuated booth. By courtesy of Research & Development Center for Five-Sense Devices, Kyushu University.

Figure 7. Results by Ueda et al. (2021). Mean percentages of mora accuracy (n = 20) for the interrupted speech and checkerboard speech stimuli as a function of segment duration and number of frequency bands. "Mora" is a syllable-like unit in Japanese. Error bars reflect the standard error of the mean (SEM). From Ueda et al. (2021).

It is known that the pitch of speech is a cue for integrating speech segments (Apoux and Healy, 2013; Clarke, Başkent, and Gaudrain, 2016). If this pitch cue is removed by mosaicking (that is, the power within each patch is averaged and the patch is filled with noise of the averaged power; Fig. 8b; Nakajima et al., 2018; Eguchi et al., 2022), the intelligibility of interrupted mosaic speech stimuli (Fig. 8c) declines sharply (Ueda et al., 2021; Ueda et al., 2022). However, we have shown that the reduced intelligibility of the interrupted mosaic speech stimuli can be recovered (Fig. 9) by stretching the mosaic segments and shrinking the silent gaps (Fig. 8d, e; Ueda et al., 2022). We conjecture that the recovery is brought about by auditory grouping, which works on the Gestalt principle of proximity. Thus, these investigations shed light on how our auditory system organizes speech fragments and perceives speech under challenging conditions.
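
The averaging step of mosaicking can be illustrated on a small power matrix. This is a simplified sketch under stated assumptions: it works on an array of power values (frequency bins by time frames) and only replaces each patch with its mean power. It does not synthesize the noise that carries that power in the actual stimuli. The function name `mosaic_power` and its parameters are hypothetical.

```python
import numpy as np


def mosaic_power(power, band_edges, frames_per_segment):
    """Replace every time-frequency patch with its mean power.

    power: array of shape (n_bins, n_frames) holding power values.
    band_edges: increasing bin indices; band k covers rows
        band_edges[k] up to (not including) band_edges[k + 1].
    frames_per_segment: number of frames in one temporal segment.
    Rows outside the listed bands are set to zero (an assumption here).
    """
    power = np.asarray(power, dtype=float)
    out = np.zeros_like(power)
    n_frames = power.shape[1]
    for lo, hi in zip(band_edges[:-1], band_edges[1:]):
        for start in range(0, n_frames, frames_per_segment):
            stop = min(start + frames_per_segment, n_frames)
            # Flatten the detail inside the patch to a single mean value.
            out[lo:hi, start:stop] = power[lo:hi, start:stop].mean()
    return out


# Toy input: 4 frequency bins by 4 frames, two bands, two frames per segment.
toy = np.arange(16).reshape(4, 4)
mosaic = mosaic_power(toy, band_edges=[0, 2, 4], frames_per_segment=2)
print(mosaic)
```

Within each patch the fine spectral and temporal detail is flattened while the total power of the patch is preserved. This is consistent with the removal of the pitch cue described above.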

Figure 8. Spectrogram examples of the stimuli used by Ueda, Takeichi, and Wakamiya (2022), depicting (a) the original speech sample, (b) mosaic speech with four frequency bands (passbands of 50-570, 570-1600, 1600-3400, and 3400-7000 Hz) and the 40-ms segment duration, including 5-ms root-of-raised-cosine ramps in amplitude, (c) interrupted mosaic speech, (d) interrupted and stretched mosaic speech in which each segment was stretched in time by a factor of 1.5 and the silent gaps were shrunk by a factor of 0.5, and (e) interrupted and stretched mosaic speech in which each segment was stretched in time by a factor of 2.0 and the silent gaps were removed. From Ueda et al. (2022).
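
The stretching factors quoted in Figure 8 can be checked with simple arithmetic. The sketch below assumes that each silent gap originally has the same duration as a segment and that the onset-to-onset period is held constant when segments are stretched. Both assumptions are inferred from the factors in the caption rather than stated there, and the function name `stretched_timing` is hypothetical.

```python
def stretched_timing(segment_ms, stretch_ratio, gap_ms=None):
    """Return (new_segment_ms, new_gap_ms) after stretching each segment.

    Assumptions (not taken from the source): the original gap equals the
    segment duration unless gap_ms is given, and the onset-to-onset period
    (segment + gap) stays fixed, so the gap shrinks as the segment grows.
    """
    if gap_ms is None:
        gap_ms = segment_ms
    period = segment_ms + gap_ms
    new_segment = segment_ms * stretch_ratio
    new_gap = period - new_segment
    if new_gap < 0:
        raise ValueError("stretched segment would exceed the period")
    return float(new_segment), float(new_gap)


# 40-ms segments as in Figure 8, with the stretching factors from panels (d) and (e).
for ratio in (1.0, 1.5, 2.0):
    seg, gap = stretched_timing(40, ratio)
    print(f"ratio {ratio}: segment {seg} ms, gap {gap} ms")
```

Under these assumptions, a factor of 1.5 halves the gap and a factor of 2.0 removes it, matching panels (d) and (e).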

Figure 9. The mora accuracy in percentages as a function of stretching ratio and number of frequency bands for the stimuli with the 20-ms original segment duration (n = 12). From Ueda et al. (2022).

References

▪Apoux, F., & Healy, E. W. (2013). A Glimpsing Account of the Role of Temporal Fine Structure Information in Speech Recognition. In B. C. J. Moore, R. D. Patterson, I. M. Winter, R. P. Carlyon, & H. E. Gockel (Eds.), Basic Aspects of Hearing: Physiology and Perception (Vol. 787, pp. 119-126). New York, NY: Springer.

▪Clarke, J., Başkent, D., & Gaudrain, E. (2016). Pitch and spectral resolution: A systematic comparison of bottom-up cues for top-down repair of degraded speech. The Journal of the Acoustical Society of America, 139(1), 395-405. doi:10.1121/1.4939962

▪Eguchi, H., Ueda, K., Remijn, G. B., Nakajima, Y., & Takeichi, H. (2022). The common limitations in auditory temporal processing for Mandarin Chinese and Japanese. Scientific Reports, 12(1), 3002. doi:10.1038/s41598-022-06925-x

▪Miller, G. A., & Licklider, J. C. R. (1950). The intelligibility of interrupted speech. The Journal of the Acoustical Society of America, 22(2), 167-173. doi:10.1121/1.1906584

▪Nakajima, Y., Matsuda, M., Ueda, K., & Remijn, G. B. (2018). Temporal resolution needed for auditory communication: Measurement with mosaic speech. Frontiers in Human Neuroscience, 12, 149. doi:10.3389/fnhum.2018.00149

▪Sagan, C. (1996). The Demon-Haunted World: Science as a candle in the dark. New York: Ballantine Books.

▪Ueda, K., & Nakajima, Y. (2017). An acoustic key to eight languages/dialects: Factor analyses of critical-band-filtered speech. Scientific Reports, 7, 42468. doi:10.1038/srep42468

▪Ueda, K., Kawakami, R., & Takeichi, H. (2021). Checkerboard speech vs interrupted speech: Effects of spectrotemporal segmentation on intelligibility. JASA Express Letters, 1(7), 075204. doi:10.1121/10.0005600

▪Ueda, K., Takeichi, H., & Wakamiya, K. (2022). Auditory grouping is necessary to understand interrupted mosaic speech stimuli. The Journal of the Acoustical Society of America, 152(2), 970-980. doi:10.1121/10.0013425
