Department of Acoustic Design, Faculty of Design
Kazuo UEDA, Associate Professor
We frequently encounter situations in which we have to hear out target speech that is partly masked by other sounds. Visual examples help to explain such a case. Reading sentences becomes extremely difficult when about half of each character is erased (Fig. 1). By contrast, if the same sentences are masked by black bars, you will have much less trouble reading them (Fig. 2). A situation in which part of a target is covered by another object is familiar to us, and the brain automatically restores the covered part if possible. A similar effect has been found in audition (interrupted speech with masking noise) and was named the picket fence effect (Miller and Licklider, 1950). A caveat is in order here, however: the horizontal axis in the visual examples has to be converted into time in the auditory examples. Thus, the nature of the horizontal axis differs between vision and audition.
Going back to the visual examples, what happens if the interruption or masking occurs not only in the horizontal direction but also in the vertical direction? Examples of erasing and masking with a checkerboard pattern are shown in Figures 3 and 4. You may find them difficult to read as well. At the same time, you can easily imagine that the readability of such examples depends largely on the size of the checkerboard squares. In audition, the counterpart of the horizontal axis in the visual examples is time, and the counterpart of the vertical axis is frequency (Fig. 5). However, because the nature of the axes in audition differs from that in vision, it is not easy to predict how we perceive auditory stimuli erased or masked by a checkerboard pattern. Therefore, we conducted the first systematic study on the intelligibility of checkerboard speech stimuli with meaningful sentences, varying the segment duration and the number of frequency bands (Fig. 6; Ueda et al., 2021). The results were surprising (Fig. 7). When the frequency axis was divided into 20 frequency bands, the intelligibility of the checkerboard speech stimuli was perfect, irrespective of the segment duration. However, when the frequency axis was divided into 2 or 4 frequency bands (Ueda and Nakajima, 2017), the intelligibility was generally lower than that of the interrupted speech stimuli with the corresponding segment duration, and it reached its minimum at the 160-ms segment duration. This pattern of results is closely related to our ability to integrate fragments of speech into a coherent stream.
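The checkerboard pattern over time and frequency can be sketched in a few lines of code. The following is only an illustration under stated assumptions, not the stimulus-generation procedure of the cited study: it works on a generic time-frequency grid (frequency bins by time frames), splits the bins into equal-width bands by index for simplicity, and uses the hypothetical names `checkerboard_mask`, `n_bands`, and `seg_frames`.

```python
import numpy as np

def checkerboard_mask(n_bins, n_frames, n_bands, seg_frames):
    """Return a boolean mask of shape (n_bins, n_frames); True marks kept patches.

    Assumption for illustration: bands are equal-width in bin index, and
    segment duration is expressed as a number of frames.
    """
    # Band index of each frequency bin (0 .. n_bands - 1).
    band_idx = (np.arange(n_bins) * n_bands) // n_bins
    # Segment index of each time frame.
    seg_idx = np.arange(n_frames) // seg_frames
    # Keep a patch when band index and segment index have the same parity,
    # which yields the alternating checkerboard layout.
    return (band_idx[:, None] + seg_idx[None, :]) % 2 == 0

# With two bands, the kept patches alternate between the lower and upper band.
print(checkerboard_mask(4, 4, n_bands=2, seg_frames=2).astype(int))
```

With `n_bands=1` the same function reduces to plain interruption in time, which corresponds to the interrupted speech condition used as the comparison.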
It is known that the pitch of speech serves as a cue for integrating speech segments (Apoux and Healy, 2013; Clarke, Başkent, and Gaudrain, 2016). If this pitch cue is removed by mosaicking (that is, the power within each patch is averaged and the patch is filled with noise of the averaged power; Fig. 8b; Nakajima et al., 2018; Eguchi et al., 2022), the intelligibility of interrupted mosaic speech stimuli (Fig. 8c) declines sharply (Ueda et al., 2021; Ueda et al., 2022). However, we have shown that the reduced intelligibility of the interrupted mosaic speech stimuli can be recovered (Fig. 9) by stretching the mosaic segments and shrinking the silent gaps (Fig. 8d, e; Ueda et al., 2022). We conjecture that the recovery is brought about by auditory grouping, which works on the Gestalt principle of proximity. Thus, these investigations shed light on how our auditory system organizes speech fragments and perceives speech under challenging conditions.
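The averaging step of mosaicking can be illustrated with a short sketch. It assumes a band-by-frame power array as input and covers only the power averaging within each patch. The subsequent step of filling each patch with noise of that power is omitted, and the name `mosaic_power` and its parameters are hypothetical rather than taken from the cited papers.

```python
import numpy as np

def mosaic_power(power, seg_frames):
    """Replace every (band, segment) patch by its mean power.

    power: array of shape (n_bands, n_frames) holding power per band and frame
           (an assumed input representation for this illustration).
    seg_frames: segment duration expressed as a number of frames.
    """
    n_bands, n_frames = power.shape
    out = np.empty_like(power, dtype=float)
    for start in range(0, n_frames, seg_frames):
        stop = min(start + seg_frames, n_frames)
        # Average over the frames of this segment, separately for each band.
        out[:, start:stop] = power[:, start:stop].mean(axis=1, keepdims=True)
    return out

example = np.array([[1.0, 3.0, 5.0, 7.0],
                    [2.0, 2.0, 4.0, 0.0]])
print(mosaic_power(example, seg_frames=2))
```

Because each patch keeps only its mean power, the fine temporal and spectral detail inside the patch is discarded while the total power of the array is preserved.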
▪Apoux, F., & Healy, E. W. (2013). A Glimpsing Account of the Role of Temporal Fine Structure Information in Speech Recognition. In B. C. J. Moore, R. D. Patterson, I. M. Winter, R. P. Carlyon, & H. E. Gockel (Eds.), Basic Aspects of Hearing: Physiology and Perception (Vol. 787, pp. 119-126). New York, NY: Springer.
▪Clarke, J., Başkent, D., & Gaudrain, E. (2016). Pitch and spectral resolution: A systematic comparison of bottom-up cues for top-down repair of degraded speech. The Journal of the Acoustical Society of America, 139(1), 395-405. doi:10.1121/1.4939962
▪Eguchi, H., Ueda, K., Remijn, G. B., Nakajima, Y., & Takeichi, H. (2022). The common limitations in auditory temporal processing for Mandarin Chinese and Japanese. Scientific Reports, 12(1), 3002. doi:10.1038/s41598-022-06925-x
▪Miller, G. A., & Licklider, J. C. R. (1950). The intelligibility of interrupted speech. The Journal of the Acoustical Society of America, 22(2), 167-173. doi:10.1121/1.1906584
▪Nakajima, Y., Matsuda, M., Ueda, K., & Remijn, G. B. (2018). Temporal resolution needed for auditory communication: Measurement with mosaic speech. Frontiers in Human Neuroscience, 12, 149. doi:10.3389/fnhum.2018.00149
▪Sagan, C. (1996). The Demon-Haunted World: Science as a candle in the dark. New York: Ballantine Books.
▪Ueda, K., & Nakajima, Y. (2017). An acoustic key to eight languages/dialects: Factor analyses of critical-band-filtered speech. Scientific Reports, 7, 42468. doi:10.1038/srep42468
▪Ueda, K., Kawakami, R., & Takeichi, H. (2021). Checkerboard speech vs interrupted speech: Effects of spectrotemporal segmentation on intelligibility. JASA Express Letters, 1(7), 075204. doi:10.1121/10.0005600
▪Ueda, K., Takeichi, H., & Wakamiya, K. (2022). Auditory grouping is necessary to understand interrupted mosaic speech stimuli. The Journal of the Acoustical Society of America, 152(2), 970-980. doi:10.1121/10.0013425