Creating Future Voice Communication Technologies Using AI

Department of Communication Design Science, Faculty of Design
Research Center for Applied Perceptual Science
Professor Tokihiko Kaburagi

We live our lives surrounded by sound. People process the sounds reaching their ears in real time and extract meaningful information from them. Thanks to this instantaneous information processing, we can communicate extremely efficiently through sound. Foremost among these forms of communication is spoken language (the voice), which carries the meaning of words, emotions, and the individuality of the speaker. The voice can convey all of this information to the listener simultaneously and instantaneously.

My laboratory mainly studies how human voice communication works from the perspective of the speaker transmitting information. Various mechanisms in our bodies are involved in the act of speaking; in particular, the organs along the respiratory tract, including the mouth, throat, and lungs, are deeply involved. To understand the mechanisms of human speech, we place emphasis on observing the states of the mouth and throat and representing them with mathematical models. The movements of the mouth (lips, tongue, lower jaw, soft palate, etc.) are essential to spoken language, but the movements of the tongue and soft palate are difficult to observe directly from outside the body. We have therefore developed magnetic sensors that measure the movements of these speech organs in three dimensions and use them in our research (Figure 1). This is Japan's leading motion-capture system specialized for measuring the motion of the speech organs.

Figure 1: Electromagnetic sensor for observing the motion of the speech organs in three dimensions
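
As a minimal illustration of how such three-dimensional sensor recordings can be handled numerically, the Python sketch below computes the speed of one hypothetical sensor coil from a stream of 3D coordinates. The sampling rate, array layout, and threshold are assumptions made for the sketch, not details of the laboratory's actual system.

```python
import numpy as np

# Hypothetical settings; the laboratory's actual system may differ.
SAMPLE_RATE_HZ = 250  # assumed sampling rate of the magnetic sensor

# traj: (n_samples, 3) array of x, y, z positions (mm) of one sensor
# coil, e.g., attached to the tongue tip. Random-walk data stands in
# for a real recording here.
rng = np.random.default_rng(0)
traj = np.cumsum(rng.normal(scale=0.05, size=(1000, 3)), axis=0)

# Velocity by finite differences, then speed (mm/s) as the vector norm.
velocity = np.gradient(traj, 1.0 / SAMPLE_RATE_HZ, axis=0)
speed = np.linalg.norm(velocity, axis=1)

# Articulatory gestures are often located as intervals of high speed.
threshold = speed.mean() + speed.std()
moving = speed > threshold
print(f"frames above threshold: {moving.sum()} / {len(speed)}")
```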

Additionally, in recent years we have broadened the focus of our research to include nonverbal, prosodic communication through sound, and we are also studying various singing-voice styles and the interaction between wind instruments and their players. As shown in Figure 2, MRI (Magnetic Resonance Imaging) allows cross-sectional imaging of any part of the body. In the overtone playing technique on the saxophone, pitch is changed without changing the fingering of the keys; we found that, to achieve this, the performer actively adjusts the shape, and hence the acoustic characteristics, of the oral cavity. Learning to play a musical instrument is thought to rely largely on auditory feedback, that is, on the interaction between the human motor and auditory systems; this result provides a scientific foundation for such empirically acquired performance skills.

Figure 2: Vocal-tract images of a saxophone player taken with an MRI device
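
As a back-of-the-envelope illustration of why the shape of a cavity changes its acoustic characteristics, the sketch below computes the resonance frequencies of an idealized tube closed at one end. A real oral cavity is far more complex; the formula and numbers here are textbook acoustics, not measurements from this study.

```python
SPEED_OF_SOUND = 343.0  # m/s in air at room temperature

def closed_open_resonances(length_m: float, n: int = 3) -> list[float]:
    """Resonance frequencies (Hz) of an idealized tube closed at one end.

    Such a tube resonates at odd quarter-wavelength multiples:
    f_k = (2k - 1) * c / (4L). A crude stand-in for how changing the
    effective cavity length/shape shifts its acoustic response.
    """
    return [(2 * k - 1) * SPEED_OF_SOUND / (4 * length_m)
            for k in range(1, n + 1)]

# A cavity of ~17 cm gives a first resonance near 500 Hz; shortening
# or lengthening the cavity shifts every resonance accordingly.
print([round(f) for f in closed_open_resonances(0.17)])
```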

Below, I introduce the studies we are currently pursuing that use AI (artificial intelligence), specifically machine learning (deep learning).

■Automatic Detection of Pathological Speech using AI

Medical applications of AI have progressed rapidly in recent years. In diagnostic imaging support, for example, technologies that use AI to automatically flag abnormalities in CT, MRI, and endoscopic images are being widely developed to help detect the presence and progression of disease. At our laboratory, we are researching the use of AI to automatically assign a patient's voice a score for each item of the GRBAS scale (see the note below), which doctors and speech-language-hearing therapists use to numerically express the auditory impression of pathological speech (hoarseness). Such an information processing system is expected to be useful not only as a support technology within medical institutions but also for disease screening and early detection. This research is being conducted in collaboration with the Department of Otorhinolaryngology at the university's medical research institute.
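
As a rough sketch of how such automatic scoring can be framed, the following code defines a small network that maps an acoustic feature sequence (here, mel-spectrogram frames) to a 0-3 rating for each of the five GRBAS items, treating each item as a four-class classification problem. The architecture, feature choice, and all dimensions are illustrative assumptions, not the laboratory's actual model.

```python
import torch
import torch.nn as nn

class GRBASScorer(nn.Module):
    """Illustrative model: mel-spectrogram frames -> one 0-3 score per item.

    Treats each GRBAS item (Grade, Rough, Breathy, Asthenic, Strained)
    as a 4-class classification problem. Sizes and architecture are
    assumptions for this sketch, not the published system.
    """

    def __init__(self, n_mels: int = 80, hidden: int = 128):
        super().__init__()
        self.encoder = nn.GRU(n_mels, hidden, batch_first=True,
                              bidirectional=True)
        # Five heads, one per GRBAS item; each outputs 4 class logits (0-3).
        self.heads = nn.ModuleList(
            [nn.Linear(2 * hidden, 4) for _ in range(5)]
        )

    def forward(self, mel: torch.Tensor) -> torch.Tensor:
        # mel: (batch, time, n_mels)
        frames, _ = self.encoder(mel)
        utterance = frames.mean(dim=1)        # average-pool over time
        logits = [head(utterance) for head in self.heads]
        return torch.stack(logits, dim=1)     # (batch, 5 items, 4 classes)

model = GRBASScorer()
dummy = torch.randn(2, 300, 80)               # 2 utterances, 300 frames
scores = model(dummy).argmax(dim=-1)          # predicted 0-3 per item
print(scores.shape)                           # torch.Size([2, 5])
```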

■Voice Replacement Technology Using AI-driven Voice Synthesis

When the larynx is removed due to diseases such as cancer, speech becomes impossible, posing a major hindrance to everyday communication. Typical voice replacement methods in such cases include the electrolarynx, esophageal speech, and shunt speech, but each has its own issues, such as producing an unnatural voice lacking intonation, being difficult to learn, or requiring periodic, expensive surgery. Because the speech organs in the oral cavity are unaffected by laryngeal disease, in this research we are developing an information processing system that synthesizes a voice solely from the movements of the mouth, a kind of "lip syncing," so to speak.

Figure 3 shows the movement patterns of the speech organs measured with the magnetic sensors, a spectrogram of synthesized speech, and the spectrogram of a real person speaking the same sentence. To generate synthesized speech, both the acoustic characteristics of the oral cavity and information about the sound source (voice pitch, volume, the distinction between voiced and unvoiced sounds, etc.) must somehow be reflected in the output. The difficulty of "lip-sync" voice synthesis is that there is virtually no direct causal relation between the movements of the speech organs and this sound-source information. In this research, we tackle the problem by structurally incorporating medium- to long-term time-series information about speech organ movements into a deep neural network, as sketched below. Since AI makes it possible to control the synthesized sound flexibly, our ultimate aim is for laryngectomy patients to be able to synthesize speech with the qualities of their own original voice.

Figure 3: Movement patterns of the speech organs (left), spectrogram of synthesized speech (center), and spectrogram of the original speech (right)
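
To make the mapping problem concrete, the sketch below shows one common way to give a network medium- to long-term temporal context: a bidirectional recurrent model that predicts each acoustic frame from articulator movements both before and after the current instant. It is an assumption-laden illustration of the general approach, not the architecture used in this research.

```python
import torch
import torch.nn as nn

class ArticulatoryToAcoustic(nn.Module):
    """Sketch: articulator trajectories -> spectrogram frames.

    Input:  (batch, time, n_channels) sensor coordinates, e.g., x/y/z
            positions of coils on the lips, tongue, and jaw.
    Output: (batch, time, n_mels) predicted mel-spectrogram frames.
    A bidirectional LSTM lets each output frame depend on articulatory
    context both before and after it, one way to inject the medium- to
    long-term time-series information the text describes. All sizes
    are illustrative assumptions.
    """

    def __init__(self, n_channels: int = 18, n_mels: int = 80,
                 hidden: int = 256):
        super().__init__()
        self.rnn = nn.LSTM(n_channels, hidden, num_layers=2,
                           batch_first=True, bidirectional=True)
        self.proj = nn.Linear(2 * hidden, n_mels)

    def forward(self, articulation: torch.Tensor) -> torch.Tensor:
        context, _ = self.rnn(articulation)   # (batch, time, 2 * hidden)
        return self.proj(context)             # (batch, time, n_mels)

model = ArticulatoryToAcoustic()
coils = torch.randn(1, 500, 18)   # 500 frames of 6 coils x 3 coordinates
mel = model(coils)
print(mel.shape)                  # torch.Size([1, 500, 80])
```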

Note: The GRBAS Scale
A method of auditory-psychological evaluation of hoarseness in which five items are each rated on a four-point scale (0-3): Grade, the overall extent of hoarseness; Rough, the roughness of the voice; Breathy, its breathiness; Asthenic, a lack of force; and Strained, the degree of strain.

■Inquiries
Department of Communication Design Science, Faculty of Design, Kyushu University
Professor Tokihiko Kaburagi