Section 5.2. Possible applications of the GEMS signal and the excitation function
Now that we have demonstrated the ability to generate excitation functions for the vocal tract using the GEMS signal, we discuss the possible applications in which the GEMS might prove useful in the sciences of speech and audio processing.
The GEMS detects the motion of the trachea caused by changing subglottal pressure, and that pressure changes only when the folds are operating and modulating the airflow. At its very simplest, then, the GEMS can detect when a person is using their vocal folds. This detection is unaffected by acoustic noise and can tell a speech processor (such as one in a cell phone or speech recognition program) when the person is or is not speaking. It is far more reliable than a simple squelch control and could markedly extend battery life, as cell phones would not have to process audio signals that do not contain speech.
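As a minimal sketch of this kind of voicing detection, the example below flags frames of a GEMS-like signal whose RMS level exceeds a threshold. The function name `voiced_frames`, the fixed frame length, and the threshold value are illustrative assumptions, not the detector used in this work, and the test signal is synthetic.

```python
import numpy as np

def voiced_frames(gems, frame_len, threshold):
    """Return one boolean per frame: True where the frame RMS exceeds threshold.

    Hypothetical helper: frame_len (samples) and threshold are assumed,
    user-chosen values, not parameters taken from the text.
    """
    gems = np.asarray(gems, dtype=float)
    n_frames = len(gems) // frame_len
    # Drop any trailing partial frame and split into equal-length frames.
    frames = gems[:n_frames * frame_len].reshape(n_frames, frame_len)
    rms = np.sqrt(np.mean(frames ** 2, axis=1))
    return rms > threshold

# Synthetic stand-in for a GEMS recording: silence, a voiced burst, silence.
burst = np.sin(2 * np.pi * np.arange(300) / 100)
demo_signal = np.concatenate([np.zeros(300), burst, np.zeros(300)])
demo_flags = voiced_frames(demo_signal, frame_len=100, threshold=0.1)
```

A speech processor could use such flags to skip frames in which the folds are not vibrating. A practical threshold would have to be set from the sensor's own noise floor.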
The GEMS signal has a sharp drop through zero at the time when the folds are observed to close. As this is defined as the beginning of the voiced glottal cycle, the GEMS signal may be easily used to detect the beginning of the cycle. This has many advantages.
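The cycle-onset detection described above can be sketched as a search for negative-going zero crossings. This is only an illustration under simplifying assumptions: the signal is taken to be zero-mean and clean, the name `find_cycle_onsets` is hypothetical, and a sinusoid stands in for a real GEMS recording. Real data would likely need filtering and a check on the steepness of the drop.

```python
import numpy as np

def find_cycle_onsets(gems):
    """Return sample indices where the signal drops through zero.

    An onset is reported at index k+1 when sample k is >= 0 and
    sample k+1 is < 0 (a negative-going zero crossing).
    """
    gems = np.asarray(gems, dtype=float)
    crossing = (gems[:-1] >= 0.0) & (gems[1:] < 0.0)
    return np.flatnonzero(crossing) + 1

# Synthetic stand-in: a 100 Hz sinusoid sampled at 10 kHz (assumed values).
fs = 10000
t = np.arange(500) / fs
demo_gems = np.sin(2 * np.pi * 100 * t + 0.1)
demo_onsets = find_cycle_onsets(demo_gems)
```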
The first is for data parsing (or windowing), in which the data is cut into sections for processing. Current technology uses blind windows, which are the same length regardless of the type or pitch of speech. Windows of 30-40 milliseconds are common, with a 10-20 millisecond overlap between windows. Such intensive processing is needed because the placement of the glottal cycle is not known.
With the GEMS, data may be cut into windows consisting of n glottal cycles, where n is any positive integer. Because we know exactly where each cycle begins and ends, we may use adaptive windowing, in which the window is sized according to the pitch of the cycles under study. We have found it most useful to use windows that are 2 glottal cycles long: the vocal tract changes little over this period (10-20 msec), and 10-20 msec allows for adequate resolution in our Fourier transforms. In addition, the folds may start and stop vibrating in only 2-3 cycles, so a two-cycle window allows us to detect speech onset and offset reliably. As a bonus, Fourier transforms are more accurate and stable when the signal under transformation contains a whole number of periods of the fundamental frequency.
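The adaptive windowing described above can be sketched as follows, assuming the cycle onsets are already available as sample indices. The names `cycle_windows`, `n_cycles`, and `hop_cycles` are hypothetical, and the one-cycle hop between windows is an assumption rather than a choice stated in the text.

```python
import numpy as np

def cycle_windows(signal, onsets, n_cycles=2, hop_cycles=1):
    """Cut signal into windows that each span n_cycles glottal cycles.

    onsets: sample indices of cycle beginnings (assumed known).
    Consecutive windows start hop_cycles cycles apart.
    """
    signal = np.asarray(signal, dtype=float)
    windows = []
    for i in range(0, len(onsets) - n_cycles, hop_cycles):
        start, stop = onsets[i], onsets[i + n_cycles]
        windows.append(signal[start:stop])
    return windows

# Synthetic stand-in: 100 Hz tone at 10 kHz, so each cycle is 100 samples.
fs = 10000
audio = np.sin(2 * np.pi * 100 * np.arange(500) / fs)
onsets = np.arange(0, 500, 100)
demo_windows = cycle_windows(audio, onsets, n_cycles=2)

# Each window holds a whole number of periods, so the spectrum of the
# tone is concentrated in a single FFT bin instead of leaking.
demo_spectrum = np.abs(np.fft.rfft(demo_windows[0]))
```

In this synthetic case every window is exactly two periods long, which is the condition under which the text notes that Fourier transforms are more accurate and stable.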
Another application that could utilize the glottal cycle location information is the measurement of pitch. Appendix C contains a paper under consideration for publication by the IEEE Speech and Audio Processing journal. It is entitled "Accurate and noise-robust pitch extraction using low power electromagnetic sensors" and contains many examples of how the GEMS-derived pitch is far superior to any other process in use today. In essence, the GEMS-derived pitch is immune to acoustic noise, far more accurate, and demands about 1% of the processing power required for conventional speech recognition. As the pitch may be measured very accurately as often as every glottal cycle, very small changes in pitch may be distinguished. This would be useful in measuring vibrato (nominally a < 3% change in amplitude at a frequency of 4-6 Hz, Titze 1994) as well as detecting voice disorders that manifest themselves as minute changes in pitch.
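As a rough illustration of why cycle-by-cycle pitch measurement is computationally cheap, the pitch of each cycle is simply the reciprocal of the time between successive cycle onsets. The sketch below assumes the onsets are given as sample indices; the name `pitch_per_cycle` and the sample values are illustrative only, and this is not the algorithm of the paper in Appendix C.

```python
import numpy as np

def pitch_per_cycle(onsets, fs):
    """Return one pitch estimate (Hz) per glottal cycle.

    onsets: sample indices of successive cycle beginnings (assumed known).
    fs: sampling rate in Hz.
    """
    periods = np.diff(np.asarray(onsets, dtype=float)) / fs  # seconds per cycle
    return 1.0 / periods

# Hypothetical onsets: the last cycle is one sample longer than the others,
# which appears as a small drop in the final pitch estimate.
demo_f0 = pitch_per_cycle([0, 100, 200, 301], fs=10000)
```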
We may use the transfer function calculated with the excitation function in many ways to improve speech and audio processing and coding. One of the simplest would be the calculation of formants from the transfer function. The location of the first 2 or 3 formants would specify what phoneme was being voiced during the window (see Figure 2.20 for a graphic of 10 vowels and their first two formant locations, and see the previous section for examples of transfer functions and formant locations).
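One common way to read formants off a transfer function is to take the angles of its poles. The sketch below assumes the transfer function is available as a digital filter whose denominator coefficients are `a`; that representation, the name `formants_from_denominator`, and the test resonances at 500 Hz and 1500 Hz are assumptions made for illustration.

```python
import numpy as np

def formants_from_denominator(a, fs):
    """Estimate formant frequencies and bandwidths (Hz) from filter poles.

    a: denominator coefficients of a digital transfer function (assumed form).
    fs: sampling rate in Hz.
    """
    poles = np.roots(a)
    poles = poles[np.imag(poles) > 0]            # keep one pole of each conjugate pair
    freqs = np.angle(poles) * fs / (2 * np.pi)   # pole angle -> frequency
    bandwidths = -np.log(np.abs(poles)) * fs / np.pi  # pole radius -> bandwidth
    order = np.argsort(freqs)
    return freqs[order], bandwidths[order]

def resonance_pole(freq, bandwidth, fs):
    """Build the upper-half-plane pole for a resonance (test helper)."""
    return np.exp(-np.pi * bandwidth / fs) * np.exp(2j * np.pi * freq / fs)

# Build a denominator with known resonances and recover them.
fs = 10000
p1 = resonance_pole(500.0, 80.0, fs)
p2 = resonance_pole(1500.0, 80.0, fs)
a = np.real(np.poly([p1, np.conj(p1), p2, np.conj(p2)]))
demo_freqs, demo_bws = formants_from_denominator(a, fs)
```

The recovered frequencies could then be compared against a table of known formant locations to label the phoneme in the window.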
One application, speech recognition, could be enhanced by the phoneme detection described above. The phoneme information could be used on its own to build a recognizer, or it could be used to supplement present designs. For phoneme detection, only a small number of poles and zeros are necessary to capture the gross structure of the transfer function. For the applications below, it may be necessary to use more poles and zeros in order to capture the nuances of the transfer function structure.
The calculated transfer functions could also be used for security applications such as speaker verification or identification. Speaker verification confirms the claimed identity of a subject through the parameters associated with his or her voice. Speaker identification identifies a speaker through speech parameters without being told who the subject is. Both methods are currently susceptible to "spoofing" through the use of high quality audio recorders. Since the GEMS measures physical tissue vibration, it is immune to spoofing in such a manner and can be considered a biometric (body-measuring) sensor. It might be that the difference in GEMS signal between individuals could be used as a tool for identification. It is also possible to compare transfer functions for phonemes or words to establish identity. How well this might work depends on both the repeatability of the GEMS measurements and the consistency of an individual's return. These factors are being evaluated now by a colleague, Todd J. Gable, as he prepares for his thesis on the topic of speaker verification using the GEMS.
The next possible application is that of speaker synthesis. I have purposely used the word "speaker" instead of "speech", as we are able to capture the details of a speaker's transfer functions by the use of a sufficient number (around 14) of poles and zeros. We may then reproduce the speech by filtering the excitation function with the transfer function. A library of phonetic transfer functions for an individual may be composed and used with a phonetic speech synthesizer to reproduce an individual's speech. The accuracy of the reproduction would depend only on the accuracy and stability of the transfer functions.
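Filtering an excitation with a pole-zero transfer function can be sketched with a direct difference equation. The example assumes the transfer function is expressed as digital filter coefficients `b` (zeros) and `a` (poles); that form and the name `filter_excitation` are assumptions, and the pulse train is a crude stand-in for a GEMS-derived excitation.

```python
import numpy as np

def filter_excitation(b, a, excitation):
    """Apply a pole-zero filter to an excitation signal.

    Implements y[n] = (sum_k b[k] x[n-k] - sum_{k>=1} a[k] y[n-k]) / a[0].
    b, a: numerator and denominator coefficients (assumed representation).
    """
    x = np.asarray(excitation, dtype=float)
    a0 = float(a[0])
    b = np.asarray(b, dtype=float) / a0
    a = np.asarray(a, dtype=float) / a0
    y = np.zeros(len(x))
    for n in range(len(x)):
        acc = 0.0
        for k in range(len(b)):          # feed-forward (zeros)
            if n - k >= 0:
                acc += b[k] * x[n - k]
        for k in range(1, len(a)):       # feedback (poles)
            if n - k >= 0:
                acc -= a[k] * y[n - k]
        y[n] = acc
    return y

# Illustrative excitation: one pulse per 100-sample "glottal cycle".
pulses = np.zeros(400)
pulses[::100] = 1.0
# Illustrative single resonance near 500 Hz at a 10 kHz sampling rate.
r, theta = 0.97, 2 * np.pi * 500 / 10000
demo_speech = filter_excitation([1.0], [1.0, -2 * r * np.cos(theta), r * r], pulses)
```

A phonetic synthesizer built along these lines would swap in a stored set of coefficients for each phoneme while reusing the same excitation.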
In a similar manner, it appears possible to synthesize musical instruments. The excitation function would be calculated from the position vs. time of a suitable object (such as the string of a violin), and the transfer function of the instrument would then be calculated.
Finally, a possible field of applications for the excitation/transfer function information is voice coders (vocoders). These are the devices and algorithms that digitize speech for digital transmission. The GEMS-derived excitation would be parameterized and transmitted. As it does not change for the same register and intensity of voice, it would need to be updated infrequently. Coded speech segments would require transmission of only the pitch and the transfer function parameters for recombination and synthesis at the receiver.