Appendix C.

Accurate and noise-robust pitch extraction using low power electromagnetic sensors

G.C. Burnett, T.J. Gable, L.C. Ng, and J.F. Holzrichter

Lawrence Livermore National Laboratory, POB 808, Livermore CA 94551

Abstract: A new, noninvasive, safe, and robust method of pitch estimation has been developed at the Lawrence Livermore National Laboratory (LLNL) utilizing Glottal Electromagnetic Micropower Sensors (GEMS). They operate in the microwave regime of the EM spectrum at a peak power of less than 1 milliwatt and use a field-disturbance mode of reception in which signals are obtained only from moving tissue. Research has shown that the GEMS signal is strongly correlated with the structure responsible for pitch (the vocal folds), making the signal an excellent source for extremely accurate pitch extraction. The accuracy of the GEMS sensor and corresponding algorithm is validated using tuning forks, synthetic signals, and high-speed video images. It is then compared to two traditional audio-only methods (cepstral and autocorrelation) in normal and noisy environments. This new method using the GEMS and a simple zero-crossing algorithm is shown to be the most accurate, robust, and efficient. These qualities are valuable in many applications such as speaker verification, pitch-synchronous signal processing, noise suppression, pitch training for vocalists, and voice training for the disabled.

EDICS # SA 1.2.2

Corresponding Author: Gregory C. Burnett

Lawrence Livermore National Laboratory

POB 808 L-271

Livermore, CA 94551

Phone: 925-423-3088

Fax: 925-422-7309

Email: burnett5@llnl.gov

I. Introduction

The use of the GEMS in speech applications was first explored by Holzrichter et al. [1]. The authors demonstrated how the GEMS and other EM sensors could be used to measure vocal articulator motion in real time for speech characterization. One particularly interesting speech characteristic (measurable by the GEMS) is the closing of the vocal folds during their oscillation. This event is normally defined to be the beginning of the voiced cycle and is also known as a "glottal close", as the glottis is defined as the space between the folds. The time between one glottal close and the next is referred to as the pitch period, and the reciprocal of the period is the pitch.

A. The Sensor

The GEMS are derived from a general class of Micropower Impulse Radars (MIR), invented by Tom McEwan at LLNL in 1993. Information on the particulars of the operation of this class of sensor may be found in [2] and [3]. The specific sensor used in these experiments was modified to filter out low frequency motion (with a 3-dB frequency of about 70 Hz) and to operate in the near field so as to be sensitive to vocal fold motion-induced vibrations. Other MIR sensors have been modified to detect lower frequencies of motion (down to a few Hz) and these have been used for observation of the jaw and tongue [1]. The GEMS transmits about 10 cycles of a 2.3 GHz wave at a pulse repetition frequency of 2 MHz. It waits a specified (adjustable) time and then mixes the return with a delayed version of the transmitted signal (homodyne detection). The time it waits (the range gate) and the antenna radiation pattern define a "bubble" of sensitivity within which motion is detected. The pulse trains are on the order of a few nanoseconds (ns) long and are repeated every 500 ns, so multiple reflections or overlapping returns from different trains do not cause problems. The GEMS uses a filtering and averaging method so that only changes in the reflectivity of objects in its bubble of sensitivity generate a signal. If there is no motion, there is no AC signal. In essence, the GEMS signal closely mirrors the physical motion of a moving surface if the extent of motion is much less than the wavelength of the pulses (about 13 cm in air for this sensor) and the motion is of sufficiently high frequency (i.e. above ~70 Hz).
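The timing figures quoted above follow from simple arithmetic on the carrier frequency, the number of cycles per burst, and the pulse repetition frequency. The short sketch below reproduces that arithmetic; the function name `gems_timing` and the rounded speed of light are illustrative assumptions, not part of the sensor documentation.

```python
# Back-of-the-envelope timing for the pulse parameters quoted in the text.
C_AIR = 3.0e8  # speed of light in air, m/s (rounded; an assumption of this sketch)


def gems_timing(carrier_hz=2.3e9, cycles=10, prf_hz=2.0e6):
    """Return (wavelength in m, burst length in s, repetition period in s)."""
    wavelength = C_AIR / carrier_hz   # free-space wavelength of the carrier
    burst = cycles / carrier_hz       # duration of one transmitted pulse train
    period = 1.0 / prf_hz             # time between successive pulse trains
    return wavelength, burst, period


wavelength, burst, period = gems_timing()
print(f"wavelength  : {wavelength * 100:.1f} cm")
print(f"burst length: {burst * 1e9:.2f} ns")
print(f"rep. period : {period * 1e9:.0f} ns")
print(f"duty cycle  : {burst / period:.2%}")
```

With these inputs the burst lasts a little over 4 ns and repeats every 500 ns, so each pulse train occupies under 1% of the repetition period, which is why returns from different trains do not overlap.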

B. Use of the sensor in this study

In these experiments, the GEMS antennae (simple rectangular copper foils about 1.5 cm by 0.8 cm) were positioned just below the laryngeal prominence behind a thin (2 mm) plastic case in light contact with the skin (see Figure C.1). The GEMS may be moved farther away from the skin (up to about 6 cm for the present sensor; work is underway to increase that further) and may also be moved up and down the trachea, but to facilitate comparison between individuals the above placement was used.

An example of the GEMS signal along with the corresponding audio is shown in Figure C.2. Notice the rapid fall in the GEMS signal; this is when the vocal folds are in the process of closing. This has been verified through the use of high-speed (3000 frames per second) digital video. A Kodak EktaPro EM 1012 with an Intensified Imager VSG was used with a normal laryngoscope to obtain several 2456-frame videos of the vocal folds in motion. The audio and GEMS signals were recorded simultaneously with the exposure-time signal from the EktaPro controller. At 3000 fps, three frames are recorded in a millisecond. Each frame was exposed for only 30 microseconds, resulting in many clear "snapshots" of the folds for each glottal cycle. Figure C.3 shows the GEMS signal for a single glottal cycle and some of the corresponding video frames. Since the rapid fall occurs as the folds are closing, the point where it crosses zero (the signal is AC coupled) is defined as the beginning of the glottal cycle. The audio signal feature corresponding to the closure of the folds is detected by a conventional condenser microphone about 1.4 ms later. This delay is due to the travel time of sound through the vocal tract and to the microphone (about 50 cm).

II. Pitch calculation algorithms

There are many pitch calculation algorithms in use today [4], [5], [6], but all rely on the audio signal alone to determine pitch. The GEMS provides a simpler way to find pitch through a direct measurement of the motion of physical structures surrounding the vocal folds. The three methods we chose to compare for this paper are a simple zero crossing (for the GEMS signal), autocorrelation (time domain), and cepstral (frequency domain).

A. Zero crossings using the GEMS signal

The zero crossing algorithm is quite simple: the zero crossings are calculated by determining where the normalized GEMS signal changes from positive to negative. There is only a single such crossing per cycle regardless of the phoneme being voiced, as long as the register does not lapse into vocal fry (for an explanation of vocal registers see [7]). The positive to negative crossing is chosen as it defines the beginning of the glottal cycle (see Figure C.3) and is quite linear as it passes through zero, facilitating the use of linear interpolation for increased accuracy. To obtain the pitch, the length of the signal (in samples) from one zero crossing to the next is determined and then translated to time by division by the sampling frequency. In most experiments two glottal cycles are averaged so that the "pitch windows" are two cycles long, but any number of cycles may be used. The period calculated for the two cycles is inverted to get pitch in frequency:

$$ f_0 = \frac{2 f_s}{N_2} $$

where $f_s$ is the sampling frequency and $N_2$ is the (interpolated) length in samples of the two-cycle pitch window.
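The procedure described above can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names `falling_zero_crossings` and `pitch_from_crossings` are hypothetical, the input is assumed to be an already normalized, AC-coupled signal held in a plain list, and the demonstration uses a pure sinusoid in place of a GEMS recording.

```python
import math


def falling_zero_crossings(signal):
    """Fractional sample indices where the signal goes from positive to non-positive."""
    crossings = []
    for n in range(len(signal) - 1):
        a, b = signal[n], signal[n + 1]
        if a > 0.0 and b <= 0.0:
            # The straight line through (n, a) and (n + 1, b) hits zero at n + a / (a - b).
            crossings.append(n + a / (a - b))
    return crossings


def pitch_from_crossings(crossings, fs, cycles=2):
    """Pitch in Hz for successive, non-overlapping groups of `cycles` glottal cycles."""
    pitches = []
    for i in range(0, len(crossings) - cycles, cycles):
        n_samples = crossings[i + cycles] - crossings[i]  # window length in samples
        pitches.append(cycles * fs / n_samples)           # invert the averaged period
    return pitches


# Demonstration on a stand-in signal: 100 ms of a 120 Hz sinusoid sampled at 10 kHz.
fs = 10_000.0
demo = [math.sin(2.0 * math.pi * 120.0 * n / fs) for n in range(1000)]
print(pitch_from_crossings(falling_zero_crossings(demo), fs))
```

Because the crossing time is interpolated between samples, the estimate is not limited to the integer-sample period resolution that peak-picking methods face at the same sampling rate.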

The GEMS pitch algorithm (a block diagram is given in Figure C.4) uses a fixed window of 35 ms to begin processing the signal. Within that 35 ms window, it looks for enough of a signal to indicate that voicing has occurred. It does this by dividing the summed absolute amplitudes of each normalized window by the length of the window in samples to get the average amplitude. If the average amplitude of the window is below a threshold, a zero pitch is assigned (denoting silence or unvoiced speech) and the window is moved to the next 35 ms of data. If, however, the average amplitude exceeds the threshold, the window is considered voiced. The GEMS signal is very stable and quiet due to the bandpass filter between 70 Hz and 7 kHz, so voiced speech detection errors (both false positives and false negatives) are quite rare. Unless the speaker is voicing (and thus the folds and surrounding tissue are vibrating), the GEMS produces very little signal. As the turn-on time for the folds is usually quite rapid, voicing onset and offset may be reliably detected to within a few milliseconds. This ability to measure near-instant voicing onset and offset in a compact, low-power package would be quite useful in speech recognition and coding.

If the window is considered voiced, then a subroutine finds the position in time of all the zero crossings in the window. The first zero crossing is defined as the beginning of the "voiced speech" window, and the pitch is calculated from that first zero crossing to the third one. Linear interpolation of the zero crossings is used to increase accuracy, as near the zero crossing the GEMS signal is quite linear. The third zero crossing then becomes the beginning of the next window. Thus the windows are fixed in length at 35 ms until voicing is detected, and then the pitch of the speech dictates the window length. This allows us to do pitch-synchronous processing, enabling more accurate Fourier transforms and decreasing the number of calculations required. If the window is determined to be unvoiced, no further processing is needed.
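The windowing logic of the two preceding paragraphs (fixed 35 ms windows until voicing is found, then pitch-synchronous windows) can be sketched as follows. This is an illustrative reading of the description, not the code behind Figure C.4: the names `track_pitch` and `falling_crossings` are hypothetical, the threshold value is left to the caller, the signal is assumed to be normalized already, and treating a loud window with fewer than three crossings as unvoiced is an assumption of this sketch.

```python
import math


def falling_crossings(frame):
    """Interpolated positive-to-negative zero crossings, as fractional indices."""
    return [n + frame[n] / (frame[n] - frame[n + 1])
            for n in range(len(frame) - 1)
            if frame[n] > 0.0 and frame[n + 1] <= 0.0]


def track_pitch(signal, fs, threshold, window_s=0.035):
    """Return (start_sample, pitch_hz) pairs; a pitch of 0.0 marks an unvoiced window."""
    win = int(round(window_s * fs))
    results = []
    start = 0
    while start + win <= len(signal):
        frame = signal[start:start + win]
        mean_abs = sum(abs(v) for v in frame) / len(frame)  # average amplitude
        zc = falling_crossings(frame) if mean_abs >= threshold else []
        if len(zc) < 3:
            # Unvoiced (or too few crossings, an assumption): no further processing.
            results.append((start, 0.0))
            start += win
            continue
        # Two glottal cycles span the first to the third crossing.
        results.append((start + zc[0], 2.0 * fs / (zc[2] - zc[0])))
        # The third crossing becomes the beginning of the next window.
        start += int(zc[2])
    return results


# Stand-in data: 50 ms of silence followed by 200 ms of a 150 Hz sinusoid.
fs = 10_000.0
demo = [0.0] * 500 + [math.sin(2.0 * math.pi * 150.0 * n / fs) for n in range(2000)]
for position, pitch in track_pitch(demo, fs, threshold=0.1):
    print(f"{position / fs * 1000:7.2f} ms  {pitch:7.2f} Hz")
```

Note that unvoiced windows cost only one pass to compute the average amplitude, which is the source of the computational saving on speech with silent or unvoiced stretches.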

B. Autocorrelation

We used the clipped autocorrelation pitch detection algorithm described by Rabiner [8]. The data is first lowpass filtered with a 99-point linear phase FIR filter. It is then segmented into 30 ms rectangular windows which are stepped 10 ms at a time, resulting in a 20 ms overlap. Each window is tested to decide if it is voiced or unvoiced by an energy calculation, which is compared to a threshold. If it passes the threshold then the window is center clipped to 68% of the maximum. The autocorrelation is then computed and the location of the first peak that is 30% or more of the correlation at zero is considered to be the pitch period.
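A simplified sketch of the center-clipped autocorrelation step is given below. It covers only the clipping, the autocorrelation, and the 30% peak test; the 99-point lowpass filter and the energy-based voicing decision are omitted. Plain center clipping (rather than a three-level clipper), the 60 to 400 Hz search range, and the names `center_clip`, `autocorr`, and `autocorr_pitch` are assumptions made for illustration.

```python
import math


def center_clip(frame, fraction=0.68):
    """Zero everything within +/- fraction * max|x| and shift the rest toward zero."""
    level = fraction * max(abs(v) for v in frame)
    clipped = []
    for v in frame:
        if v > level:
            clipped.append(v - level)
        elif v < -level:
            clipped.append(v + level)
        else:
            clipped.append(0.0)
    return clipped


def autocorr(frame, lag):
    """Unnormalized autocorrelation of a rectangular-windowed frame at one lag."""
    return sum(frame[n] * frame[n + lag] for n in range(len(frame) - lag))


def autocorr_pitch(frame, fs, fmin=60.0, fmax=400.0, rel_threshold=0.30):
    """Pitch in Hz from the first qualifying autocorrelation peak, or 0.0 if none."""
    clipped = center_clip(frame)
    r0 = autocorr(clipped, 0)
    if r0 == 0.0:
        return 0.0
    lo = int(fs / fmax)
    hi = min(int(fs / fmin), len(clipped) - 2)
    r = {k: autocorr(clipped, k) for k in range(lo - 1, hi + 2)}
    for k in range(lo, hi + 1):
        is_peak = r[k] > r[k - 1] and r[k] >= r[k + 1]
        if is_peak and r[k] >= rel_threshold * r0:
            return fs / k  # lag in samples -> frequency
    return 0.0


# Stand-in frame: 30 ms of a pulse-like 100 Hz signal (five harmonics) at 10 kHz.
fs = 10_000.0
frame = [sum(math.cos(2.0 * math.pi * 100.0 * h * (n - 50) / fs) / h for h in range(1, 6))
         for n in range(300)]
print(autocorr_pitch(frame, fs))
```

The returned lag is an integer number of samples, so the raw estimate is quantized; that limitation is what the interpolation step discussed later in this section addresses.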

C. Cepstral

The cepstral method uses the Fourier transform (FT), but it is not a purely frequency-based algorithm as it uses the FT twice to get back into a type of time space [4]. The real (as opposed to complex) cepstral proceeds as follows:

The cepstral independent variable exists in the time domain and is known as the "quefrency". In essence, the magnitude of the second power spectrum will have peaks at quefrencies that correspond to the spacing of repeated peaks in the first power spectrum. The location of the first peak of sufficient amplitude is defined to be the pitch period. Thus a signal with a fundamental at 100 Hz and harmonics every 100 Hz will have peaks in X2 every 100 Hz, and will have a single peak in the cepstrum at a quefrency of 10 ms. As the cepstral method involves finding the power spectrum of a power spectrum, it needs a good number of harmonics in order to be effective. In our experiments, we used 40 ms Hamming windows with a 10 ms step, and thus a 30 ms overlap.
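As a rough illustration of the idea, the sketch below computes a real cepstrum in the standard way (power spectrum of the logarithm of the power spectrum) and picks a peak in the plausible pitch range. Several choices are assumptions rather than details taken from the text: the use of the logarithm of the power spectrum as the intermediate quantity, taking the largest peak in the 60 to 400 Hz range instead of the first peak of sufficient amplitude, the small constant added before the logarithm, and the names `dft` and `cepstral_pitch`. A naive DFT is used only to keep the example free of external libraries.

```python
import cmath
import math


def dft(x):
    """Naive O(N^2) discrete Fourier transform (adequate for one short frame)."""
    n = len(x)
    twiddle = [cmath.exp(-2j * math.pi * k / n) for k in range(n)]
    return [sum(x[t] * twiddle[(m * t) % n] for t in range(n)) for m in range(n)]


def cepstral_pitch(frame, fs, fmin=60.0, fmax=400.0):
    """Pitch in Hz from the largest cepstral peak between 1/fmax and 1/fmin."""
    n = len(frame)
    # Hamming window, as used for the 40 ms frames in the text.
    windowed = [v * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
                for i, v in enumerate(frame)]
    power = [abs(c) ** 2 for c in dft(windowed)]       # first power spectrum
    log_power = [math.log(p + 1e-12) for p in power]   # guard against log(0)
    ceps = [abs(c) ** 2 for c in dft(log_power)]       # second power spectrum
    lo, hi = int(fs / fmax), int(fs / fmin)
    q = max(range(lo, hi + 1), key=lambda k: ceps[k])  # quefrency index in samples
    return fs / q


# Stand-in frame: 40 ms at 10 kHz, 100 Hz fundamental with harmonics up to 4.9 kHz.
fs = 10_000.0
frame = [sum(math.cos(2.0 * math.pi * 100.0 * h * n / fs) for h in range(1, 50))
         for n in range(400)]
print(cepstral_pitch(frame, fs))
```

The quefrency index `q` counts samples, so index 100 at a 10 kHz sampling rate corresponds to the 10 ms quefrency mentioned in the text.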

Both conventional methods use peak finding to determine the pitch period. Finding the discrete peak is rather inaccurate at a sampling rate of 10 kHz. By taking the difference of the cepstral or autocorrelation vector as a first approximation to the derivative, we can interpolate and determine where the derivative would cross zero if it were linear. Using this zero crossing we can approximate where the peak is in-between the sampled points.
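To see why interpolation matters, note that at 10 kHz adjacent lags of 33 and 34 samples correspond to about 303 Hz and 294 Hz. The refinement described above can be sketched as below; the function name `refine_peak` is hypothetical, and the formula assumes that `k` is an interior local maximum of the vector.

```python
def refine_peak(values, k):
    """Sub-sample peak location from a linear fit to the first difference.

    The difference values[k] - values[k - 1] approximates the derivative at
    k - 0.5, and values[k + 1] - values[k] approximates it at k + 0.5. The
    peak is placed where the straight line through those two slope estimates
    crosses zero.
    """
    d_left = values[k] - values[k - 1]
    d_right = values[k + 1] - values[k]
    if d_left == d_right:
        return float(k)  # flat or not a peak: fall back to the sampled location
    return (k - 0.5) + d_left / (d_left - d_right)


# A sampled parabola whose true maximum lies between samples, at 10.3.
samples = [-(n - 10.3) ** 2 for n in range(21)]
k = max(range(1, 20), key=lambda i: samples[i])
print(k, refine_peak(samples, k))
```

For a locally parabolic peak the derivative is exactly linear, so this estimate coincides with ordinary three-point parabolic interpolation.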

For both acoustic methods it is also necessary to smooth the initial pitch contours as there can be large deviations in the calculated pitch. For this we used Rabiner’s standard smoothing algorithm [8] utilizing a 3,5 median filter and then a simple Hann linear filter. The GEMS pitch contour (due to its inherent stability and natural acoustic noise immunity) requires no smoothing.
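A minimal sketch of this kind of smoothing is shown below. It is an interpretation, not a transcription of the algorithm in [8]: the order of the two median passes (width 3, then width 5), the three-point Hann weights (1/4, 1/2, 1/4), the shrinking windows at the ends of the contour, and the names `running_median`, `hann3`, and `smooth_contour` are all assumptions.

```python
import statistics


def running_median(x, width):
    """Running median; the window shrinks at the ends of the contour (an assumption)."""
    half = width // 2
    return [statistics.median(x[max(0, i - half): i + half + 1]) for i in range(len(x))]


def hann3(x):
    """Three-point Hann smoother with weights 1/4, 1/2, 1/4; end points are kept."""
    if len(x) < 3:
        return list(x)
    middle = [0.25 * x[i - 1] + 0.5 * x[i] + 0.25 * x[i + 1] for i in range(1, len(x) - 1)]
    return [x[0]] + middle + [x[-1]]


def smooth_contour(pitch, widths=(3, 5)):
    """Nonlinear (median) smoothing followed by a short linear smoother."""
    y = list(pitch)
    for width in widths:
        y = running_median(y, width)
    return hann3(y)


# A flat 100 Hz contour with one octave error is restored to a flat contour.
contour = [100.0] * 5 + [200.0] + [100.0] * 5
print(smooth_contour(contour))
```

The median passes remove isolated outliers such as octave errors without blurring genuine pitch steps, and the linear pass then reduces small frame-to-frame jitter.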

III. Accuracy, stability, and robustness of the GEMS signal and algorithms

A. Tuning fork measurement

To demonstrate the accuracy of the GEMS signal, the motion of a vibrating tuning fork was measured with the GEMS and compared to the simultaneously recorded audio. A PC laptop with LabVIEW 4.0 and an A/D from National Instruments were used with a 10 kHz sampling frequency. All analysis was done with Matlab 5.1. Both the GEMS and the microphone were about 1 cm away from a vibrating tine. A portion of the normalized data is shown in Figure C.5. The audio signal is offset by +1 in the y direction to facilitate comparison. The GEMS signal mirrors the audio signal of the tine almost perfectly. We calculated the pitch using the GEMS zero-crossing algorithm every 2 cycles (about 6.1 ms) for 3 seconds on both the GEMS signal and the audio signal (this is permissible because the audio signal was a pure sinusoid). For the audio the mean was 329.2 Hz with a standard deviation of only 0.1 Hz. The GEMS signal also yielded an average of 329.2 Hz, with a standard deviation of 0.3 Hz. Thus, the experiment shows the GEMS is capable of excellent accuracy even at the low sampling rates commonly used for speech (10 kHz). At higher sampling rates the accuracy would increase, as we are locating an event in time rather than in frequency space, and higher sampling rates give better time resolution.

B. Comparison of the GEMS algorithm to the autocorrelation and cepstral methods

1. Synthetic signal

In order to directly compare the GEMS method to the audio-only methods outlined above, a synthetic signal s(t) was constructed with a variable fundamental frequency and 15 harmonics of the fundamental (as the cepstral algorithm must have a significant number of harmonics to be effective):

The fundamental frequency fk was varied from 80 Hz to 300 Hz and the sampling rate was defined to be 10 kHz. The length of s(t) was 100 ms and each method was allowed to find as many pitch points as it could in that time period. The mean of those points was defined as the pitch at that frequency, p(fk). The relative error was defined as

$$ e(f_k) = \frac{\left| p(f_k) - f_k \right|}{f_k} $$

The percent error vs. frequency plot is shown in Figure C.6. Note how the GEMS algorithm has the lowest level of error across almost the entire spectrum. The results are similar if interpolation is not used. Interpolation increases the accuracy by about the same amount for all methods.
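The test signal and error measure can be sketched as follows. The exact amplitudes and phases of the harmonics are not given here, so unit-amplitude sine components are an assumption, as are the use of the absolute deviation in the error and the names `synthetic_signal` and `percent_error`. With 15 harmonics above a 300 Hz fundamental, the highest component is at 4.8 kHz, which stays below the 5 kHz Nyquist frequency.

```python
import math


def synthetic_signal(f0, fs=10_000.0, duration=0.100, n_harmonics=15):
    """Fundamental plus `n_harmonics` harmonics, all unit amplitude (an assumption)."""
    n_samples = int(round(duration * fs))
    return [sum(math.sin(2.0 * math.pi * h * f0 * n / fs)
                for h in range(1, n_harmonics + 2))  # h = 1 is the fundamental
            for n in range(n_samples)]


def percent_error(pitch_points, f0):
    """Error of the mean of the pitch points relative to the true pitch, in percent."""
    mean_pitch = sum(pitch_points) / len(pitch_points)
    return 100.0 * abs(mean_pitch - f0) / f0


signal = synthetic_signal(100.0)
print(len(signal))                                   # number of samples in 100 ms
print(percent_error([99.0, 101.0, 103.0], 100.0))    # made-up pitch points, for illustration
```

Any of the pitch estimators under comparison can then be run on `synthetic_signal(fk)` for each fk in the 80 to 300 Hz range and scored with `percent_error`.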

In addition to being more accurate, the GEMS method also has a significant advantage in terms of cost of computation. In the above example (which was calculated in Matlab 5.1), the number of CPU flops (floating point operations) required to calculate the pitch was computed for all methods. The results are given in Table C.1, along with the mean of the standard deviation and the error across all fk for the synthetic signal. It is clear that the GEMS method requires far less computational power and is both more accurate and more precise than either of the conventional methods. For signals with unvoiced portions, the GEMS cost would be even lower as the GEMS algorithm does no processing on unvoiced windows. The other two methods must process the entire data stream, as they have no information about when the voiced speech begins or ends.

Method                 Cepstral    Autocor    GEMS

# kflops (average)     1100        1800       8

% error (average)      0.27        0.043      0.0083

Std. dev. (average)    0.071       0.153      0.052

Table C.1. Number of kflops required to determine the pitch for a 100 ms synthetic signal in Matlab 5.1, the average error in pitch, and the average standard deviation from the synthetic pitch (80 to 300 Hz) for the three methods.

This low computational cost is due to the simplicity of the time domain-based zero crossing algorithm, which is possible because the signal from the GEMS is clean, relatively simple (with a sharp feature), and is unaffected by acoustic noise.

2. Speech in quiet and noisy environments

In this noise sensitivity experiment, two male subjects were recorded speaking both a single vowel (/i/) and a sentence ("When all else fails, use force"). The audio and GEMS recordings were made at 40 kHz with no prefiltering on the same equipment used for the tuning fork experiment. The data was then filtered and decimated (using a distortion-free digital filter) to 10 kHz. One subject (the "speaker") was approximately 30 cm from the microphone and the GEMS was lightly touching the centerline of the neck directly below the laryngeal prominence. The second subject (the "noise") was seated approximately 60 cm from the microphone, approximating a 12 dB signal to noise ratio. The second subject spoke along with the first but delayed his onset in order to illustrate the difficulties experienced by acoustic-only pitch algorithms when a second speaker is present. All three methods were used to determine the pitch. The results are shown in Figures C.7 and C.8. Note how in the absence of noise (at the beginning of the speech), the three pitch contours are quite similar; in contrast, the introduction of the second (noisy) speaker is quite noticeable in the acoustic methods, which exhibit large errors. The GEMS contour is unaffected. Absolute acoustic noise rejection is one of the foremost attributes of the GEMS method.

IV. Conclusion

The GEMS approach has been shown to provide superior pitch information at a very low computational cost compared to that obtained through conventional (audio-only) means. The comparisons were made under ideal conditions; with a stressed speaker or rapid vibrato (where the pitch can change significantly over a few glottal cycles), the instantaneous pitch information available with the GEMS algorithm would be even more advantageous. The GEMS method is immune to acoustic noise and is non-invasive, safe, and portable. A "double-boom" configuration in which a headset with a microphone boom is modified with a second boom containing the GEMS antenna is being constructed and is one possible method of commercial implementation.

Potential applications for the GEMS’ unique characteristics include vocal training, training for the deaf, vocal stress detection, and prosodic supplementation for speech recognition engines (especially useful for tonal languages). The latter has been advocated in the past [9], but has never been implemented. The GEMS zero crossing algorithm makes low cost, very accurate, and robust pitch information (as well as the location of voicing onset/offset times) available for many applications.

This work was performed under the auspices of the U.S. Department of Energy by the Lawrence Livermore National Laboratory under contract No. W-7405-Eng-48 and by the University of California at Davis with support from the National Science Foundation.

References

[1] Holzrichter, J.F., Burnett, G.C., Ng, L.C., and Lea, W.A. (1998). "Speech articulator measurements using low power EM-wave sensors," J. Acoust. Soc. Am. 103(1), 622-625.

[2] McEwan, T.E. (1994). U.S. Patent No. 5,345,471 (1994), U.S. Patent No. 5,361,070 (1994).

[3] McEwan, T.E. (1996). U.S. Patent No. 5,573,012 (1996).

[4] Noll, A.M. (1967). "Cepstrum pitch determination," J. Acoust. Soc. Am. 41, 293-309.

[5] Rabiner, L.R., Cheng, M.J., Rosenberg, A.E., and McGonegal, C.A. (1976). "A comparative performance study of several pitch detection algorithms," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 24, 399-418.

[6] Rabiner, L.R. and Juang, B.-H. (1993). "Fundamentals of Speech Recognition" (Prentice Hall, Englewood Cliffs, NJ).

[7] Titze, I.R. (1994). "Principles of Voice Production" (Prentice Hall, Englewood Cliffs, NJ).

[8] Rabiner, L.R., Sambur, M.R., and Schmidt, C.E. (1975). "Applications of a nonlinear smoothing algorithm to speech processing," IEEE Trans. on Acoustics, Speech, and Signal Processing, Vol. 23, 552-557.

[9] Lea, W.A. (ed.) (1989). "Toward Robustness in Speech Recognition" (Speech Science Publications, Apple Valley, MN), see pp. 117-118, 499.


Figures

Figure C.1. GEMS placement for pitch measurements. Normally light skin contact is made but is not necessary.


Figure C.2. Audio and GEMS signals from a 29-year-old male native English speaker voicing /a/ ("ah").


Figure C.3. GEMS signal overlaid with the corresponding high-speed vocal fold video frames. Each bar is 30 microseconds wide and represents the exposure time of the frame.


Figure C.4. Block diagram of the GEMS zero-crossing algorithm.


Figure C.5. Normalized signals from a tuning fork. The audio (upper) is offset in the y direction to facilitate comparison to the GEMS signal (lower).


Figure C.6. Relative error vs. actual pitch for each pitch algorithm. A three second long synthetic signal with multiple harmonics was used. Cepstral (-x), autocorrelation (-o), and GEMS (x).


Figure C.7. Noisy (includes a second male speaker) audio signal (/i/) with pitch contours for GEMS, cepstral, and autocorrelation methods. The GEMS signal is unaffected by the noise.


Figure C.8. Noisy (includes a second male speaker) audio signal ("When all else fails, use force") with pitch contours for GEMS, cepstral, and autocorrelation methods. Again the GEMS signal is unaffected.
