Speaker:
Robert Rodman, Computer Science, NCSU
Abstract: The core of the talk will be a presentation of a new speaker recognition methodology, originally developed for automatic lip synchronization by the authors and their students at North Carolina State University over the past six years. We analyze the spectra of individual glottal pulses. The spectra are heavily processed to remove noise, then scaled and normalized so they can be compared with one another. The smoothed, noise-free spectra are treated as if they were probability density functions, and statistical moments are computed as a measure of their shapes. Two such measures, the first moment (mean) and the second central moment (variance), are used to plot points for fragments of speech sounds in a two-dimensional space.
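For concreteness, the following is a minimal sketch (not the authors' implementation) of how a smoothed, normalized spectrum might be treated as a probability density and reduced to a single (mean, variance) point; the array names, frequency grid, and toy spectrum are illustrative assumptions.

    import numpy as np

    def spectral_moments(spectrum, freqs):
        """Return (mean, variance) of a spectrum treated as a density over freqs."""
        spectrum = np.asarray(spectrum, dtype=float)
        freqs = np.asarray(freqs, dtype=float)
        pdf = spectrum / spectrum.sum()            # normalize so the values sum to 1
        mean = np.sum(freqs * pdf)                 # first moment
        var = np.sum(((freqs - mean) ** 2) * pdf)  # second central moment
        return mean, var

    # Example: one glottal pulse's smoothed spectrum yields one (mean, variance)
    # point; a sequence of pulses traces a path in the two-dimensional space.
    freqs = np.linspace(0.0, 4000.0, 512)                            # Hz, illustrative
    spectrum = np.exp(-((freqs - 1200.0) ** 2) / (2 * 300.0 ** 2))   # toy spectrum
    print(spectral_moments(spectrum, freqs))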
Experimentation has demonstrated to us a high degree of intra-speaker consistency. For example, when the same speaker utters "owie" (/auwi/) on different occasions, the paths traced in the two-dimensional space are nearly the same, whereas these paths appear to differ across speakers. Furthermore, when we look at single points obtained by averaging the values for a few glottal pulses taken from the middle of various sounds such as /i/, /z/ and /r/, we also observe lower intra-speaker variation than inter-speaker variation.
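To illustrate that comparison, here is a small sketch, again with purely hypothetical numbers, of averaging per-pulse (mean, variance) points into one point per utterance and measuring spread around a centroid as a crude stand-in for intra- versus inter-speaker variation.

    import numpy as np

    def average_point(pulse_points):
        """Average a list of (mean, variance) points from a few glottal pulses."""
        return np.mean(np.asarray(pulse_points, dtype=float), axis=0)

    def spread(points):
        """Mean distance of points from their centroid (a simple spread measure)."""
        pts = np.asarray(points, dtype=float)
        return np.linalg.norm(pts - pts.mean(axis=0), axis=1).mean()

    # Hypothetical data: each averaged point represents one repetition of a sound.
    speaker_a = [average_point([(1200, 9.0e4), (1210, 9.2e4)]),
                 average_point([(1195, 8.9e4), (1205, 9.1e4)])]
    speaker_b = [average_point([(1500, 1.4e5), (1510, 1.5e5)])]
    print("intra-speaker spread:", spread(speaker_a))
    print("inter-speaker spread:", spread(speaker_a + speaker_b))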
Part of what we will report is the result of quantifying these informal observations, to see whether statistically valid methods for speaker recognition can be derived from this approach. For example, we will look at metrics for determining the "distance" between two curved shapes in the plane. We will also examine geometric parameters such as the convex hull of points taken from an aggregate of voiced sounds from different speakers, to see whether it can serve as the basis of a statistically reliable computation for speaker recognition.
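As one concrete possibility, the sketch below computes a standard curve metric, the discrete Frechet distance, between two hypothetical (mean, variance) paths, and the convex hull of an aggregate of points via scipy.spatial.ConvexHull; these are illustrative choices, not necessarily the metrics the talk will present.

    import numpy as np
    from scipy.spatial import ConvexHull

    def discrete_frechet(p, q):
        """Discrete Frechet distance between two polylines of 2-D points."""
        p, q = np.asarray(p, float), np.asarray(q, float)
        n, m = len(p), len(q)
        ca = np.full((n, m), -1.0)
        for i in range(n):
            for j in range(m):
                d = np.linalg.norm(p[i] - q[j])
                if i == 0 and j == 0:
                    ca[i, j] = d
                elif i == 0:
                    ca[i, j] = max(ca[i, j - 1], d)
                elif j == 0:
                    ca[i, j] = max(ca[i - 1, j], d)
                else:
                    ca[i, j] = max(min(ca[i - 1, j], ca[i - 1, j - 1], ca[i, j - 1]), d)
        return ca[n - 1, m - 1]

    # Two hypothetical (mean, variance) paths for the same word spoken twice.
    path1 = [(1200, 9.0e4), (1300, 1.1e5), (1250, 1.0e5)]
    path2 = [(1210, 9.1e4), (1310, 1.1e5), (1240, 9.8e4)]
    print("path distance:", discrete_frechet(path1, path2))

    # Convex hull of an aggregate of voiced-sound points.
    points = np.array(path1 + path2)
    hull = ConvexHull(points)
    print("hull area:", hull.volume)  # for 2-D input, .volume is the enclosed area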
If the basic methodology proves viable, we will test it on disguised speech of some type or types to be determined. For example, we hypothesize that while certain types of disguise will change the spectrum, the basic shape of an individual's spectrum for a given sound may remain constant, or may undergo a shift that can be accounted for during speaker recognition. An informal study showed that "owie" spoken naturally, in falsetto, with the nostrils pinched, and whispered produced consistent paths within each disguise type, all starting and ending in approximately the same region.
This talk is based on joint work with D. McAllister, D. Bitzer, H. Fu, L. Cepeda, and M. Powell.
Short Bio: See the home page under Robert D. Rodman.