Interspeech 2007 Session FrC.O3: Pitch extraction II
Type
oral
Date
Friday, August 31, 2007
Time
13:30 – 15:30
Room
Marble
Chair
Keikichi Hirose (University of Tokyo)
FrC.O3‑1
13:30
A Fine Pitch Model for Speech
Jasha Droppo, Microsoft Research
Alex Acero, Microsoft Research
An accurate model of the structure of speech is essential to many speech processing applications, including speech enhancement, synthesis, recognition, and coding. This paper explores some deficiencies of standard harmonic methods for modeling voiced speech. In particular, they ignore the effect of the fundamental frequency changing within an analysis frame, and the fact that the fundamental frequency is not a continuously varying parameter but a side effect of a series of discrete events. We present an alternative, time-series-based framework for modeling the voicing structure of speech, called the fine pitch model. By modeling the voicing structure precisely, it can account more accurately for the content of a voiced speech segment.
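A minimal numerical sketch (not the authors' model) of the abstract's key observation: if voiced excitation is a sequence of discrete events whose spacing drifts, the local pitch period can change noticeably even inside one short "stationary" analysis frame. The sample rate, frame length, and F0 glide below are illustrative assumptions.

```python
import numpy as np

# Voiced excitation as discrete events: each impulse is placed one local
# pitch period after the previous one, while F0 glides linearly.
fs = 16000                      # sample rate (Hz), assumed
frame_len = int(0.025 * fs)     # a single 25 ms analysis frame

def pulse_train(f0_start, f0_end, n_samples, fs):
    """Place impulses one local period apart while F0 glides linearly."""
    pulses, t = [], 0
    while t < n_samples:
        pulses.append(t)
        frac = t / n_samples
        f0 = f0_start + (f0_end - f0_start) * frac   # local F0 at this event
        t += int(round(fs / f0))                     # next event one period on
    x = np.zeros(n_samples)
    x[pulses] = 1.0
    return x, pulses

x, pulses = pulse_train(100.0, 140.0, frame_len, fs)
periods = np.diff(pulses)       # inter-pulse intervals in samples
print(periods)                  # the period shrinks within the single frame
```

A fixed-F0 harmonic model fitted to this frame would average over those two distinct periods, which is exactly the deficiency the abstract points at.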
FrC.O3‑2
13:50
Pitch Period Estimation using Multipulse Model and Wavelet Transform
Prasanta Kumar Ghosh, Student, Speech Analysis and Interpretation Laboratory, Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089
Antonio Ortega, Professor, Signal and Image Processing Institute, Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089
Shrikanth Narayanan, Professor, Speech Analysis and Interpretation Laboratory, Department of Electrical Engineering, University of Southern California, Los Angeles, CA 90089
Wavelet-transform-based pitch period estimation is well known in the literature. This approach assumes that glottal closures are correlated with the maxima in adjacent scales of the wavelet transform, so pitch period estimation requires detecting these correlated maxima across scales, a step that is often error-prone, especially for noisy signals. In this paper, we develop an optimization scheme in the wavelet framework using a multipulse excitation model of the speech signal, and the pitch period is obtained as the result of this optimization. We report experiments under both clean and noisy conditions and show that the proposed optimization outperforms the widely used heuristic approach to maxima detection.
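A toy sketch of the heuristic baseline the paper improves on: take a dyadic wavelet-like transform (here a Haar-style smoothed difference at two scales, an assumption for illustration), find local maxima in adjacent scales, keep the maxima that line up across scales as candidate glottal closure instants, and read the pitch period off their spacing.

```python
import numpy as np

fs = 8000
period = 80                              # true pitch period: 100 Hz at 8 kHz
x = np.zeros(fs // 10)                   # 100 ms of signal
x[::period] = 1.0                        # impulse-train stand-in for voicing

def haar_scale(sig, s):
    """Haar-wavelet-like response at scale s: a smoothed difference."""
    k = np.concatenate([np.ones(s), -np.ones(s)]) / s
    return np.convolve(sig, k, mode="same")

def local_maxima(w, thresh):
    """Indices of strict-left / plateau-right local maxima above thresh."""
    idx = np.flatnonzero((w[1:-1] > w[:-2]) & (w[1:-1] >= w[2:])
                         & (w[1:-1] > thresh))
    return idx + 1

w1, w2 = haar_scale(x, 4), haar_scale(x, 8)
m1 = local_maxima(w1, 0.5 * w1.max())
m2 = local_maxima(w2, 0.5 * w2.max())
# keep fine-scale maxima that have a coarse-scale maximum within 4 samples
matched = [i for i in m1 if np.min(np.abs(m2 - i)) <= 4]
est_period = int(np.median(np.diff(matched)))
print(est_period)
```

On clean impulses the matching step is trivial; in noise, spurious maxima break the across-scale alignment, which is the failure mode the paper's optimization scheme targets.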
FrC.O3‑3
14:10
Combining Rate and Place Information for Robust Pitch Extraction
Martin Heckmann, Honda Research Institute Europe GmbH
Frank Joublin, Honda Research Institute Europe GmbH
Christian Goerick, Honda Research Institute Europe GmbH
In this paper we propose an algorithm for robust pitch extraction that combines temporal (rate) and pattern-matching (place) techniques. After a transformation into the spectral domain via a Gammatone filter bank, the rate information is extracted in each band from the zero-crossing distances in that band. Next, a comb filter with teeth at the harmonics of the current fundamental frequency hypothesis is set up, reflecting the pattern-matching aspect. The signals beneath the teeth of the comb filter are analyzed for consistency. This yields an allocation pattern for the filter. The current allocation pattern is compared to prototypical ones, allowing the suppression of side peaks at harmonics and sub-harmonics of the true fundamental. A comparison with a state-of-the-art autocorrelation-based algorithm shows significantly better results for our algorithm.
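An illustrative sketch of the "place" half of the idea only: score each fundamental-frequency hypothesis with a spectral comb whose teeth sit at the hypothesis' harmonics, and keep the best-scoring candidate. The Gammatone front end and the allocation-pattern matching of the paper are omitted; the synthetic frame and all parameters below are assumptions.

```python
import numpy as np

fs, n = 8000, 2048
t = np.arange(n) / fs
f0_true = 220.0
# synthetic voiced frame: 5 harmonics with 1/h amplitudes
frame = sum(np.sin(2 * np.pi * f0_true * h * t) / h for h in range(1, 6))
spec = np.abs(np.fft.rfft(frame * np.hanning(n)))

def comb_score(spec, f0, fs, n, n_teeth=5):
    """Mean spectral magnitude under comb teeth placed at h * f0."""
    bins = [int(round(h * f0 * n / fs)) for h in range(1, n_teeth + 1)]
    return np.mean([spec[b] for b in bins if b < len(spec)])

candidates = np.arange(80.0, 400.0, 1.0)
scores = [comb_score(spec, f0, fs, n) for f0 in candidates]
best = candidates[int(np.argmax(scores))]
print(best)   # close to the true 220 Hz
```

The consistency analysis and prototype matching in the paper exist precisely because this bare comb score also responds at sub-harmonics of the true fundamental.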
FrC.O3‑4
14:30
Integrating pitch and localisation cues at a speech fragment level
Heidi Christensen, University of Sheffield
Ning Ma, University of Sheffield
Stuart Wrigley, University of Sheffield
Jon Barker, University of Sheffield
This paper proposes a novel speech-fragment-based approach for processing binaural data to improve the estimation of speech source locations in reverberant, multi-speaker recordings. The technique employs two stages. First, a robust multi-pitch tracking algorithm is used to locate local spectro-temporal 'speech fragments', regions where the energy in the mixture is dominated by a single speech source. Second, robust localisation estimates are formed by integrating interaural time difference cues over each speech fragment. The technique is applied to the analysis of more than five hours of two-party meetings constructed from a mixture of binaural mannequin recordings. It is shown that estimating location at the speech-fragment level produces better results than conventional location-estimate smoothing techniques, leading to an increase in relative frame accuracy rate of more than 35%.
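A toy sketch of the second stage in isolation: estimate an interaural time difference (ITD) as the lag maximizing the cross-correlation between left and right channels, computed over the whole span of one fragment rather than a single short frame. The noise-like "speech", the 3-sample true lag, and the lag range are synthetic assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(0)
fs, true_lag = 16000, 3                    # right channel lags left by 3 samples
left = rng.standard_normal(fs // 4)        # 250 ms fragment, noise-like stand-in
right = np.roll(left, true_lag)

def itd_samples(l, r, max_lag=20):
    """Lag (in samples) maximizing the circular cross-correlation of l and r."""
    lags = np.arange(-max_lag, max_lag + 1)
    xc = [np.dot(l, np.roll(r, -k)) for k in lags]
    return int(lags[int(np.argmax(xc))])

# Pooling the correlation over the full fragment makes the peak at the true
# lag stand out far more than a per-frame estimate would in reverberation.
print(itd_samples(left, right))
```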
FrC.O3‑5
14:50
Speech fundamental frequency estimation using the Alternate Comb
Jean-Sylvain Lienard, LIMSI-CNRS
Francois Signol, LIMSI-CNRS
Claude Barras, LIMSI-CNRS and Paris XI University
Reliable estimation of the speech fundamental frequency is crucial for speech separation. We show that gross errors in F0 measurement occur for particular configurations of the periodic structure to be estimated and of the other periodic structure used to perform the estimation. The error families are characterized by a pair of positive integers. The Alternate Comb method uses this knowledge to cancel most of the erroneous solutions. Its efficiency is assessed by an evaluation on a classical pitch database.
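A small numerical illustration (not the Alternate Comb itself) of why a plain comb commits gross errors in integer-ratio families: every second tooth of a comb placed at F0/2 lands exactly on a harmonic of the true F0, so the sub-harmonic hypothesis scores about as well as the true one. All signal parameters below are assumptions.

```python
import numpy as np

fs, n, f0 = 8000, 4096, 200.0
t = np.arange(n) / fs
frame = sum(np.sin(2 * np.pi * f0 * h * t) for h in range(1, 9))  # 8 harmonics
spec = np.abs(np.fft.rfft(frame * np.hanning(n)))

def comb_score(f0_hyp, n_teeth=16):
    """Mean spectral magnitude under comb teeth at h * f0_hyp."""
    bins = [int(round(h * f0_hyp * n / fs)) for h in range(1, n_teeth + 1)]
    return float(np.mean([spec[b] for b in bins if b < len(spec)]))

s_true = comb_score(200.0)   # comb at the true F0
s_half = comb_score(100.0)   # comb at F0/2: even teeth still hit every harmonic
print(s_true, s_half)        # the sub-harmonic scores about as well
```

This is one member of the integer-pair error families the abstract refers to; the Alternate Comb is designed to cancel such spurious solutions rather than rank them.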
FrC.O3‑6
15:10
Detecting Pitch Accent Using Pitch-corrected Energy-based Predictors
Andrew Rosenberg, Columbia University
Julia Hirschberg, Columbia University
Previous work has shown that the energy components of frequency subbands with a variety of frequencies and bandwidths predict pitch accent with varying degrees of accuracy, and produce correct predictions on distinct subsets of data points. In this paper, we describe a series of experiments exploring techniques to leverage the predictive power of these energy components by including pitch and duration features, other known correlates of pitch accent. We perform these experiments on Standard American English read, spontaneous, and broadcast news speech, each corpus containing at least four speakers. Using an approach in which we correct energy-based predictions with pitch and duration information before applying a majority-voting classifier, we were able to detect pitch accent in read, spontaneous, and broadcast news speech at 84.0%, 88.3%, and 88.5% accuracy, respectively. Human performance on pitch accent detection is generally taken to be between 85% and 90%.
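A minimal sketch of the combination scheme's final step as described: several energy-band predictors each cast a binary accent vote per syllable, a correction step overrides a negative vote when pitch evidence strongly contradicts it, and a majority vote gives the final label. The predictors, votes, pitch values, and the 0.85 threshold are all hypothetical stand-ins, not the paper's features.

```python
import numpy as np

votes = np.array([            # rows: 3 band predictors; cols: 5 syllables
    [1, 0, 1, 1, 0],
    [1, 0, 0, 1, 0],
    [0, 0, 1, 1, 1],
])
pitch_excursion = np.array([0.9, 0.1, 0.8, 0.95, 0.05])  # normalized, assumed

# correction: flip a predictor's 0-vote to 1 where pitch evidence is strong
corrected = np.where((votes == 0) & (pitch_excursion > 0.85), 1, votes)
final = (corrected.sum(axis=0) >= 2).astype(int)          # majority of 3
print(final.tolist())
```

The point of correcting before voting, rather than adding pitch as a fourth voter, is that the pitch evidence repairs individual predictors' mistakes instead of merely outvoting them.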