Engineering: Vowel Landmark Detector implementation

Scoring technique

Scoring is complicated by limitations of the acoustic transcription. For training, reference landmarks were generated from the acoustic transcription. Detected landmarks were then scored against these reference landmarks, yielding a Token Error Rate (TER).
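As a rough illustration of this kind of scoring, the sketch below counts deletions and insertions against reference vowel segments. The segment representation (start and end times in seconds), the rule that a detection is correct when it falls inside a segment, the normalization by the number of reference segments, and the name token_error_rate are all assumptions made for illustration; the text does not specify how landmarks were matched.

```python
def token_error_rate(vowel_segments, detections):
    """Score detected landmark times against reference vowel segments.

    vowel_segments: list of (start, end) times, one per reference vowel (assumed format).
    detections: list of detected landmark times.
    Returns (ter, deletions, insertions).
    """
    hits = [0] * len(vowel_segments)
    insertions = 0
    for t in detections:
        for i, (start, end) in enumerate(vowel_segments):
            if start <= t <= end:
                hits[i] += 1
                break
        else:
            # Landmark falls outside every reference vowel segment.
            insertions += 1
    # A segment with no landmark is a deletion.
    deletions = sum(1 for h in hits if h == 0)
    # Extra landmarks inside one segment also count as insertions.
    insertions += sum(h - 1 for h in hits if h > 1)
    ter = (deletions + insertions) / len(vowel_segments)
    return ter, deletions, insertions


# Hypothetical data: four reference vowels, four detections.
segments = [(0.10, 0.20), (0.35, 0.50), (0.70, 0.85), (1.00, 1.10)]
landmarks = [0.15, 0.42, 0.60, 0.75]
ter, deletions, insertions = token_error_rate(segments, landmarks)
```

In this made-up example the landmark at 0.60 s lies outside every segment (one insertion) and the last vowel has no landmark (one deletion).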

Baseline experiment

Mermelstein's original syllable detector was implemented, and its performance on TIMIT (26.2% TER) was substantially worse than the error rate Mermelstein reported (9.5%), probably because TIMIT is a more comprehensive data set. Changing the original energy band (500 to 4000 Hz) to the F1 range (300 to 900 Hz) reduced the error to 13.4% TER. It also made the post-processing step for detecting fricatives unnecessary, which simplified the algorithm.
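To make the effect of the band change concrete, the sketch below computes the energy of a single frame restricted to a frequency band, using a plain DFT. It is only an illustration: the 20 ms frame length, the absence of windowing, and the function name band_energy are assumptions, and the actual detector may compute its band-limited energy differently.

```python
import math


def band_energy(frame, sample_rate, low_hz, high_hz):
    """Sum of squared DFT magnitudes over bins centered in [low_hz, high_hz]."""
    n = len(frame)
    first_bin = math.ceil(low_hz * n / sample_rate)
    last_bin = math.floor(high_hz * n / sample_rate)
    energy = 0.0
    for k in range(first_bin, last_bin + 1):
        re = sum(x * math.cos(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        im = -sum(x * math.sin(2 * math.pi * k * i / n) for i, x in enumerate(frame))
        energy += re * re + im * im
    return energy


SAMPLE_RATE = 16000   # TIMIT is sampled at 16 kHz
FRAME_LENGTH = 320    # 20 ms frame (assumed)


def tone(freq_hz):
    """Synthetic test signal: one frame of a unit-amplitude sine."""
    return [math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE) for i in range(FRAME_LENGTH)]


f1_band_low_tone = band_energy(tone(600), SAMPLE_RATE, 300, 900)
f1_band_high_tone = band_energy(tone(2000), SAMPLE_RATE, 300, 900)
original_band_high_tone = band_energy(tone(2000), SAMPLE_RATE, 500, 4000)
```

A 2000 Hz component contributes to the original 500 to 4000 Hz band but is essentially invisible in the 300 to 900 Hz band, while a 600 Hz component is retained.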

Combining acoustic cues

Unlike Mermelstein's syllable detector, the VLD must generate a confidence score for each landmark, not just a binary decision. The three acoustic cues (peak-to-dip value, duration, and level) must therefore be combined into a single score. Manual combinations (linear and nonlinear) proved difficult to optimize and gave unsatisfactory results. A neural net (multilayer perceptron, or MLP) with one hidden layer of two units was found to be adequate.
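The forward pass of a network of this shape (three inputs, two hidden units, one output) can be sketched as follows. The sigmoid activations, the weight values, and the function name mlp_score are assumptions for illustration; the text gives only the layer sizes.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp_score(cues, hidden_weights, hidden_biases, output_weights, output_bias):
    """Combine three cues (peak-to-dip, duration, level) into one score.

    hidden_weights: two rows of three weights, one row per hidden unit.
    output_weights: two weights, one per hidden unit.
    """
    hidden = [
        sigmoid(sum(w * c for w, c in zip(row, cues)) + b)
        for row, b in zip(hidden_weights, hidden_biases)
    ]
    return sigmoid(sum(w * h for w, h in zip(output_weights, hidden)) + output_bias)


# Placeholder weights, not trained values.
HIDDEN_WEIGHTS = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
HIDDEN_BIASES = [0.0, 0.0]
OUTPUT_WEIGHTS = [1.0, 1.0]
OUTPUT_BIAS = 0.0

weak = mlp_score([0.0, 0.0, 0.0], HIDDEN_WEIGHTS, HIDDEN_BIASES, OUTPUT_WEIGHTS, OUTPUT_BIAS)
strong = mlp_score([1.0, 1.0, 1.0], HIDDEN_WEIGHTS, HIDDEN_BIASES, OUTPUT_WEIGHTS, OUTPUT_BIAS)
```

With these placeholder weights, larger cue values give a larger score. A network this small has only 11 free parameters (6 hidden weights, 2 hidden biases, 2 output weights, 1 output bias).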

Incorporating MLP into Detector

The MLP was optimized in two stages. First, it was trained using backpropagation with a minimum mean-squared-error criterion. The result was then reoptimized using gradient descent to minimize TER (which is not differentiable).

Evaluation and error analysis

When evaluated using canonical error categories (strict detection, insertion, and deletion), the final version of the VLD yields an error rate of about 38% (24% deletions, 14% insertions). When evaluated using more appropriate error categories (not counting deletions in VV context, and allowing skewed detections), the error rate drops to about 12% (8% deletions, 4% insertions).

Skewed detections happen mostly in semivowels, but also in obstruent consonants (1/4 of all skewed detections), and account for about 75% of all strict insertion errors. Deletions in VV context account for about 1/5 of all deletions.

Confidence scoring

Confidence scores can be generated either from the output of the MLP alone (scaled to fall between 0.0 and 1.0), or from a combination of the MLP output with the hard limit (using 1.0 when the decision is made by the hard limit, and the MLP output when the decision is made by the MLP). Neither is entirely consistent. Further work is needed in this area, either via a separate confidence estimator or by reworking the convex hull algorithm.
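The two scoring options described above can be sketched as a single function. The linear rescaling with clipping, the assumed raw output range of -1.0 to 1.0, and the names scale_to_unit and confidence are illustrative assumptions; the text does not say how the MLP output was scaled.

```python
def scale_to_unit(mlp_output, lo, hi):
    """Linearly map [lo, hi] onto [0.0, 1.0], clipping values outside the range."""
    if hi <= lo:
        raise ValueError("hi must be greater than lo")
    return min(1.0, max(0.0, (mlp_output - lo) / (hi - lo)))


def confidence(mlp_output, decided_by_hard_limit, lo=-1.0, hi=1.0, combine=True):
    """Confidence for one landmark.

    combine=False: use the scaled MLP output alone.
    combine=True: use 1.0 when the hard limit made the decision,
    otherwise the scaled MLP output.
    """
    if combine and decided_by_hard_limit:
        return 1.0
    return scale_to_unit(mlp_output, lo, hi)
```

For example, under these assumptions a raw MLP output of 0.0 maps to 0.5 when the MLP made the decision, but to 1.0 in the combined scheme when the hard limit made it. A jump of this kind between the two decision paths may be one source of the inconsistency noted above.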