Engineering: Vowel Landmark Detector implementation
Scoring is complicated by limitations of the acoustic transcription.
For training, reference landmarks were generated from the acoustic
transcription, and detected landmarks were scored against the
reference landmarks to give a Token Error Rate (TER).
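
The scoring step can be illustrated with a minimal sketch. It assumes
that landmarks are time stamps in seconds, that a detection matches a
reference landmark when it falls within a fixed tolerance, and that
TER is (deletions + insertions) divided by the number of reference
landmarks. The function name, the tolerance value, and the greedy
matching rule are illustrative assumptions, not details taken from the
scoring procedure described here.

```python
def token_error_rate(detected, reference, tolerance=0.05):
    """Score detected landmark times against reference landmark times.

    Assumption: one-to-one greedy matching within +/- tolerance seconds.
    Returns (deletions, insertions, ter).
    """
    if not reference:
        raise ValueError("reference must contain at least one landmark")

    unmatched = sorted(detected)
    deletions = 0
    for ref in sorted(reference):
        # Detections close enough to this reference landmark.
        candidates = [t for t in unmatched if abs(t - ref) <= tolerance]
        if candidates:
            # Consume the closest detection so it cannot match twice.
            closest = min(candidates, key=lambda t: abs(t - ref))
            unmatched.remove(closest)
        else:
            deletions += 1

    # Detections left over matched no reference landmark.
    insertions = len(unmatched)
    ter = (deletions + insertions) / len(reference)
    return deletions, insertions, ter
```

With four reference landmarks, one missed and one spurious detection
give a TER of 2/4 under these assumptions.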
Mermelstein's original syllable detector was implemented, and its
performance on TIMIT (26.2% TER) was substantially worse than
Mermelstein reported (9.5% error), probably because of the more
comprehensive data set. Changing the original energy band (500 to
4000 Hz) to the F1 range (300 to 900 Hz) reduced error to 13.4% TER,
and made the post-processing step for detecting fricatives unnecessary,
simplifying the algorithm.
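
The effect of the band change can be sketched as a per-frame band
energy measurement. The example below is only an illustration: it uses
a Hamming window and a direct DFT over the bins inside the band, which
is an assumption and not necessarily the energy (loudness) measure
used in the detector. The function name and the default band edges
(300 to 900 Hz, from the text) are parameters of the sketch.

```python
import cmath
import math


def band_energy_db(frame, sample_rate, f_lo=300.0, f_hi=900.0):
    """Energy (dB) of one frame inside [f_lo, f_hi] Hz.

    Assumption: Hamming window plus a direct DFT restricted to the
    bins that fall inside the band.
    """
    n = len(frame)
    windowed = [
        x * (0.54 - 0.46 * math.cos(2.0 * math.pi * i / (n - 1)))
        for i, x in enumerate(frame)
    ]
    k_lo = math.ceil(f_lo * n / sample_rate)
    k_hi = math.floor(f_hi * n / sample_rate)

    energy = 0.0
    for k in range(k_lo, k_hi + 1):
        coeff = sum(
            x * cmath.exp(-2j * math.pi * k * i / n)
            for i, x in enumerate(windowed)
        )
        energy += abs(coeff) ** 2

    # Small floor avoids log10(0) on silent frames.
    return 10.0 * math.log10(energy + 1e-12)
```

A frame dominated by energy near 2000 Hz (as in many fricatives)
scores far lower in the 300 to 900 Hz band than a frame with a strong
component near 500 Hz, which is consistent with the fricative
post-processing becoming unnecessary.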
Unlike Mermelstein's syllable detector, the VLD must generate
confidence scores for landmarks (not just a binary decision). The
three acoustic cues (peak-to-dip value, duration, and level) must be
combined into a single confidence score. Manual combinations (linear
and nonlinear) proved difficult to optimize and gave unsatisfactory
results. A neural net (multilayer perceptron, or MLP) with one hidden
layer of two units was found to be adequate.
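
The network topology described above (three cue inputs, one hidden
layer of two units, one output) can be sketched as a forward pass. The
sigmoid activations and the weight values in the usage lines are
assumptions made for illustration; the text does not specify the
activation function or the trained weights.

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def mlp_score(cues, w_hidden, b_hidden, w_out, b_out):
    """Forward pass of a 3-2-1 MLP.

    cues:     [peak_to_dip, duration, level]
    w_hidden: two rows of three weights (one row per hidden unit)
    b_hidden: two hidden biases
    w_out:    two output weights; b_out: output bias
    Assumption: sigmoid activations on hidden and output units.
    """
    hidden = [
        sigmoid(sum(w * c for w, c in zip(row, cues)) + b)
        for row, b in zip(w_hidden, b_hidden)
    ]
    return sigmoid(sum(w * h for w, h in zip(w_out, hidden)) + b_out)


# Hypothetical weights, for illustration only (not trained values).
W_HIDDEN = [[0.8, 0.3, 0.1], [0.2, 0.5, 0.4]]
B_HIDDEN = [-1.0, -0.5]
W_OUT = [1.5, 1.0]
B_OUT = -1.2

score = mlp_score([2.0, 0.12, 0.7], W_HIDDEN, B_HIDDEN, W_OUT, B_OUT)
```

With a sigmoid output unit the score already lies strictly between 0
and 1, which is convenient if it is later reused as a confidence value.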
The MLP was optimized in two stages. First it was trained by
backpropagation to minimize mean-squared error. The resulting
weights were then reoptimized using gradient descent to minimize TER
directly (TER itself is not differentiable).
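
Because TER is a count-based metric, it is piecewise constant in the
network weights, so the second stage cannot use analytic gradients.
The sketch below is a derivative-free stand-in for that stage, not the
procedure actually used: it perturbs one parameter at a time, keeps a
change only when the error count drops, and halves the step size when
nothing improves. The function name, step schedule, and the error_fn
callback are all hypothetical.

```python
def refine_by_perturbation(params, error_fn, step=0.5, shrink=0.5,
                           min_step=1e-3, max_rounds=1000):
    """Lower a non-differentiable error by coordinate perturbation.

    params:   starting parameters (e.g. weights from the first stage)
    error_fn: maps a parameter list to an error value such as TER
    Assumption: this replaces true gradient descent for illustration.
    """
    best = list(params)
    best_err = error_fn(best)
    rounds = 0
    while step >= min_step and rounds < max_rounds:
        rounds += 1
        improved = False
        for i in range(len(best)):
            for delta in (step, -step):
                trial = list(best)
                trial[i] += delta
                err = error_fn(trial)
                if err < best_err:
                    best, best_err, improved = trial, err, True
        if not improved:
            # No single-parameter move helped at this scale.
            step *= shrink
    return best, best_err
```

Starting from the first-stage weights matters here: a search of this
kind only finds nearby improvements, so the mean-squared-error
solution serves as the initial point.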
When evaluated using canonical error categories (strict detection,
insertion, and deletion), the final version of the VLD yields an error
rate of about 38% (24% deletions, 14% insertions). When evaluated
using more appropriate error categories (not counting deletions in VV
context, and allowing skewed detections), the error rate drops to
about 12% (8% deletions, 4% insertions).
Skewed detections happen mostly in semivowels, but also in obstruent
consonants (1/4 of all skewed detections), and account for about 75%
of all strict insertion errors.
Deletions in VV context account for about 1/5 of all deletions.
Confidence scores can be generated either from the output of the MLP
alone (scaled to fall between 0.0 and 1.0), or from a combination of
the MLP output with the hard limit (using 1.0 when the decision is
made by hard limit, and using the MLP output when the decision is made
by the MLP). Neither is entirely consistent. Further work is needed
in this area, either through a separate confidence estimator or by
reworking the convex hull algorithm.
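
The two confidence options described above can be written as one small
function. This is a sketch under stated assumptions: the raw MLP
output range (out_min, out_max) and the flag indicating that the hard
limit made the decision are hypothetical parameters introduced for the
example.

```python
def confidence(mlp_output, decided_by_hard_limit,
               out_min=0.0, out_max=1.0, combine=True):
    """Confidence score for one landmark, in [0.0, 1.0].

    combine=False: use the scaled MLP output alone.
    combine=True:  use 1.0 when the hard limit made the decision,
                   otherwise the scaled MLP output.
    """
    if combine and decided_by_hard_limit:
        return 1.0
    # Linear rescaling of the raw MLP output, clipped to [0, 1].
    scaled = (mlp_output - out_min) / (out_max - out_min)
    return min(1.0, max(0.0, scaled))
```

In the combined mode a landmark accepted by the hard limit always
receives 1.0, regardless of the MLP output for the same landmark.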