EuroTopics 99

Notes on EuroSpeech 99, arranged by topic

Here are my notes on some of the presentations at the EuroSpeech 99 conference, arranged by topic. The topic headings in this document are: Miscellaneous, PsychoAcoustics, Acoustic Phonetics, Lexical Access From Features [LAFF], Speech Synthesis, and Speech Coding.

Miscellaneous

S1.PO2.11 Estimating velum height from acoustic data. Uses magnetometer data similar to EMMA. [EMMA]

Session S4.PO2 includes several interesting papers on prosodic measurements and data structures. [Prosody]

S9.PO3.2 When speech is transcribed in a noisy environment, grammatical context overrides speech information, and accuracy decreases (cf. Hollein 92). [Sensimetrics]

S10.PO2.11 Prosodic modeling for emotional content in a Spanish synthesizer (concatenative synthesis). [Prosody]

S11.PO3.4 Microphone arrays for speech acquisition, using a novel spatial reference beamformer. [Sensimetrics]

PsychoAcoustics

S8.PO3 and S9.PO3 are on Speech Perception, including many interesting posters.

S9.PO3.5 Formant transition masking in noise. Perception tends not to depend on the change of frequency, only on energy and final frequency.

S11.OR2.2 Separating speech & music signals, using auditory modeling and iterative analysis by synthesis. Application to computational auditory scene analysis (CASA).

Acoustic Phonetics

S3.OR3.1 uses phonetograms to display acoustic register data. Chest voice has relatively low F0 and high intensity; falsetto has relatively high F0 and low intensity.

S8.PO3.7 Changes in F0 do not affect the ability of listeners to discriminate formants.

S9.PO3.10 Priming phenomena provide evidence for caching during production.

Lexical Access From Features [LAFF]

S1.PO1.5 Syllable Onset Detection. They use a Multi Layer Perceptron (MLP) to detect Portuguese syllable onsets. References Mike Shire & Steve Greenberg's work.

Session S2.PO1 is all about confidence measures for speech recognition. This is a valuable topic for LAFF work.

S2.PO1.14 compares three measures: word graph (requires endpoints), N-best (doesn't require endpoints, at the cost of some increase in computation, but otherwise better than word graph), and acoustic confidence (mentioned, but no details).
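
To make the N-best idea concrete, here is a minimal sketch of one common way to get a word confidence from an N-best list: normalize the hypothesis scores into posteriors and add up the posterior mass of every hypothesis that contains the word. This is an illustration only, not necessarily the measure used in the paper. The function name, the toy hypotheses, and the log scores are all made up. Because it looks only at which words occur in each hypothesis, it needs no word endpoints.

```python
import math

def nbest_word_confidence(nbest):
    """nbest: list of (words, log_score) pairs. Returns {word: confidence}.

    Confidence of a word is the share of posterior mass held by the
    hypotheses that contain it (word position is ignored).
    """
    # Softmax over hypothesis log scores; subtract the max for stability.
    top = max(score for _, score in nbest)
    weights = [math.exp(score - top) for _, score in nbest]
    total = sum(weights)
    confidence = {}
    for (words, _), weight in zip(nbest, weights):
        for word in set(words):  # count each word once per hypothesis
            confidence[word] = confidence.get(word, 0.0) + weight / total
    return confidence

# Hypothetical 3-best list with made-up log scores.
hyps = [
    (["recognize", "speech"], -10.0),
    (["wreck", "a", "nice", "beach"], -11.0),
    (["recognize", "beach"], -12.0),
]
conf = nbest_word_confidence(hyps)
```

In this toy list "recognize" appears in the first and third hypotheses, so it ends up with more confidence than "beach", which appears only in the two lower-scoring ones.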

S2.PO1.10 emphasizes the importance of vocal tract length normalization before any feature extraction.
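
For reference, a sketch of what a vocal tract length normalization step can look like. My notes do not record which warping the paper uses, so this shows a widely used piecewise-linear frequency warp as an assumption: frequencies are scaled by a per-speaker factor alpha up to a breakpoint, then mapped linearly so that the top of the band stays fixed. The function name, the 8000 Hz band edge, and the 0.85 knee are illustrative choices.

```python
def vtln_warp(freq_hz, alpha, f_max=8000.0, knee=0.85):
    """Piecewise-linear frequency warp; alpha is the speaker warp factor."""
    # Breakpoint chosen so the scaled value at the knee stays below f_max.
    f0 = knee * f_max * min(1.0, 1.0 / alpha)
    if freq_hz <= f0:
        return alpha * freq_hz
    # Second segment joins (f0, alpha * f0) to (f_max, f_max).
    slope = (f_max - alpha * f0) / (f_max - f0)
    return alpha * f0 + slope * (freq_hz - f0)

# Example: warp a few formant-range frequencies for a hypothetical alpha of 1.1.
warped = [vtln_warp(f, 1.1) for f in (500.0, 1500.0, 2500.0)]
```

The warp would be applied to the filterbank center frequencies (or to the spectrum itself) before features are computed, which is the ordering the paper argues for.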

S3.OR3.5 is a nice presentation of how to quantify reduction effects (save the talker effort, subject to the needs of the listener). Uses N-gram for prediction. Duration is more affected than other parameters.
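
The prediction side of this can be sketched with a bigram model: the more predictable a word is from its predecessor, the lower its surprisal, and the more reduction the talker can presumably get away with. The exact model in the paper is not in my notes, so the add-one smoothing, the function name, and the toy corpus below are assumptions for illustration.

```python
import math
from collections import Counter

def bigram_surprisal(corpus, prev, word):
    """Surprisal in bits of `word` following `prev`, with add-one smoothing."""
    unigrams = Counter(corpus)
    bigrams = Counter(zip(corpus, corpus[1:]))
    vocab = len(unigrams)
    p = (bigrams[(prev, word)] + 1) / (unigrams[prev] + vocab)
    return -math.log2(p)

# Toy corpus; "cat" follows "the" more often than "mat" does.
corpus = "the cat sat on the mat and the cat ran".split()
s_cat = bigram_surprisal(corpus, "the", "cat")
s_mat = bigram_surprisal(corpus, "the", "mat")
```

Here "cat" after "the" gets lower surprisal than "mat" after "the", so under the reduction hypothesis it would be the better candidate for shortening.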

S6.OR2.1 demonstrates formant estimation using HMMs. Inverse filtering and resynthesis are among the uses; the demos were very nice.

S6.PO1.7 combines disparate knowledge sources with a blackboard architecture.

S7.PO3.2 is Bill Edmondson's presentation of Pseudo Articulatory Representations for the speech signal. This is a conceptual scheme very similar to LAFF, in which the level of information representation just above the acoustic signal is an ensemble of feature tracks (8 of which are continuous in time and value, and 4 are binary: labial, coronal, strident, anterior). The level above that seems to be syllable templates with slots that hold values corresponding to the feature tracks (I'm not sure exactly how). I spent a long time talking with him about the implications of this scheme versus LAFF, and commiserating about the uphill battle we both face. I will definitely be staying in touch with him.

S8.PO3.4 Epenthetic insertion of vowels is prelexical, not a lexical phenomenon.

S8.PO3.6 Human labelers are not particularly consistent when segmenting syllables in the speech signal.

S9.PO3.7 Evidence for symbolic phonemic coding before lexical contact.

S10.PO2.8 Innovative matching technique for orthography to phoneme conversion.

Speech Synthesis

S1.PO2.2 KTH poster on 3D modeling of the vocal tract from acoustic data. Combines Jay Moody's face synthesis with acoustic synthesis (laminar flow model) to produce 3D vocal tract shapes. Nice pictures! Interestingly, they use a total of about 10 parameters to describe the articulator movements, same number as HLSYN, though different parameters.

S6.OR.3 demonstrates vowel synthesis with variation in formant bandwidths within the pitch period (from subglottal loss and resonance), leading to frequency and amplitude modulation (FM and AM) of the resonators. Includes an excellent audio demo of the improved naturalness of synthesized vowels.

S10.PO2.4 Rule based synthesis, but the rules are generated, tested and modified automatically. Excellent combination of acoustic modeling with statistical systems!

S10.PO2.17 Neural network found to be better than rule based system for generating pronunciations, for most tasks.

S11.PO2.8 Speech synthesis using nonlinear filters, like a neural net approach but allowing inspection (not a black box).

S11.PO3.11 Hybrid of concatenative synthesis and LPC, using analysis to smooth parameters and resynthesize.

Speech Coding

S6.OR2.2 and S6.OR2.4 are modifications to McAulay & Quatieri's sinusoidal transform coding. S6.OR2.2 examines iterative analysis by synthesis as an alternative to FFT peak picking, while S6.OR2.4 investigates pitch and time scaling.
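
As a reminder of the baseline these papers modify, here is a minimal sketch of spectral peak picking for one analysis frame: take the magnitude spectrum and keep the local maxima as sinusoid candidates. It uses a naive DFT, no analysis window, and a toy frame whose two components fall exactly on DFT bins, all of which are simplifying assumptions; the function names are mine.

```python
import cmath
import math

def dft_magnitudes(frame):
    """Magnitude spectrum for bins 0..N/2 of a real frame (naive O(N^2) DFT)."""
    n = len(frame)
    return [
        abs(sum(x * cmath.exp(-2j * math.pi * k * t / n)
                for t, x in enumerate(frame)))
        for k in range(n // 2 + 1)
    ]

def pick_peaks(mags, floor):
    """Indices of local maxima in the magnitude spectrum that exceed `floor`."""
    return [
        k for k in range(1, len(mags) - 1)
        if mags[k] > floor and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]
    ]

# Toy frame: two sinusoids centered on bins 5 and 12 of a 64-point DFT.
N = 64
frame = [math.sin(2 * math.pi * 5 * t / N) + 0.5 * math.sin(2 * math.pi * 12 * t / N)
         for t in range(N)]
peaks = pick_peaks(dft_magnitudes(frame), floor=1.0)  # bins 5 and 12
```

A real coder would also read off the amplitude and phase at each peak and track peaks from frame to frame. Iterative analysis by synthesis replaces the one-shot peak list with repeated fit-and-subtract passes.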

S7.PO3.8 and S7.PO3.15 both implement variable rate codecs, via frame redundancy coding or variable size frames. (This is relevant to my work for VCT.)

Session S8.OR3 Wideband and Perceptual Coding is relevant for MPEG and other subband coding schemes. All of these papers are pretty interesting (to me anyway).

S8.OR3.1 Harmonic encoding at MOS 3.0 (compare to G.722 at 48 kbps, MOS 3.16). Innovative use of extra spectral shape data for high F0, when there are few harmonics to carry information.

S8.OR3.2 Multiband CELP (8 bands, fixed codebooks) with dynamic bit allocation, using perceptual model for masking thresholds.
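
The dynamic bit allocation idea can be sketched as a greedy loop: each bit goes to the band whose quantization noise currently sits furthest above its masking threshold. This is a generic illustration, not the paper's algorithm. It assumes the usual rule of thumb of about 6 dB less noise per bit, uses 4 bands rather than 8, and the band levels and thresholds are made up.

```python
def allocate_bits(signal_db, mask_db, total_bits, db_per_bit=6.0):
    """Greedy allocation: each bit goes to the band with the worst
    noise-to-mask ratio (noise level minus masking threshold, in dB)."""
    bits = [0] * len(signal_db)
    for _ in range(total_bits):
        # Estimated noise level per band drops db_per_bit for every bit spent.
        nmr = [s - db_per_bit * b - m
               for s, b, m in zip(signal_db, bits, mask_db)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        bits[worst] += 1
    return bits

# Hypothetical band levels and masking thresholds, in dB.
signal_db = [60.0, 50.0, 40.0, 30.0]
mask_db = [30.0, 35.0, 38.0, 40.0]
bits = allocate_bits(signal_db, mask_db, total_bits=8)
```

With these numbers the top band gets no bits at all, since its signal is already below the masking threshold. That is the point of driving the allocation from a perceptual model.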

S8.OR3.3 Nonlinear auditory model, providing substantial noise reduction. Good quality at 6 kbps, toll quality at 2 kbps. Good demo!

S8.OR3.4 MPEG style high quality music encoding. Dynamic bit allocation with joint optimization of quality & bit rate.

S8.OR3.5 Auditory model includes cochlear filtering and auditory nerve model. MOS 3.5 at 3.25 kbps, with no degradation up to SNR 15 dB. See www.ee.usf.hk/~eean for samples.

S10.PO3.8 Recognition of encoded speech is poor; matching system parameters such as frame rate is important.

S11.OR2.3 Automatic music detection and separation from speech sources. Generalized Markov model (for combination of features) works much better than any simple linear combination.

S11.PO3.2 Noise invariant representation for speech signals, using modified group delay without loss of resolution.

This page maintained by Wil Howitt
Last updated 18 September 1999