EuroNotes 99

Notes on EuroSpeech 99, in chronological order

Here are my notes on presentations at the EuroSpeech 99 conference, in chronological order. Much of this conference is devoted to speech recognition systems, almost all of them HMMs or neural nets with various tweaks. While I find HMMs per se not particularly interesting, the evaluation methods and development tools will probably come in handy for LAFF work, and in any case we should be aware of what's available.

What follows are the papers that I found of particular interest, for whatever reason.

Monday, 6 September: Conference day 1

Keynote speech: Fred Jelinek, "Putting Language into Language Modeling." He talked about adding structured grammars to the classic trigram model. Very good, a step in the right direction from my point of view.
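As a reminder of the baseline being extended here, the classic trigram model estimates each word's probability from the two words before it. The following is only a minimal sketch of the unsmoothed maximum-likelihood version; the function names and toy corpus are made up for illustration and are not from the talk, and any real system would add smoothing or backoff for unseen trigrams.

```python
from collections import Counter

def train_trigram(sentences):
    """Count trigrams and their two-word contexts, padding sentence boundaries."""
    tri, bi = Counter(), Counter()
    for words in sentences:
        padded = ["<s>", "<s>"] + list(words) + ["</s>"]
        for i in range(2, len(padded)):
            context = (padded[i - 2], padded[i - 1])
            tri[context + (padded[i],)] += 1
            bi[context] += 1
    return tri, bi

def trigram_prob(tri, bi, w1, w2, w3):
    """Maximum-likelihood estimate of P(w3 | w1, w2); 0.0 for unseen contexts."""
    context_count = bi[(w1, w2)]
    if context_count == 0:
        return 0.0
    return tri[(w1, w2, w3)] / context_count

# Hypothetical toy corpus, for illustration only
corpus = [["the", "dog", "barks"], ["the", "dog", "sleeps"], ["the", "cat", "sleeps"]]
tri, bi = train_trigram(corpus)
p = trigram_prob(tri, bi, "the", "dog", "barks")  # 1 of the 2 "the dog" contexts
```

The model sees nothing beyond a two-word window, which is the limitation that adding grammatical structure is meant to address.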

S1.PO1.5 Syllable Onset Detection. They use a multilayer perceptron (MLP) to detect Portuguese syllable onsets. References Mike Shire & Steve Greenberg's work. [LAFF]

S1.PO2.2 KTH poster on 3D modeling of the vocal tract from acoustic data. Combines Jay Moody's face synthesis with acoustic synthesis (laminar flow model) to produce 3D vocal tract shapes. Nice pictures! Interestingly, they use a total of about 10 parameters to describe the articulator movements, same number as HLSYN, though different parameters. [Synthesis]

S1.PO2.11 Estimating velum height from acoustic data. Uses magnetometer data similar to EMMA. [EMMA]

Session S2.PO1 is all about confidence measures for speech recognition. This is a valuable topic for LAFF work.

S2.PO1.14 compares three measures: Word graph (requires endpoints), N-best (doesn't require endpoints, at cost of some increase in computation, but otherwise better than word graph), and acoustic confidence (mentioned, but no details). [LAFF]
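To make the N-best idea concrete: one generic way to get a word confidence from an N-best list is to normalize the hypothesis scores into posteriors and sum the posteriors of every hypothesis that contains the word. The sketch below assumes that formulation, which may well differ from the paper's. The function name, the scores, and the matching by word identity alone (which is what lets it skip endpoints) are all illustrative assumptions.

```python
import math

def nbest_word_confidence(nbest):
    """nbest: list of (word_list, log_score). Returns {word: confidence in [0, 1]}."""
    # Softmax over hypothesis log scores (subtract the max for numerical stability)
    max_score = max(score for _, score in nbest)
    weights = [math.exp(score - max_score) for _, score in nbest]
    total = sum(weights)
    confidence = {}
    for (words, _), weight in zip(nbest, weights):
        for word in set(words):  # count each word once per hypothesis
            confidence[word] = confidence.get(word, 0.0) + weight / total
    return confidence

# Hypothetical 3-best list whose posteriors work out to 0.5, 0.3, 0.2
nbest = [
    (["call", "home", "now"], math.log(0.5)),
    (["call", "home", "no"], math.log(0.3)),
    (["fall", "home", "now"], math.log(0.2)),
]
conf = nbest_word_confidence(nbest)
```

Words that survive in every hypothesis ("home") come out near 1, while words that appear only in low-scoring alternatives ("fall") come out low.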

S2.PO1.10 emphasizes the importance of vocal tract length normalization before any feature extraction. [LAFF]

S3.OR3.1 uses phonetograms to display acoustic register data. Chest voice is relatively low F0 and high intensity, falsetto is relatively high F0 and low intensity. [AcouPhon]

S3.OR3.5 is a nice presentation of how to quantify reduction effects (save the talker effort, subject to the needs of the listener). Uses N-gram for prediction. Duration is more affected than other parameters. [LAFF]

Tuesday, 7 September: Conference day 2

Keynote speech: Maria Gosy, "The connection between speech production and perception." She presents lots of data on speech disorders, much of which seems to contradict the motor theory idea that perception (or at least comprehension) is closely coupled to production.

Session S4.PO2 includes several interesting papers on prosodic measurements and data structures.

S6.OR2.1 demonstrates formant estimation using HMMs. Inverse filtering and resynthesis are among the uses; the demos were very nice. [LAFF]

S6.OR2.2 and S6.OR2.4 are modifications to McAulay & Quatieri's sinusoidal transform coding. S6.OR2.2 examines iterative analysis by synthesis as an alternative to FFT peak picking, while S6.OR2.4 investigates pitch and time scaling. [Coding]
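For context, the FFT peak picking step in sinusoidal analysis amounts to finding local maxima of the magnitude spectrum and treating each one as a sinusoid. Here is a rough sketch of that baseline step only, not of either paper's method. It uses a naive DFT, no analysis window, and a test frame whose two sinusoids sit exactly on bin centers; a practical coder would window the frame and interpolate peak positions. All names and values are illustrative.

```python
import cmath
import math

def magnitude_spectrum(frame):
    """Naive O(N^2) DFT magnitudes for bins 0..N/2 (fine for a short demo frame)."""
    n = len(frame)
    mags = []
    for k in range(n // 2 + 1):
        acc = sum(x * cmath.exp(-2j * math.pi * k * t / n) for t, x in enumerate(frame))
        mags.append(abs(acc))
    return mags

def pick_peaks(mags, floor):
    """Return bin indices that are local maxima and exceed the magnitude floor."""
    return [k for k in range(1, len(mags) - 1)
            if mags[k] > floor and mags[k] > mags[k - 1] and mags[k] >= mags[k + 1]]

# Hypothetical test frame: two sinusoids at bins 5 and 12, the second at half amplitude
n = 64
frame = [math.sin(2 * math.pi * 5 * t / n) + 0.5 * math.sin(2 * math.pi * 12 * t / n)
         for t in range(n)]
mags = magnitude_spectrum(frame)
peaks = pick_peaks(mags, floor=1.0)  # bins 5 and 12
```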

S6.OR.3 demonstrates vowel synthesis with variation in formant bandwidths within the pitch period (from subglottal loss and resonance), leading to frequency and amplitude modulation of the resonators. Includes an excellent audio demo of the improved naturalness of synthesized vowels. [Synthesis]

S6.PO1.7 combines disparate knowledge sources with a blackboard architecture. [LAFF]

S7.PO3.2 is Bill Edmondson's presentation of Pseudo Articulatory Representations for the speech signal. This is a conceptual scheme very similar to LAFF, in which the level of information representation just above the acoustic signal is an ensemble of feature tracks (8 of which are continuous in time and value, and 4 are binary: labial, coronal, strident, anterior). The level above that seems to be syllable templates with slots that hold values corresponding to the feature tracks (I'm not sure exactly how). I spent a long time talking with him about the implications of this scheme versus LAFF, and commiserating about the uphill battle we both face. I will definitely be staying in touch with him. [LAFF]

S7.PO3.8 and S7.PO3.15 both implement variable rate codecs, via frame redundancy coding or variable size frames. (This is relevant to my work for VCT.) [Coding]

Wednesday, 8 September: Conference day 3

Keynote speech: Mark Maybury, "Multimedia Interaction for the New Millennium." Fun to watch, lots of cool grand ideas, but not a lot of specifics for us.

Session S8.OR3 Wideband and Perceptual Coding is relevant for MPEG and other subband coding schemes. All of these papers are pretty interesting (to me anyway).

S8.OR3.1 Harmonic encoding at MOS 3.0 (compare to G.722 at 48 kbps, MOS 3.16). Innovative use of extra spectral shape data for high F0, when there are few harmonics to carry information. [Coding]

S8.OR3.2 Multiband CELP (8 bands, fixed codebooks) with dynamic bit allocation, using perceptual model for masking thresholds. [Coding]
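Dynamic bit allocation against masking thresholds is often done greedily: keep giving the next bit to whichever band's quantization noise sits furthest above its mask. The sketch below assumes that generic greedy scheme and the usual rule of thumb of roughly 6 dB less noise per added bit. It is not the paper's algorithm, it uses 4 bands rather than 8 for brevity, and the band levels are invented for illustration.

```python
def allocate_bits(signal_db, mask_db, total_bits, db_per_bit=6.02):
    """Greedily assign bits to the band whose quantization noise most exceeds its mask."""
    bits = [0] * len(signal_db)
    for _ in range(total_bits):
        # Noise-to-mask ratio per band: noise starts at signal level, drops ~6 dB per bit
        nmr = [s - db_per_bit * b - m for s, b, m in zip(signal_db, bits, mask_db)]
        worst = max(range(len(nmr)), key=nmr.__getitem__)
        if nmr[worst] <= 0:
            break  # all noise is already at or below the masking thresholds
        bits[worst] += 1
    return bits

# Hypothetical per-band signal levels and masking thresholds, in dB
signal_db = [60.0, 50.0, 30.0, 20.0]
mask_db = [40.0, 42.0, 35.0, 10.0]
bits = allocate_bits(signal_db, mask_db, total_bits=8)
```

In this example the third band gets no bits at all, since its signal is already below the masking threshold.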

S8.OR3.3 Nonlinear auditory model, providing substantial noise reduction. Good quality at 6 kbps, toll quality at 2 kbps. Good demo! [Coding]

S8.OR3.4 MPEG style high quality music encoding. Dynamic bit allocation with joint optimization of quality & bit rate. [Coding]

S8.OR3.5 Auditory model includes cochlear filtering and auditory nerve model. MOS 3.5 at 3.25 kbps, with no degradation up to SNR 15 dB. See www.ee.usf.hk/~eean for samples. [Coding]

S8.PO3 and S9.PO3 are on Speech Perception, including many interesting posters.

S8.PO3.4 Epenthetic insertion of vowels is prelexical, not a lexical phenomenon. [LAFF]

S8.PO3.6 Human labelers are not particularly consistent when segmenting syllables in the speech signal. [LAFF]

S8.PO3.7 Changes in F0 do not affect the ability of listeners to discriminate formants. [AcouPhon]

S9.PO3.2 When speech is transcribed in a noisy environment, grammatical context overrides speech information, and accuracy decreases (cf Hollein 92). [Sensimetrics]

S9.PO3.5 Formant transition masking in noise. Perception tends not to depend on the change of frequency, only on energy and final frequency. [PsycAcou]

S9.PO3.7 Evidence for symbolic phonemic coding before lexical contact. [LAFF]

S9.PO3.10 Priming phenomena provide evidence for caching during production. [AcouPhon]

Thursday, 9 September: Conference day 4

Keynote speech: Bjorn Lindblom, "How Speech Works." Interesting examination of speechlike phenomena in animals (vervet vocabulary, Japanese quail experiments). Posits a joint optimization effect, maximizing novel information transmitted to the listener while minimizing talker effort, combining motor theory with auditory coding notions.

S10.PO2.4 Rule based synthesis, but the rules are generated, tested and modified automatically. Excellent combination of acoustic modeling with statistical systems! [Synthesis]

S10.PO2.8 Innovative matching technique for orthography to phoneme conversion. [LAFF]

S10.PO2.11 Prosodic modeling for emotional content in Spanish synthesizer (concatenative synthesis). [Prosody]

S10.PO2.17 Neural network found to be better than rule based system for generating pronunciations, for most tasks. [Synthesis]

S10.PO3.8 Recognition of encoded speech is poor; matching system parameters such as frame rate is important. [Coding]

S11.OR2.2 Separating speech & music signals, using auditory modeling and iterative analysis by synthesis. Application to computational auditory scene analysis (CASA). [PsycAcou]

S11.OR2.3 Automatic music detection and separation from speech sources. Generalized Markov model (for combination of features) works much better than any simple linear combination. [Coding]

S11.PO2.8 Speech synthesis using nonlinear filters, like a neural net approach but allowing inspection (not a black box). [Synthesis]

S11.PO3.11 Hybrid of concatenative synthesis and LPC, using analysis to smooth parameters and resynthesize. [Synthesis]

S11.PO3.2 Noise invariant representation for speech signals, using modified group delay without loss of resolution. [Coding]

S11.PO3.4 Microphone arrays for speech acquisition, using a novel spatial reference beamformer. [Sensimetrics]

I did not get to see any of the afternoon presentations, because I set up my poster at the beginning of the afternoon slots, and it got enough attention that I stayed there, giving overviews and answering questions, straight through both sessions. Most people I spoke to were interested and positive, and quite a few were familiar with LAFF and expressed approval that the project was moving forward. A few attacked the LAFF idea, calling it primitive, outdated, and out of touch with the conventional wisdom of statistical modeling. Fine with me; at least we're attracting attention!

This page maintained by Wil Howitt
Last updated 18 September 1999