Copyright 1998 ACM ASSETS '98. Appears in
Proceedings of The Third International ACM SIGCAPH Conference on Assistive Technologies, April 15-17, 1998, Marina del Rey, CA, USA

Automatic Babble Recognition
for Early Detection of Speech Related Disorders

Harriet J. Fell
College of Computer Science
Northeastern University
Boston, Massachusetts 02115, USA
Tel: +1-617-373-2198
Joel MacAuslan
Karen Chenausky
Speech Technology and Applied Research
Lexington, Massachusetts 02173, USA
Tel: +1-781-863-0310
Linda J. Ferrier
Department of Speech-Language
Pathology and Audiology
Northeastern University
Boston, Massachusetts 02115, USA
Tel: +1-617-373-5754


We have developed a program, the Early Vocalization Analyzer (EVA), that automatically analyzes digitized recordings of infant vocalizations. The purpose of such a system is to automatically and reliably screen infants who may be at risk for later communication problems. Applying the landmark detection theory that Stevens et al. developed for recognizing features in adult speech, EVA detects syllables in vocalizations produced by typically developing six- to thirteen-month-old infants. We discuss the differences between adult-specific code and code written to analyze infant vocalizations and present the results of validity testing.

1.1 Keywords
infants - pre-speech vocalization - acoustic analysis - early intervention.


There is considerable research to support the position that infant vocalizations are effective predictors of later articulation and language abilities [13, 15, 19, 8]. Intervention to encourage babbling activity in at-risk infants is frequently recommended. However, research and clinical diagnosis of delayed or reduced babbling have so far relied on time-consuming and often unreliable perceptual analyses of tape-recorded infant sounds. While acoustic analysis of infant sounds has provided important information on the early characteristics of infant vocalizations and cry [1, 21], this information has not yet been used to carry out automatic analysis. We are developing a program, EVA, that automatically analyzes digitized recordings of infant vocalizations.


To study the babbling or prelinguistic non-cry utterances of typically developing infants, it is necessary to chart an infant's progress and compare it to a developmental framework. Oller [17] and Oller and Lynch [18] describe a suitable framework comprising five stages of babbling:
  1. The Phonation Stage (0 to 2 months) quasi-resonant or quasi-vocalic sounds
  2. The Primitive Articulation Stage (1 to 4 months) appearance of primitive syllables combined with quasi-vocalic sounds
  3. The Expansion Stage (3 to 8 months) open vowels, squeals and growls, yells and whispers, raspberries
  4. The Canonical Syllable Stage (5 to 10 months) well formed syllables and reduplicated sequences of such syllables (We have found that 6 to 12 months is a more accurate range.)
  5. The Integrative or Variegated Stage (9 to 18 months) meaningful speech, mixed babbling and speech
Infants' well formed syllables consist of closants (consonant-like phonemes produced by oral cavity constrictions) and vocants (vowel-like phonemes, i.e., voiced, unconstricted segments).


Our first version of EVA was developed on a Macintosh computer using SoundScope by GW Instruments. It proved to be a potentially valid tool for counting the number of infant prelinguistic utterances at ages 4 and 6 months and for categorizing them into three levels of fundamental frequency and three lengths of duration [7]. This simple analysis was appropriate for classifying babbles in Oller's "Expansion Stage" (3 to 8 months): open vowels, squeals and growls, yells and whispers, raspberries.

Our human judge and EVA showed better agreement (93%) in counting the number of utterances than the trained phoneticians in the Oller and Lynch study [18]. EVA and our human judge agreed on 80% of their categorizations according to duration, and 87% of their categorizations according to frequency, of the 411 commonly recognized utterances.


Infant pre-speech vocalizations are included in many general infant assessment scales [2, 5] and have been suggested as important indices of early development [20, 14]. Mastery of syllabic utterances with a consonantal boundary appears to be an important developmental step that is a powerful predictor of later communication skills [9, 16]. Since this skill has its roots in cognitive, motor, social, and linguistic domains as well as the sensory area of hearing, it is a sensitive measure of the infant's development.

In the current phase of our project, we are developing tools to classify babbles in Oller's "Canonical Syllable Stage" (6 to 11 months): well formed syllables and reduplicated sequences of such syllables. There are several aspects of these vocalizations that we plan to analyze automatically:

  1. Detecting and Classifying syllables (CV, VC, CVC, V)
  2. Identifying reduplicated sequences of such syllables
  3. Classifying closants by manner of articulation (e.g., stop, fricative, nasal, or liquid)
  4. Classifying vocants as central versus peripheral [3]
In this paper we report on our work on the first part of this analysis, detecting syllables. We present our methods and the results of validity testing.


Our syllable-detection software is built on a program written by Liu [12] to detect landmarks (à la Stevens [22]) in adult speech. It runs on a Sun SPARC workstation equipped with the Entropic Signal Processing System (ESPS) software library environment with Waves [6].

The Liu-Stevens Landmark Detection Program was developed as part of an adult-speech recognition system for analysis of continuous speech [11]; that system is founded on Stevens' acoustic model of speech production [23]. Central to this theory are landmarks, points in an utterance around which listeners extract information about the underlying distinctive features. They mark perceptual foci and articulatory targets. The most common landmarks are acoustically abrupt and are associated with consonantal segments, e.g., stop closures and releases. Such consonantal closures are mastered by infants when they produce syllabic utterances. Liu's program was tested on a low-noise (signal-to-noise ratio = 30 decibels) database, LAFF, of four adult speakers speaking 20 syntactically correct sentences. Error rates were determined for three types of landmarks: glottis, sonorant, and burst. The overall error rate was 15%, with the best performance for glottis landmarks (5%).

The Liu-Stevens program first sends speech input through a general processing stage in which a spectrogram is computed and divided into six frequency bands. Then coarse- and fine-processing passes are executed. In each pass, an energy waveform is constructed in each of the six bands, the time derivative of the energy is computed, and peaks in the derivative are detected. Localized peaks in time are found by matching peaks from the coarse- and fine-processing passes. These peaks represent times of abrupt spectral change in the six bands.
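The peak-finding steps described above can be illustrated with a minimal Python sketch. It is not Liu's code. It assumes that one band's energy track (in dB, one value per frame) has already been computed, and the function names, lag values, threshold, and matching tolerance are hypothetical choices made only for illustration.

```python
def energy_derivative(energy_db, lag):
    """Symmetric difference of a band-energy track (edges are clamped).

    A large lag gives a smooth "coarse" derivative; a small lag gives a
    sharper "fine" derivative. The lag values are assumptions.
    """
    n = len(energy_db)
    return [energy_db[min(i + lag, n - 1)] - energy_db[max(i - lag, 0)]
            for i in range(n)]


def find_peaks(deriv, threshold_db):
    """Frame indices where |deriv| has a local maximum at or above threshold.

    On a flat-topped peak only the left edge is reported. The sign of
    deriv at a peak (not returned here) tells a rise from a fall.
    """
    peaks = []
    for i in range(1, len(deriv) - 1):
        mag = abs(deriv[i])
        if (mag >= threshold_db and mag > abs(deriv[i - 1])
                and mag >= abs(deriv[i + 1])):
            peaks.append(i)
    return peaks


def localize(coarse_peaks, fine_peaks, tolerance):
    """For each coarse peak, keep the nearest fine peak within tolerance frames."""
    matched = []
    for c in coarse_peaks:
        near = [f for f in fine_peaks if abs(f - c) <= tolerance]
        if near:
            matched.append(min(near, key=lambda f: abs(f - c)))
    return matched


# Toy band-energy track: silence, a 20 dB burst of energy, silence.
energy = [0.0] * 10 + [20.0] * 10 + [0.0] * 10
coarse = find_peaks(energy_derivative(energy, lag=4), threshold_db=10.0)
fine = find_peaks(energy_derivative(energy, lag=1), threshold_db=10.0)
localized = localize(coarse, fine, tolerance=4)
```

In this toy track the coarse pass finds each abrupt change a few frames early, and matching against the fine pass moves the estimate next to the true onset and offset frames.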

In type-specific processing, the localized peaks direct processing to find three types of landmarks. These three types are:

  1. g(lottis), which marks the time when the vocal folds transition from freely vibrating to not freely vibrating or vice versa.
  2. s(onorant), which marks sonorant consonantal closures and releases, such as nasals.
  3. b(urst), which designates stop or affricate bursts and points where aspiration or frication ends due to a stop closure.
We have modified this program to accommodate the particular acoustic characteristics of infant speech and the signal-to-noise ratio in our recordings.

The first change was to adjust the boundaries of the six frequency bands to better capture abrupt changes in F0, F2, and F3 (F1 is unused). Using ranges for formants in infant vocalizations cited in the literature [10, 4, 9], we set the bounds on the six frequency bands as shown in Table 1. Additionally, we created a seventh band, composed of the union of bands three through six, for future work in detecting burst landmarks. (Bursts are not detected in these infant recordings as reliably as they had been in Liu's adult recordings.)

Band  Adult Male    Infant        Purpose
      F0 ~ 150Hz    F0 ~ 400Hz
1                   150-600Hz     To capture F0.
      F1 ~ 500Hz    F1 ~ 1000Hz   Ignore F1.
2     800-1500Hz                  For intervocalic consonantal segments a zero is
                                  introduced in this range: "Bands 2 and 3 overlap
                                  in the hope that one of these bands will capture
                                  a spectral prominence." [12] At a sonorant
                                  consonantal closure, spectral prominences above
                                  F1 show a marked abrupt decrease in energy.
      F2 ~ 1500Hz   F2 ~ 3000Hz
3                   1800-3000Hz   Onsets and offsets of aspiration and frication
                                  will lie in at least one of these four bands.
4
5
      F3 ~ 2500Hz   F3 ~ 5000Hz
6     5000-8000Hz   6000-8000Hz   Spans the remaining frequency up to 8000Hz.
7                   1800-8000Hz   A threshold might be used on this band to
                                  detect +b/-b landmarks.

Table 1: Spectral Bands Used for Landmark Detection (Adult vs. Infant)

We use a high-pass filter (cutoff: 150Hz) on our source files before applying the landmark program. This lowers the interference from ambient noise.
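The text gives only the cutoff frequency, not the filter design. As a sketch under that limitation, the following Python function applies a simple first-order (RC-style) high-pass filter at 150Hz. The filter order, the function name `high_pass`, and the sample rate used in the example are assumptions, not details from our implementation.

```python
import math


def high_pass(samples, sample_rate_hz, cutoff_hz=150.0):
    """First-order high-pass filter (discrete RC approximation).

    Attenuates content well below cutoff_hz, such as low-frequency room
    noise, while leaving higher-frequency speech energy largely intact.
    """
    if not samples:
        return []
    rc = 1.0 / (2.0 * math.pi * cutoff_hz)   # analog time constant
    dt = 1.0 / sample_rate_hz                # sampling interval
    alpha = rc / (rc + dt)
    out = [samples[0]]
    for i in range(1, len(samples)):
        out.append(alpha * (out[-1] + samples[i] - samples[i - 1]))
    return out


# Example with an assumed 16 kHz sample rate: a constant (DC) offset decays away.
filtered = high_pass([1.0] * 1000, 16000)
```

A steeper filter would reject more noise just below the cutoff; the first-order form is used here only because it is short and easy to verify.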

We were not satisfied with the initial marking of voicing (+g/ -g) done by Liu's program on our digitized samples. This algorithm works by looking only at the energy in Band 1. It assumes that high energy in this band indicates the presence of voicing. Our infant vocalizations usually exhibited lower energy than the adult male samples in the LAFF and TIMIT databases used by Liu. The ESPS get_f0 function [6] (which measures periodicity to calculate F0 in voiced regions) appeared to be more reliable at finding the voiced parts of the infant signals. We integrated this with Liu's program by multiplying Liu's coarse-pass Band 1 energy by the "probability of voicing" returned by get_f0 (a 0/1 value).

Experimentation with get_f0 resulted in settings of Min F0 = 150Hz and Max F0 = 1200Hz. (An earlier study [7] of four infants found an average F0 between 290Hz and 320Hz.) Utterances with fundamental frequency in this range were handled appropriately by the landmark software. We did not attempt to handle squeals, i.e., vocalizations with F0 > 1200Hz. The lower threshold of 150Hz was sufficient to filter out noise. This left sounds that might be classified as growls but that were nonetheless suitable for analysis by the landmark program.
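The voicing gate can be sketched in a few lines of Python. The sketch assumes per-frame lists of Band 1 energy, the 0/1 voicing value, and the F0 estimate are already available. It does not call ESPS, and the function name and the explicit range check are illustrative stand-ins for passing the F0 limits to the pitch tracker.

```python
MIN_F0_HZ = 150.0    # lower F0 limit from the text
MAX_F0_HZ = 1200.0   # upper F0 limit from the text (squeals are excluded)


def gate_band1_energy(band1_energy, prob_voicing, f0_hz):
    """Multiply coarse-pass Band 1 energy by the 0/1 voicing value.

    Frames whose F0 estimate falls outside the accepted range are treated
    as unvoiced. This range check is an assumption of the sketch.
    """
    gated = []
    for energy, voiced, f0 in zip(band1_energy, prob_voicing, f0_hz):
        in_range = MIN_F0_HZ <= f0 <= MAX_F0_HZ
        gated.append(energy * voiced if in_range else 0.0)
    return gated


# Four frames: voiced, squeal (F0 too high), unvoiced, voiced.
gated = gate_band1_energy([10.0, 12.0, 8.0, 9.0],
                          [1, 1, 0, 1],
                          [300.0, 1500.0, 0.0, 200.0])
```

Only the first and last frames keep their energy, so a later +g/-g decision based on Band 1 energy sees voicing only where periodicity was detected.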

The program applies a variety of rules to check and possibly modify the initial +g/-g settings. Adult rules and infant rules differ for two reasons. First, in data samples of an adult male reading single sentences, no pauses were expected, so the original code inserted a +g/-g pair into any -g/+g interval of duration greater than 150ms. In a ten-second sequence of infant babbles, there are likely to be pauses that are at least 150ms long, so we adopted 350ms for this insertion rule. Second, we adjusted thresholds related to vocalic energy levels to accommodate the faint babbles uttered by some of our infants.
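The effect of the two thresholds in the insertion rule can be shown with a small Python sketch. It only flags the unvoiced (-g to +g) intervals that exceed a threshold. Where a +g/-g pair would be placed inside a flagged interval is not specified here, so that step is left out. The landmark representation and the function name are hypothetical.

```python
ADULT_GAP_MS = 150    # threshold in the original adult code
INFANT_GAP_MS = 350   # threshold adopted for infant babbles


def long_unvoiced_gaps(landmarks, max_gap_ms):
    """Return (start_ms, end_ms) for each -g to +g interval longer than max_gap_ms.

    landmarks is a time-ordered list of (time_ms, kind) pairs, where kind
    is "+g" (voicing onset) or "-g" (voicing offset).
    """
    gaps = []
    for (t0, kind0), (t1, kind1) in zip(landmarks, landmarks[1:]):
        if kind0 == "-g" and kind1 == "+g" and t1 - t0 > max_gap_ms:
            gaps.append((t0, t1))
    return gaps


# Three babbles separated by pauses of 250 ms and 500 ms.
marks = [(0, "+g"), (200, "-g"), (450, "+g"),
         (700, "-g"), (1200, "+g"), (1500, "-g")]
adult_flags = long_unvoiced_gaps(marks, ADULT_GAP_MS)
infant_flags = long_unvoiced_gaps(marks, INFANT_GAP_MS)
```

With the adult threshold both pauses are flagged as suspect, while the infant threshold accepts the 250ms pause as a genuine silence and flags only the longer one.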


We measured the extent to which the landmarks found by EVA to mark syllable boundaries corresponded with distinctions made by trained human listeners (the judges).

We started with an inter-judge reliability study to assure the consistency of independent hand-marking of spectrograms by the judges. We then conducted a small study comparing the results of EVA to the landmarks agreed on by the judges.

7.1 Subjects
Five subjects were enlisted in the study: four typically developing and one with hydrocephaly, surgically rectified immediately after birth but with some slight gross motor delay. The typically developing subjects comprised three males and one female; one male was African American. All infants had English-speaking parents. (In addition, we collected one sample from each of three typically developing children at ages 12, 13, and 14 months to allow working on the software and testing it on a small number of syllabic utterances.) See Table 2.

Subject  Sex   Months recorded
TW       boy   x  .  x  x  .  x  .
EK       boy   x  .  x  x  .  x  x
NZ       boy   .  x  x  .  x  x  .
JD       boy   .  .  .  x  x  x  x
LS       girl  .  x  x  .  x  x  .

x - recorded, . - not recorded

Table 2: Subjects and Months Recorded

7.2 Data collection procedures
Parents of the normal infants interested in participating in the study were initially contacted by phone and biographical data was collected. Infants with early medical problems, including histories of middle ear infections, were eliminated.

Parents filled out an Infant State Form to testify that their infant had followed a normal schedule and was in an appropriate mood for recording. Parents who reported that their infant was in an atypical state were asked to reschedule. The infants were then recorded interacting with their parents in a sound-proof booth. Digital recordings were made using a high quality lavaliere microphone with a wireless transmitter and receiver. The following equipment was used for recording: