Initial Audio Annotation Session

Posted by deanna

Liban and I recently completed our first audio annotation session, in which we applied a set of controlled prosodic terms to a selection of audio files prepared from our corpus. Our goal is to build a large enough set of examples for each prosodic feature that Frank Rudicisz can write an algorithm to identify those features automatically in any given audio recording.

As a test data set, we created five-minute selections of audio from six different readings: George Bowering (1974), Robert Creeley (1967), Irving Layton (1967), Margaret Atwood (1974), Dorothy Livesay (1971) and Gwendolyn MacEwen (1966). We edited these files in the open-source software Audacity, omitting all extrapoetic speech so that only the readings of the poems remained.

To annotate the audio, we opened the files in Praat, a free scientific software package for the analysis of speech in phonetics. Praat generates a text grid that allows the user to annotate the audio on several different tracks or “tiers”, which for our purposes were “Text” (i.e. the text of the poem being read), “Pitch”, “Amplitude”, “Tempo”, “Quality” and “Symmetry.” On those tiers, the following annotations were used:



Rising Inflection

Falling Inflection

Peaking Inflection (rise/fall)

Dipping Inflection (fall/rise)




Pitch Variation



Fast Tempo

Slow Tempo






Force (pronounced force)








Rolled R



(Symmetry was a category we added as we went along, and I’m not sure how useful it will be, but we’re using it here to compare lines or line segments that are delivered in similar or contrasting ways, e.g. with the repetition of a phrase or chorus.)

Parallel Symmetry

Mirrored Symmetry
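For anyone who wants to set up the same tiers without clicking through Praat, here is a minimal sketch that writes an empty text grid with our six tier names. The function name `empty_textgrid` and the 300-second duration are assumptions for illustration only. The layout follows Praat's long TextGrid text format as I understand it, and this is not necessarily how our own files were created.

```python
# Tier names are taken from the post; everything else is illustrative.
TIER_NAMES = ["Text", "Pitch", "Amplitude", "Tempo", "Quality", "Symmetry"]


def empty_textgrid(duration_s, tier_names=TIER_NAMES):
    """Return the text of a Praat TextGrid with one blank interval per tier."""
    lines = [
        'File type = "ooTextFile"',
        'Object class = "TextGrid"',
        "",
        "xmin = 0",
        f"xmax = {duration_s}",
        "tiers? <exists>",
        f"size = {len(tier_names)}",
        "item []:",
    ]
    for i, name in enumerate(tier_names, start=1):
        lines += [
            f"    item [{i}]:",
            '        class = "IntervalTier"',
            f'        name = "{name}"',
            "        xmin = 0",
            f"        xmax = {duration_s}",
            # A single empty interval spanning the whole file, ready to be split.
            "        intervals: size = 1",
            "        intervals [1]:",
            "            xmin = 0",
            f"            xmax = {duration_s}",
            '            text = ""',
        ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # A five-minute selection is 300 seconds long.
    print(empty_textgrid(300))
```

Saving the printed text as a `.TextGrid` file next to the audio should let the pair be opened together in Praat.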


Some of the terms were fairly self-evident and easy to identify, while others were harder to pin down. The first thing we noticed was that it was very difficult to annotate for more than one category at a time, and so we ended up doing one listen-through just for pitch, and another for amplitude and tempo. The other big question was how long a segment of audio to annotate, and whether it’s more helpful to have precise, one-word examples or line-length, sustained examples for the development of our algorithms. This often led to some tension between wanting to annotate a single word as having a rising inflection vs. a phrase as having a peaking (rise/fall) inflection, etc. One solution to this problem might be to have two pitch tiers instead of one: one for word-length pitch modulation and one for sustained modulation.
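To make the two-tier idea concrete, here is a rough sketch that sorts existing pitch annotations into a word-length tier and a sustained tier by duration. The `Annotation` class, the tier names and the 0.8-second cutoff are all hypothetical; we have not settled on any of them.

```python
from dataclasses import dataclass


@dataclass
class Annotation:
    start: float  # seconds from the start of the file
    end: float    # seconds from the start of the file
    label: str    # e.g. "Rising Inflection"


def split_pitch_tiers(annotations, max_word_s=0.8):
    """Sort pitch annotations into two tiers by duration.

    max_word_s is a placeholder cutoff, not a measured value.
    """
    word, sustained = [], []
    for a in annotations:
        if a.end - a.start <= max_word_s:
            word.append(a)
        else:
            sustained.append(a)
    return {"Pitch (word)": word, "Pitch (sustained)": sustained}


if __name__ == "__main__":
    sample = [
        Annotation(0.6, 1.0, "Rising Inflection"),
        Annotation(1.0, 3.5, "Peaking Inflection"),
    ]
    for tier, items in split_pitch_tiers(sample).items():
        print(tier, [a.label for a in items])
```

With two tiers, the same stretch of audio could carry both a word-level rising inflection and a phrase-level peaking inflection without forcing a choice between them.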


Another question was how often to annotate, and how precisely or loosely an example should fit into our predetermined categories in order to be worth annotating. For example, if a word or phrase has a slight dip in the middle of its otherwise monotone delivery, should it still be used as an example?
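One way to turn the "slight dip" question into something testable is to measure how deep a dip is in semitones and annotate only when it passes a cutoff. The sketch below assumes we already have a list of pitch values in Hz for the word or phrase (voiced frames only). The function names and the 2-semitone cutoff are placeholders of mine, not anything the project has agreed on.

```python
import math


def dip_depth_semitones(pitch_hz):
    """Depth of the lowest interior point below the lower of the two endpoints.

    Expects voiced frames only (all values > 0). Returns 0.0 if the
    contour never drops below its endpoints.
    """
    if len(pitch_hz) < 3:
        raise ValueError("need at least three pitch samples")
    if min(pitch_hz) <= 0:
        raise ValueError("remove unvoiced (non-positive) frames first")
    edge = min(pitch_hz[0], pitch_hz[-1])
    lowest = min(pitch_hz[1:-1])
    # 12 * log2(f1 / f2) is the interval between two frequencies in semitones.
    return max(0.0, 12 * math.log2(edge / lowest))


def worth_annotating(pitch_hz, min_depth=2.0):
    """True if the dip is at least min_depth semitones (hypothetical cutoff)."""
    return dip_depth_semitones(pitch_hz) >= min_depth


if __name__ == "__main__":
    slight = [200, 198, 190, 199, 200]   # under one semitone of dip
    marked = [200, 180, 150, 185, 205]   # roughly five semitones of dip
    print(worth_annotating(slight), worth_annotating(marked))
```

Whatever cutoff we pick would need to be checked against what annotators actually hear as a dip.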


Going forward, one question that came up was whether participants should be able to see the sound waves and pitch curves in Praat, or whether they should rely exclusively on their own aural interpretation. I found that looking at the pitch curve often swayed my interpretation of the audio, and had me inadvertently “skipping forward” visually to see whether there were significant variations in the pitch curve. It might be an interesting control to let half our participants see the visual representation of the sound while the other half do a “blind listening.”
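If we do try that control, the split should probably be random rather than chosen by hand. A minimal sketch, assuming a simple list of participant names; the function name, the condition labels and the fixed seed are hypothetical:

```python
import random


def assign_conditions(participants, seed=0):
    """Randomly split participants into a visual group and a blind-listening group.

    A fixed seed makes the assignment reproducible. With an odd number of
    participants, the extra person goes to the blind group.
    """
    rng = random.Random(seed)
    shuffled = list(participants)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return {"visual": sorted(shuffled[:half]), "blind": sorted(shuffled[half:])}


if __name__ == "__main__":
    groups = assign_conditions(["P1", "P2", "P3", "P4", "P5", "P6"])
    print(groups)
```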
