
Lip-Sync

Lipsync for Facial Animation
by Dave Merrill

Tools for generating phonetic transcriptions of audio files
Generating an Alignment
Concise Summary of making Facade Lipsync
WorldBet
Coarticulation Ideas

Tools for generating phonetic transcriptions of audio files

Magpie - a tool for manual phonetic transcription
BaldiSync - available as part of the CSLU Toolkit for speech research.
Discussion: Magpie is a good tool for doing manual phonetic transcriptions. The non-free version of Magpie also has some integration features for various animation packages. BaldiSync is part of a larger toolkit for speech research and application development. It does automatic phonetic transcription via a forced-alignment procedure: it applies speech-recognition techniques to the .wav file to find the phoneme boundaries.

Generating an Alignment

To generate an alignment, BaldiSync needs a .wav file and a transcription of the speech it contains. BaldiSync can output a .sob file, a binary format that holds both the audio data and the transcription information.

Download lerp.tcl, a Tcl/Tk script that lets you choose a .sob file and then generates a .lips file containing mouth parameters for every frame of your animation (assuming 30 fps). It also generates a correctly named .wav file. (Note: the libraries needed to run this Tcl script are installed with the CSLU Toolkit, but are not part of a default Tcl/Tk installation.)

As an example, I used the following .wav file, which comes with Magpie: vista.wav, a recording of Arnold saying "Hasta la Vista, Baby". The intermediate alignment looks like this:
MillisecondsPerFrame: 1.0
END OF HEADER
0 10 .pau
10 140 .pau
140 170 h
170 260 @
260 310 s
310 320 tc
320 370 th
370 390 &
390 510 .pau
510 620 l
620 630 A
630 670 .pau
670 740 v
740 860 I
860 980 s
980 1020 tc
1020 1080 th
1080 1120 ^
1120 1630 .pau
1630 1640 bc
1640 1650 b
1650 1840 ei
1840 1900 bc
1900 1910 b
1910 2050 i:
2050 2507 .pau
There's also a word-level transcription, so you can tell where the word boundaries fall in the transcription above:

MillisecondsPerFrame: 1.0
END OF HEADER
140 510 hasta
510 670 la
670 1630 vista
1630 2507 baby
I'm not sure the word-level information is as useful, but it could drive some kind of "follow the bouncing ball" visual display of what the face is currently saying.
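If you want to work with these alignments outside of lerp.tcl, the text format above is easy to read. Here is a minimal Python sketch, not part of Facade or the CSLU Toolkit. It assumes the layout is exactly as shown (a MillisecondsPerFrame header line, END OF HEADER, then "start end label" rows), and the names Segment, parse_alignment and phonemes_in_word are made up for illustration.

```python
from typing import List, NamedTuple


class Segment(NamedTuple):
    start_ms: float
    end_ms: float
    label: str


def parse_alignment(text: str) -> List[Segment]:
    """Parse the 'start end label' rows that follow END OF HEADER."""
    ms_per_frame = 1.0  # assumed default if the header line is missing
    in_header = True
    segments = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_header:
            if line.startswith("MillisecondsPerFrame:"):
                ms_per_frame = float(line.split(":", 1)[1])
            elif line == "END OF HEADER":
                in_header = False
            continue
        start, end, label = line.split(None, 2)
        # Scale the raw numbers by the header value to get milliseconds.
        segments.append(
            Segment(float(start) * ms_per_frame, float(end) * ms_per_frame, label)
        )
    return segments


def phonemes_in_word(word: Segment, phonemes: List[Segment]) -> List[str]:
    """Return labels of phoneme segments that fall inside a word's time span."""
    return [
        p.label
        for p in phonemes
        if p.start_ms >= word.start_ms and p.end_ms <= word.end_ms
    ]


# A few rows copied from the alignments above.
phoneme_text = """MillisecondsPerFrame: 1.0
END OF HEADER
510 620 l
620 630 A
630 670 .pau
670 740 v
"""
word_text = """MillisecondsPerFrame: 1.0
END OF HEADER
510 670 la
"""

phonemes = parse_alignment(phoneme_text)
words = parse_alignment(word_text)
print(phonemes_in_word(words[0], phonemes))  # ['l', 'A', '.pau']
```

Matching phonemes to words by time span like this is one way to get the per-word grouping a "bouncing ball" display would need.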

For the purposes of animation, the output you care about is the .lips file and the .wav file generated by lerp.tcl. The .lips file contains the speech-related parameters and their values for every frame of the utterance. The .wav file contains the actual sound data, which the program plays at animation time. Both files need to go into the Anims folder located in your project directory. You can then enable lipsync by typing the following into the GUI:

le myspeech.lips

where "myspeech.lips" is the name of your .lips file. You need to have an animation which is long enough to allow your utterance to play, and then you're ready to go!
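To get a rough idea of how long the animation needs to be, you can convert the length of the utterance into frames at 30 fps. This is only a back-of-the-envelope sketch: the function name frames_needed is made up, and the exact number of frames lerp.tcl writes may differ by one depending on how it rounds.

```python
import math


def frames_needed(duration_ms: float, fps: int = 30) -> int:
    """Smallest whole number of frames that covers the utterance."""
    return math.ceil(duration_ms * fps / 1000)


# The "Hasta la Vista, Baby" alignment above ends at 2507 ms.
print(frames_needed(2507))  # 76
```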

Here's what the .lips file looks like, in case you're interested:

k
p4 -0.5
p14 0.0
p15 0.0
p16 0.00670219
p17 0.0
p63 0.0
p64 0.0288841
p65 0.0
l66 0.0
r66 0.0
p67 0.0229465
p68 0.0
p70 0.0233635
p71 0.0
p72 0.0
p73 0.0
p74 -0.0267606
p75 0.020504
p76 -0.0372015
p77 -0.0345186
p78 -0.00849802
k
p4 -0.5
p14 0.0
p15 0.0
p16 0.00670219
...
Each k on a line by itself separates one frame of animation from the next. Each frame lists the parameters involved in lipsync and the value of each parameter for that frame.
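If you want to inspect a .lips file yourself, something like the following Python sketch would read it into one dictionary per frame. It is not part of Facade. It assumes the file looks exactly like the excerpt above (every frame starts with a k line, followed by "name value" lines), and the function name parse_lips is made up for illustration.

```python
from typing import Dict, List


def parse_lips(text: str) -> List[Dict[str, float]]:
    """Split .lips text into frames; each frame maps parameter name to value."""
    frames: List[Dict[str, float]] = []
    current = None
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "k":
            # A lone 'k' starts a new frame (assumption based on the excerpt).
            current = {}
            frames.append(current)
            continue
        if current is None:
            raise ValueError("parameter line found before the first 'k'")
        name, value = line.split()
        current[name] = float(value)
    return frames


# The first few lines of two frames, copied from the excerpt above.
sample = """k
p4 -0.5
p14 0.0
p16 0.00670219
k
p4 -0.5
p14 0.0
"""

frames = parse_lips(sample)
print(len(frames))       # 2
print(frames[0]["p16"])  # 0.00670219
```

The number of frames in the file also tells you how many frames your animation has to play to fit the whole utterance.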

Concise Summary of making Facade Lipsync

If you don't have the CSLU Toolkit yet, but you'd like to try out the lipsync capabilities of Facade, you can download hasta.lips and hasta.lips.wav. Put them both into the Anims folder, and start with the "Enable lipsync" step below.
Download and install the CSLU speech toolkit, if you haven't done so already.
Record a .wav file
Generate a .sob file from your .wav file (and the text of whatever is being said in the .wav file) in the BaldiSync program which comes with the toolkit.
Run the lerp.tcl Tcl/Tk script to generate a .lips file which contains lipsync params for every frame of the utterance, as well as a .wav file which is named correctly. Put your .lips file and the .wav file into the Anims folder.
Enable lipsync by typing "le mylipsfile.lips" into the Facade application, where mylipsfile.lips is the name of the .lips file generated in the preceding step.
Create an animation which plays enough frames to allow the entire utterance contained in your lipsync to play.
Press animate!

WorldBet

WorldBet is an ASCII encoding of IPA (International Phonetic Alphabet). Here is some WorldBet documentation:
WorldBet Reference Sheet
WorldBet Paper - this is the whole story, in case you're interested.
If you only have time to look at one thing, check out the WorldBet column of the WorldBet Reference Sheet. Creating phoneme targets for all of those would be a big task, but if you think about it, a whole group of phonemes (speech sounds) can produce the same viseme (face configuration). Consider "be" and "me" - they look identical. Consider "knee", "dee" and "tee" - again, visually identical. So the job isn't as big as it looks at first glance.
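One way to take advantage of this is a lookup table that maps each phoneme label to a shared viseme, so you only build one set of targets per group. The Python sketch below covers only the two examples just mentioned. The group names are made up, the labels are the ones that appear in the alignment earlier on this page (where "th" is the release of the t in "hasta"), and this is not how Facade stores its targets.

```python
# Hypothetical viseme groups; only the examples from the text are listed.
VISEME_GROUPS = {
    "lips_closed": ["b", "m"],          # "be" and "me" look identical
    "tongue_behind_teeth": ["n", "d", "th"],  # "knee", "dee" and "tee"
}

# Invert the table so each phoneme label points at its viseme.
PHONEME_TO_VISEME = {
    phoneme: viseme
    for viseme, phonemes in VISEME_GROUPS.items()
    for phoneme in phonemes
}


def viseme_for(phoneme: str) -> str:
    """Return the shared viseme, or the phoneme itself if it has no group."""
    return PHONEME_TO_VISEME.get(phoneme, phoneme)


print(viseme_for("b") == viseme_for("m"))  # True
print(viseme_for("i:"))                    # i:
```

Phonemes that are not in any group fall through unchanged, so you can fill in the table a bit at a time.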

Coarticulation Ideas

Coarticulation is the effect in natural speech whereby the surrounding phonemes affect the expression of the "surrounded" phoneme. Acoustically, this means that a given phoneme sounds different in different company. Visually, there is a similar effect: the facial position reached at the height of pronouncing a phoneme is affected by the articulatory demands of the preceding and following phonemes.

Currently, Facade handles coarticulation in almost the simplest way possible. Each phoneme has a set of target facial parameters, and the system linearly interpolates each parameter between the target values as time progresses. The parameters in the face hit the target values at the middle of the time period spanned by a single phoneme, unless the phoneme is one for which the mouth needs to be closed exactly when the phoneme starts (B and F, for example). In those cases, the face hits the targets at the onset of the phoneme.
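The scheme described above can be sketched in a few lines of Python. This is only an illustration of the idea, not the code in lerp.tcl or Facade: it tracks a single made-up parameter, the target values are invented, the labels "b" and "f" are my guess at how "B and F" would be written, and the names ONSET_PHONEMES, keyframes, sample and frame_values are all hypothetical.

```python
import math
from bisect import bisect_right
from typing import Dict, List, Tuple

FPS = 30
ONSET_PHONEMES = {"b", "f"}  # assumed labels for phonemes that hit targets at onset


def keyframes(segments: List[Tuple[float, float, str]],
              targets: Dict[str, float]) -> List[Tuple[float, float]]:
    """Place each phoneme's target at its midpoint, or at its onset if listed."""
    keys = []
    for start_ms, end_ms, label in segments:
        if label not in targets:
            continue  # no target defined for this label
        t = start_ms if label in ONSET_PHONEMES else (start_ms + end_ms) / 2.0
        keys.append((t, targets[label]))
    return keys


def sample(keys: List[Tuple[float, float]], t_ms: float) -> float:
    """Linearly interpolate the parameter value at time t_ms."""
    times = [k[0] for k in keys]
    i = bisect_right(times, t_ms)
    if i == 0:
        return keys[0][1]   # before the first target: hold it
    if i == len(keys):
        return keys[-1][1]  # after the last target: hold it
    (t0, v0), (t1, v1) = keys[i - 1], keys[i]
    w = (t_ms - t0) / (t1 - t0)
    return v0 + w * (v1 - v0)


def frame_values(keys: List[Tuple[float, float]], duration_ms: float) -> List[float]:
    """One interpolated value per frame at FPS frames per second."""
    n = math.ceil(duration_ms * FPS / 1000)
    return [sample(keys, frame * 1000.0 / FPS) for frame in range(n)]


# Two made-up segments: an open vowel, then "b" (mouth closed at onset).
segments = [(0.0, 100.0, "a"), (100.0, 200.0, "b")]
targets = {"a": 1.0, "b": 0.0}  # invented "mouth open" targets

keys = keyframes(segments, targets)
print(keys)                # [(50.0, 1.0), (100.0, 0.0)]
print(sample(keys, 75.0))  # 0.5
```

A real version would carry a whole set of parameters per phoneme instead of one number, but the interpolation for each parameter would look the same.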

Here are some coarticulation resources:
Paper by Cohen and Massaro about coarticulation, and their ideas about implementing it in a talking head. This page has some extra figures that didn't come through in the .ps or .pdf versions.
 

last updated Monday, March 19, 2001 by nico