Lipsync for Facial Animation
by Dave Merrill
Magpie - a tool for manual phonetic transcription
BaldiSync - available as part of the CSLU Toolkit for speech research
Discussion: Magpie is a good tool for doing manual phonetic
transcriptions. The non-free version of Magpie also has some integration
features for various animation packages. BaldiSync is part of a larger toolkit
for speech research and application development. It does automatic
phonetic transcription via a forced-alignment procedure: it applies
speech-recognition techniques to the .wav file to find the phoneme boundaries.
In order to generate an alignment, BaldiSync needs a .wav file
and a transcription of the speech contained in that file. BaldiSync can
output a .sob file, a binary format that contains both the audio data
and the transcription information. Download lerp.tcl, a Tcl/Tk script that
lets you choose a .sob file and then generates a .lips file containing mouth
parameters for every frame of your animation (assuming 30fps). It also
generates a correctly named .wav file. (Note: the libraries needed to run this
Tcl script are installed with the CSLU Toolkit, but are not part of a default
Tcl/Tk installation.)
I used the following .wav file, which comes with Magpie, as an
example: vista.wav, a recording of Arnold saying "Hasta
la Vista, Baby". The intermediate alignment looks like this:
MillisecondsPerFrame: 1.0
END OF HEADER
0 10 .pau
10 140 .pau
140 170 h
170 260 @
260 310 s
310 320 tc
320 370 th
370 390 &
390 510 .pau
510 620 l
620 630 A
630 670 .pau
670 740 v
740 860 I
860 980 s
980 1020 tc
1020 1080 th
1080 1120 ^
1120 1630 .pau
1630 1640 bc
1640 1650 b
1650 1840 ei
1840 1900 bc
1900 1910 b
1910 2050 i:
2050 2507 .pau
There's also a word-level transcription, so you can tell
where the word boundaries are in the above transcription:
MillisecondsPerFrame: 1.0
END OF HEADER
140 510 hasta
510 670 la
670 1630 vista
1630 2507 baby
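If you want to read these alignment listings in your own code, here is a minimal Python sketch. It is not part of the CSLU Toolkit or lerp.tcl; the function name parse_alignment is made up for illustration, and treating MillisecondsPerFrame as a scale factor on the start and end columns is my assumption based on the header shown above.

```python
def parse_alignment(text):
    """Parse an alignment listing into (start_ms, end_ms, label) tuples.

    Assumes the layout shown above: header lines, then "END OF HEADER",
    then one "start end label" entry per line.
    """
    ms_per_unit = 1.0  # assumed meaning of the MillisecondsPerFrame header
    segments = []
    in_header = True
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if in_header:
            if line == "END OF HEADER":
                in_header = False
            elif line.startswith("MillisecondsPerFrame:"):
                ms_per_unit = float(line.split(":", 1)[1])
            continue
        start, end, label = line.split(None, 2)
        segments.append((float(start) * ms_per_unit,
                         float(end) * ms_per_unit,
                         label))
    return segments


sample_text = """MillisecondsPerFrame: 1.0
END OF HEADER
140 510 hasta
510 670 la
670 1630 vista
1630 2507 baby
"""
words = parse_alignment(sample_text)
```

The same function works for the phoneme-level listing, since both use the same three-column layout.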
I'm not sure the word-level information is as useful for animation,
but it could be used for some kind of "follow the bouncing ball" visual
display of what the face is currently saying.
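As a rough illustration of the "bouncing ball" idea, the sketch below looks up which word is being spoken at a given playback time. The word boundaries are copied from the word-level transcription above; the function name word_at and the choice to treat each interval as start-inclusive and end-exclusive are assumptions of mine.

```python
# Word boundaries in milliseconds, taken from the word-level listing above.
WORDS = [
    (140, 510, "hasta"),
    (510, 670, "la"),
    (670, 1630, "vista"),
    (1630, 2507, "baby"),
]


def word_at(words, t_ms):
    """Return the word whose interval contains t_ms, or None during silence."""
    for start, end, word in words:
        if start <= t_ms < end:
            return word
    return None
```

A display loop could call word_at once per frame and highlight the returned word.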
For the purposes of animation, the output you care about is the .lips file
and the .wav file generated by lerp.tcl. The .lips file contains the
speech-related parameters and their values for every frame of the utterance.
The .wav file contains the actual sound data, which the program plays at
animation time. Both files need to go into the Anims folder located in your
project directory. Then you can enable lipsync by typing into the GUI:
le myspeech.lips
where "myspeech.lips" is the name of your .lips file.
You need an animation that is long enough for your utterance to
play, and then you're ready to go!
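To estimate how long the animation needs to be, you can convert the utterance length to frames at 30fps. This is only a back-of-the-envelope sketch; the helper frames_needed is hypothetical, and lerp.tcl may round differently when it writes the .lips file.

```python
import math


def frames_needed(duration_ms, fps=30):
    """Smallest whole number of frames that covers duration_ms at the given rate."""
    return math.ceil(duration_ms * fps / 1000)


# The "Hasta la Vista, Baby" alignment above ends at 2507 ms.
vista_frames = frames_needed(2507)
```

For the example utterance this comes to 76 frames, so the animation should play at least that many.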
Here's what the .lips file looks like, in case you're interested:
k
p4 -0.5
p14 0.0
p15 0.0
p16 0.00670219
p17 0.0
p63 0.0
p64 0.0288841
p65 0.0
l66 0.0
r66 0.0
p67 0.0229465
p68 0.0
p70 0.0233635
p71 0.0
p72 0.0
p73 0.0
p74 -0.0267606
p75 0.020504
p76 -0.0372015
p77 -0.0345186
p78 -0.00849802
k
p4 -0.5
p14 0.0
p15 0.0
p16 0.00670219
...
The k's on a line by themselves delimit one frame of animation
from the next. Within a single frame is information about the parameters
involved in lipsync, and the associated values for these parameters.
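If you'd like to inspect a .lips file programmatically, a minimal Python sketch follows. It assumes only what the listing above shows: a line containing just "k" starts a new frame, and every other line is a parameter name followed by a value. The function name parse_lips is made up, and the real format may contain things this sketch doesn't handle.

```python
def parse_lips(text):
    """Parse .lips text into a list of {parameter_name: value} dicts, one per frame."""
    frames = []
    for raw in text.splitlines():
        line = raw.strip()
        if not line:
            continue
        if line == "k":
            # A lone "k" starts the next frame.
            frames.append({})
            continue
        if not frames:
            raise ValueError("parameter line before the first 'k' delimiter")
        name, value = line.split()
        frames[-1][name] = float(value)
    return frames


sample_lips = """k
p4 -0.5
p16 0.00670219
l66 0.0
r66 0.0
k
p4 -0.5
p16 0.00670219
"""
frames = parse_lips(sample_lips)
```

Each entry in frames corresponds to one frame of animation, so len(frames) is the frame count of the utterance.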
If you don't have the CSLU Toolkit yet, but you'd like
to try out the lipsync capabilities of Facade, you can download hasta.lips and
hasta.lips.wav. Put them both into the
Anims folder, and start with the "Enable lipsync" step below.
1. Download and install the CSLU speech
toolkit, if you haven't
done so already.
2. Record a .wav file.
3. Generate a .sob file from your .wav file (and the text of
whatever is being said in the .wav file) in the BaldiSync program, which comes
with the toolkit.
4. Run the lerp.tcl Tcl/Tk script to
generate a .lips file, which contains lipsync parameters for every frame of the
utterance, as well as a correctly named .wav file. Put your .lips
file and the .wav file into the Anims folder.
5. Enable lipsync by typing "le mylipsfile.lips" into
the Facade application, where mylipsfile.lips is the name of the .lips file
generated in the preceding step.
6. Create an animation that plays enough frames to allow the
entire utterance contained in your lipsync to play.
7. Press animate!
WorldBet is an ASCII encoding of the IPA (International Phonetic
Alphabet). Here is some WorldBet documentation.
If you only have time to look at one thing, check out the
WorldBet column of the WorldBet Reference Sheet. Creating phoneme targets for
all of those would be a big task, but if you think about it, groups of phonemes
(speech sounds) can all produce the same viseme (face configuration). Consider
"be" and "me": they look identical. Consider
"knee", "dee" and "tee": again, visually
identical. So the job isn't as big as it looks at first glance.
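One way to picture this grouping is a many-to-one lookup from phoneme to viseme. The sketch below covers only the two groups mentioned above; the viseme names are invented for illustration and are not Facade's or WorldBet's, and the fallback of treating an unlisted phoneme as its own viseme is an assumption.

```python
# Hypothetical viseme names mapped to the phonemes that share them.
VISEME_GROUPS = {
    "lips_closed": ["b", "m"],         # "be" and "me"
    "tongue_behind_teeth": ["n", "d", "t"],  # "knee", "dee" and "tee"
}

# Invert the grouping so each phoneme points at its viseme.
PHONEME_TO_VISEME = {
    phoneme: viseme
    for viseme, phonemes in VISEME_GROUPS.items()
    for phoneme in phonemes
}


def viseme_for(phoneme):
    """Return the shared viseme, or the phoneme itself if it has no group yet."""
    return PHONEME_TO_VISEME.get(phoneme, phoneme)
```

With a table like this, you only need one set of facial targets per viseme rather than one per phoneme.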
Coarticulation is the effect in natural speech
whereby the surrounding phonemes affect the expression of the
"surrounded" phoneme. Acoustically, this means that a given phoneme
will sound different when it's in different company. Visually,
there is a similar effect: the facial position reached at the height of
pronouncing a given phoneme is affected by the articulatory demands of
the preceding and following phonemes.
Currently, Facade handles
coarticulation in almost the simplest way possible. There is a set of target
facial parameters associated with each phoneme, and the system linearly
interpolates each parameter between the target values as time progresses. The
parameters in the face hit the target values at the middle of the time
period spanned by a single phoneme, unless the phoneme is one for which the
mouth needs to be closed exactly when the phoneme starts (B and F, for example).
In those cases, the face hits the targets at the onset of the phoneme.
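The interpolation scheme just described can be sketched in a few lines of Python. This is my own reading of the description, not Facade's or lerp.tcl's actual code: it tracks a single made-up parameter, the target values are invented, and the lowercase labels "b" and "f" for the onset phonemes are an assumption.

```python
from bisect import bisect_right

# Phonemes whose targets are hit at onset instead of mid-phoneme
# (the text names B and F; these labels are assumed).
ONSET_PHONEMES = {"b", "f"}


def key_times(segments, targets):
    """Turn (start_ms, end_ms, phoneme) segments into (time_ms, value) keys."""
    keys = []
    for start, end, phoneme in segments:
        t = start if phoneme in ONSET_PHONEMES else (start + end) / 2.0
        keys.append((t, targets[phoneme]))
    return keys


def sample(keys, t_ms):
    """Linearly interpolate the parameter at t_ms, holding the end values outside the keys."""
    times = [t for t, _ in keys]
    i = bisect_right(times, t_ms)
    if i == 0:
        return keys[0][1]
    if i == len(keys):
        return keys[-1][1]
    (t0, v0), (t1, v1) = keys[i - 1], keys[i]
    return v0 + (v1 - v0) * (t_ms - t0) / (t1 - t0)


def per_frame(keys, duration_ms, fps=30):
    """Sample the parameter once per frame, as a .lips-style sequence would need."""
    n_frames = int(duration_ms * fps / 1000) + 1
    return [sample(keys, n * 1000.0 / fps) for n in range(n_frames)]


# Invented example: one parameter (say, mouth opening) with made-up targets.
segments = [(0, 100, ".pau"), (100, 200, "b"), (200, 400, "A")]
targets = {".pau": 0.2, "b": 0.0, "A": 1.0}
keys = key_times(segments, targets)
values = per_frame(keys, 400)
```

Here the pause and the vowel reach their targets at 50 ms and 300 ms (mid-phoneme), while "b" reaches its closed-mouth target at 100 ms, right at its onset. The sketch assumes every segment has a positive length, so no two keys share a time.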
Here are some coarticulation resources:
Paper by Cohen and Massaro about coarticulation, and their ideas about
implementing it in a talking head. This page has some extra figures that
didn't come through in the .ps or .pdf versions.