Face Modeling Language (FML)

Version 2.1

Interactive Media Technologies Inc. (iMediaTek) 

September 15th, 2005

 

Table of Contents

INTRODUCTION
    Scope
    Motivation
    Basic Concepts
FML DOCUMENT STRUCTURE
LANGUAGE CONSTRUCTS
    Primary Elements
        Modeling
        Story Description and Time Containers
        Primitive Moves and Commands
    Decision-Making and Event Handling
        External Events
        Exclusive Time Containers
    Iteration
        Definite Loops
        Indefinite (Conditional) Loops
    Behavioral Templates
FML OBJECT MODEL
    Elements
    Attributes
FML AND MPEG-4
REFERENCES

Introduction

Scope

This document is the specification for Face Modeling Language (FML), which was designed based on research projects done in the Department of Electrical and Computer Engineering, University of British Columbia, and the School of Interactive Arts and Technology, Simon Fraser University. FML is a content description language for face animation. Motivations and basic concepts of FML are discussed, and its language constructs and object model (entities and their relations and attributes) are defined.

FML is based on Extensible Markup Language (XML) [1] and shares ideas and concepts with other standards, languages, and technologies. Issues common to them (e.g. XML document structure and parsing) are not discussed in this document.

An FML-compatible animation system has three major parts:
* FML Processor
* Animation Player that uses the FML document as input
* Application that owns the player object

Face Modeling Language is designed to be independent of the face animation methods used to render the scenes it describes. Such methods are not explicitly discussed here, but some aspects of FML were developed with certain needs of animation players in mind.

Motivation

Face Animation, as a special type of multimedia presentation, has been a challenging subject for many researchers. Advances in computer hardware and software, together with new web-based applications, have recently intensified these research activities. Video conferencing and online services provided by human characters are good examples of applications that use face animation. Personalized Face Animation includes all the information and activities required to create a multimedia presentation resembling a specific person. The input to such a system can be a combination of audio/visual data and textual commands and descriptions. A successful face animation system needs efficient yet powerful solutions for providing and displaying the content, i.e. a content description format, decoding algorithms, and finally an architecture that puts the different components together in a flexible way.

Advances in computer graphics techniques have allowed the incorporation of computer-generated content in multimedia presentations. Many techniques, languages, and programming interfaces have been proposed to let developers define their virtual scenes. OpenGL [2], Virtual Reality Modeling Language (VRML) [3], and Synchronized Multimedia Integration Language (SMIL) [4] are only a few examples. The growing use of web-based systems has also encouraged the use of such languages, since they make it possible to transmit only a textual description rather than complete audio-visual data, provided the audio-visual effects of the described actions can be recreated with a minimum acceptable quality. Although new streaming technologies allow real-time download/playback of audio/video data, bandwidth limitation and its efficient usage still are, and probably will remain, major issues.

In face animation (and also other cases) minimizing the data transfer time is not the only advantage of content specifications. In many situations, the "real" multimedia data does not exist at all, and has to be created based on a description of desired actions. This leads to the whole new idea of representing the spatial and temporal relation of the facial actions. In a generalized view, such a description of facial presentation should provide a hierarchical structure with elements ranging from low level "images", to simple "moves", more complicated "actions", to complete "stories". We call this a Structured Content Description, which also requires means of defining capabilities, behavioural templates, dynamic contents, and event/user interaction.

Based on the above ideas, some research in face animation has been done to translate certain facial actions into a predefined set of "codes". The Facial Action Coding System (FACS) [5] was probably the first successful attempt in this area. More recently, the MPEG-4 standard [6] has defined Face Definition and Animation Parameters (FDP and FAP) to encode low-level facial actions like jaw-down, and higher-level, more complicated ones like smile. It also provides the Extensible MPEG-4 Textual format (XMT) as a framework for incorporating textual descriptions in languages like SMIL and VRML. XMT does not yet include any face-specific features.

Due to its capabilities, popularity, and the availability of parsing tools, Extensible Markup Language (XML) seems to be the best choice as the basis of a content description language for face animation. Such a language can be considered a natural high-level abstraction on top of MPEG-4 FAPs and should be able to function as part of the XMT framework.

Basic Concepts

Face Modeling Language (FML) is a Structured Content Description mechanism based on Extensible Markup Language. The main ideas behind FML are:
* Hierarchical representation of face animation
* Timeline definition of the relation between facial actions and external events
* Defining capabilities and behavior templates
* Compatibility with MPEG-4 XMT and FAPs
* Compatibility with XML and related web technologies and existing tools

FACS and MPEG-4 FAPs provide the means of describing low-level face actions, but they do not cover temporal relations and higher-level structures. Languages like SMIL do this in a general-purpose form for any multimedia presentation and are not customized for specific applications like face animation. A language that brings the best of these two together, customized for face animation, seems to be an important requirement. FML is designed to do so, filling the gap in the XMT framework for a face animation language.

Fundamental to FML is the idea of Structured Content Description. It means a hierarchical view of face animation capable of representing anything from simple, individually meaningless moves to complicated high-level stories. This hierarchy can be thought of as consisting of the following levels (bottom-up):
* Frame, a single image showing a snapshot of the face (naturally, it may not be accompanied by speech)
* Move, a set of frames representing a linear transition between two frames (e.g. making a smile)
* Action or Act, a "meaningful" combination of moves
* Story, a stand-alone piece of face animation

The boundaries between these levels are not rigid or sharply defined. Due to the complicated and highly expressive nature of facial activities, a single move can make a simple yet meaningful story (e.g. an expression). The levels are basically required by the content designer, based on his/her presentation purposes and structure, in order to:
* Organize the content
* Define temporal relation between activities
* Develop behavioural templates

FML defines a timeline of events (Figure 1) including head movements, speech, and facial expressions, and their combinations. Since a face animation might be used in an interactive environment, such a timeline may be altered or determined by a user. Another function of FML is therefore to allow user interaction and, in general, event handling (user input can be considered a special case of an external event). This event handling may take the form of:
* Decision Making; choosing to go through one of possible paths in the story
* Dynamic Generation; creating a new set of actions to follow

Figure 1. FML Timeline and Temporal Relation of Face Activities

A major concern in designing FML is compatibility with existing standards and languages. The growing acceptance of the MPEG-4 standard makes it necessary to design FML so that it can be translated to/from a set of FAPs. Also, due to the similarity of concepts, it is desirable to use SMIL syntax and constructs as much as possible. Satisfying these requirements makes FML a good candidate for being a part of the MPEG-4 XMT framework.

FML Document Structure

FML is an XML-based language, following the same structural rules (e.g. well-formedness constraints) and sharing the same syntax. XML was chosen as the base for FML because of its capabilities as a markup language, its growing acceptance, and the system support available on different platforms. Figure 2 shows the typical structure of an FML document.

<fml>
    <model> <!-- Model Information -->
        <model-info-item>
    </model>
    <story> <!-- Animation Time Line -->
        <act>
            <time-container>
                <move-item>
                <...>
            </time-container>
            <...>
        </act>
        <...>
    </story>
</fml>

Figure 2. FML Document Map

An FML document consists, at the top level, of two types of elements: model and story. A model element is used for defining face capabilities, parameters, and initial configuration. A story element, on the other hand, describes the timeline of events in face animation. It is possible to have more than one of each element, but because animation may be executed sequentially in streaming applications, a model element affects only those parts of the document that come after it.

The face animation timeline consists of facial activities grouped into act modules. Within each group, activities are defined as simple Moves and their temporal relations. The timeline is primarily created using two time container elements, seq and par, corresponding to sequential and parallel temporal relations between moves. The begin times of activities inside seq and par are relative to the previous activity and the container begin time, respectively. story and act are special cases of the sequential time container that can only be used at the top levels of an FML document.

FML supports three basic face activities (moves): talking, facial expressions, and 3D head movements. Combined in time containers, they create an FML act. This combination can also be done using nested containers.
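As an illustration of this structure, the following Python sketch loads a small FML document with the standard xml.etree.ElementTree module and prints its element hierarchy. The sample document and the outline helper are illustrative assumptions, not part of FML; any XML parser could be used instead.

```python
import xml.etree.ElementTree as ET

# Illustrative document; element names follow the figures in this specification.
FML_TEXT = """
<fml>
    <model>
        <img file="me.jpg" type="front" />
        <range type="left" value="60" />
    </model>
    <story>
        <act>
            <seq begin="0">
                <talk>Hello</talk>
                <hdmv end="5" type="0" value="30" />
            </seq>
        </act>
    </story>
</fml>
"""


def outline(element, depth=0):
    """Return one indented line per element, in document order."""
    lines = ["  " * depth + element.tag]
    for child in element:
        lines.extend(outline(child, depth + 1))
    return lines


root = ET.fromstring(FML_TEXT)
print("\n".join(outline(root)))
```

The printed outline shows model and story directly under fml, with the act, its seq container, and the two moves nested below story.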

Language Constructs

Primary Elements

Modeling

FML model element embodies all the modeling and configuration parts of the document. In version 2.1 this can include the following elements:

* character: The person to be displayed in the animation; This element has one major attribute name and is used to initialize the animation player database.

* img: The image to be used for animation; This element has two major attributes, file and type. It provides an image and tells the player where to use it. For instance, the image can be a frontal or profile picture used for creating a 3D geometric model. The usage and value of type are player-dependent.

* sound: The sound data to be used in animation; This element also has a file attribute that points to a player-dependent audio data file/directory.

* range: Acceptable range of head movement in a specific direction; It has two major attributes: type and value specifying the direction and the related range value.

* param: Any player-specific parameter (e.g. MPEG-4 FDP); param has three attributes: type, name, and value.

* data: Any player-specific animation data file/directory (e.g. a 3D geometric model); data has two attributes: name and file.

* template and event: Behavioral models and external events; These elements are discussed in detail in later sections.

* bgsound (bgs): Background audio file

All these elements except template are XML empty elements (i.e. the information is in their attributes). Their absence is not considered a syntax error, since the animation player is supposed to use its default values. Figure 3 illustrates a sample FML model.

<model>
    <img file="me.jpg" type="front" />
    <range type="left" value="60" />
    <template name="hi" >
        <seq begin="0">
            <talk>Hello</talk>
            <hdmv begin="0" end="5" type="0" value="30" />
        </seq>
    </template>
</model>
<story>
    <behavior name="hi" />
</story>

Figure 3. FML Model and Templates
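Because missing model items fall back to player defaults, a player might read the model along the following lines. This is only a sketch: the default values and the "right" direction name are hypothetical, since this specification leaves them to the player.

```python
import xml.etree.ElementTree as ET

# Hypothetical player defaults; this specification does not define them.
DEFAULT_RANGES = {"left": 90.0, "right": 90.0}


def head_ranges(model):
    """Merge the range items of a model element over the player defaults."""
    ranges = dict(DEFAULT_RANGES)
    for item in model.findall("range"):
        ranges[item.get("type")] = float(item.get("value"))
    return ranges


model = ET.fromstring(
    '<model>'
    '<img file="me.jpg" type="front" />'
    '<range type="left" value="60" />'
    '</model>'
)
print(head_ranges(model))
```

Only the left range is overridden here; the other direction keeps the assumed default.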

Story Description and Time Containers

The FML timeline, presented in Stories, consists primarily of Acts, which are purposeful sets of Moves. The Acts are performed sequentially but may themselves contain parallel Moves. Time Containers are FML elements that represent the temporal relation between moves. The basic Time Containers are seq and par, corresponding to sequential and parallel activities. The former contains moves that start one after another, and the latter contains moves that begin at the same time. Time Containers can include primitive moves and also other Time Containers in a nested way.

Time Containers have three other attributes, begin, duration, and end (the default value for begin is zero, and duration is an alternative to end), that give the related times in milliseconds.

FML also has a third type of Time Container, excl, used for implementing exclusive activities and decision-making as discussed later.

Primitive Moves and Commands

FML supports three types of primitive moves: talk, expr, and hdmv for speech, facial expressions, and 3D head movements, respectively. The fap element is also provided for direct embedding of MPEG-4 FAPs.

* talk (spk) is a non-empty XML element and its content is the text to be spoken.

* expr (exp) specifies facial expressions with attributes type and value. The expression types can be neutral, joy, sadness, anger, fear, disgust, surprise, blink, and nod. They can have a value from zero to 100%. expr is an empty element.

* hdmv (mov) handles 3D head movements with attributes type (yaw, pitch, and roll) and value (-100% to 100%). Considering the three axes X (horizontal), Y (vertical), and Z (normal to 2D plane), these movements are rotation around the axes. This move is also an empty element and has the same attributes as facial expressions.

* fap inserts an MPEG-4 FAP into the document. It is also an empty element with attributes type (FAP number) and value (-100% to 100%).

* rprm (rfp) activates a legacy rFace parameter. It is an empty element with attributes type (param number) and value.

* param (prm): Any player-specific parameter (e.g. MPEG-4 FDP); param has three attributes: type, name, and value. For example, for a Component param in the iFACE system: type="comp", name="2-1-1" (group-param-subparam), and value="10".

* play (run) plays a wave or keyframe file. It is an empty element with only one required attribute, file (filename). A nonzero value attribute means play-to-file at the given FPS.

* capture (rec) captures the audio and animates the face accordingly. This is an empty element with no attributes.

* target (out) is the file that is the target of recording or playback.

* movie (f2m) makes a movie named in file using the current background audio and last output frames and the FPS given in value.

* txture (img) loads a new texture file.

* ptype (pty) loads a new personality type file.

* reset (clr) resets the face.

* bgsound (bgs): Background audio file

* geometry (geo) opens a new geometry file (x, BMP, JPG, MSH, IMG, CHR)

* wait (nop) performs no operation. It is an empty element with timing attributes, only.

* system (sys) executes a system command using value.

* exit (end) terminates the script. It is an empty element without any attributes.

<act>
    <seq begin="0">
        <talk>Hello</talk>
        <hdmv end="5" type="0" value="30" />
    </seq>
    <par begin="0">
        <talk>Hello</talk>
        <expr end="3" type="3" value="50" />
    </par>
</act>

Figure 4. FML Time Containers and Primitive Moves

All primitive moves have three other attributes: begin, duration, and end (the default value for begin is zero, and duration is an alternative to end). In a sequential time container, begin is relative to the start time of the previous move, and in a parallel container it is relative to the start time of the container. In case of a conflict, the duration of moves is set according to their own settings rather than the container's. Figure 4 illustrates the use of time containers and primitive moves.
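The begin rule for the two containers can be sketched as below. The function follows the wording above literally: each begin offset in a seq is added to the start time of the previous move, and each offset in a par is added to the start time of the container. Measuring the first move of a seq from the container start is an assumption, and a real player may also take the previous move's duration into account. The start_times name and its arguments are hypothetical.

```python
def start_times(container, begins, container_start=0):
    """Return absolute start times in ms for moves with the given begin offsets.

    container is "seq" or "par"; begins lists each move's begin value in ms.
    """
    if container not in ("seq", "par"):
        raise ValueError("container must be 'seq' or 'par'")
    starts = []
    reference = container_start
    for begin in begins:
        start = reference + begin
        starts.append(start)
        if container == "seq":
            # The next move is measured from this move's start time.
            reference = start
    return starts


print(start_times("seq", [0, 500, 250]))  # [0, 500, 750]
print(start_times("par", [0, 500, 250]))  # [0, 500, 250]
```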

Decision-Making and Event Handling

External Events

The interaction between the owner application (or user) and the FML document is provided through FML External Events. In FML version 1.0, External Events are used in decision-making and indefinite iteration. Generally, they can be used for any interaction by users/applications to dynamically define or alter the behavior of an FML document.

External Events are defined by event elements in the model section of an FML document. Each event is an empty XML element whose attributes give it a name and an initial value, for example:
<event name="user" value="-1" />

Exclusive Time Containers

Normal Time Containers (i.e. sequential and parallel) define the order in which activities inside an Action are performed. The Exclusive Time Container, excl , allows making decisions and choosing an option among a set of available activities. This is the primary means of dynamically controlling the behavior of an FML document. Each Exclusive Time Container is associated with a pre-defined External Event and performs only one of its available Move sets based on the event value, as shown in Figure 5.

<event name="user" value="-1" />
. . .
<excl event_name="user">
    <talk event_value="0">Hello</talk>
    <talk event_value="1">Bye</talk>
</excl>

Figure 5. FML Decision-Making

If the event value does not match any of the values specified by event_value, the FML document playback pauses until the value is set by the user/application. The FML Processor exposes an appropriate interface function to allow event values to be set at run time. excl is the FML counterpart of the familiar if-else constructs in conventional programming languages.
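The selection rule of an exclusive container can be sketched as follows, using the document from Figure 5. The select_branch function and the events dictionary (event name to current value) are hypothetical names; the actual interface of the FML Processor is not described in this specification.

```python
import xml.etree.ElementTree as ET


def select_branch(excl, events):
    """Return the child of excl whose event_value matches the current event value.

    events maps event names to their current values (as strings). None means
    that no child matches, in which case playback would pause until the
    value is set.
    """
    current = events.get(excl.get("event_name"))
    if current is None:
        return None
    for child in excl:
        if child.get("event_value") == current:
            return child
    return None


excl = ET.fromstring(
    '<excl event_name="user">'
    '<talk event_value="0">Hello</talk>'
    '<talk event_value="1">Bye</talk>'
    '</excl>'
)
events = {"user": "-1"}  # initial value given by the event element
print(select_branch(excl, events))  # None: playback pauses
events["user"] = "1"  # the application sets the event at run time
print(select_branch(excl, events).text)  # Bye
```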

Using the repeat attribute (discussed in the next section), event handlers can be allowed to work more than once. Setting the type attribute of an excl to "resident" puts the event handler in resident mode, where the script seems to terminate but the event handler continues to work.

Iteration

Definite Loops

Iteration in FML is provided by the repeat attribute of Time Container elements, which simply cycles through the content a specified number of times (Definite Loops) or until a certain condition is satisfied (Indefinite Loops). For a Definite Loop, repeat is either a number or the name of an external event with a non-negative numeric value.

Indefinite (Conditional) Loops

Indefinite Loops are formed when the repeat attribute is associated with an external event (e.g. "kbd;F1_up" for F1 key released event). In such cases, the iteration continues until the event happens. Figure 6 shows examples of FML iteration.

<event name="select" value="kbd;F1_up" />
< ... >
<act repeat="select">
    <seq>
        <talk begin="1">Come In</talk>
        < ... >
    </seq>
</act>

Figure 6. FML Iteration
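One possible way for a processor to tell the two kinds of loop apart is sketched below. It assumes that a repeat value made only of digits, or an event whose value is made only of digits, gives a definite count, and that any other event value (such as "kbd;F1_up") marks an indefinite loop. The repeat_count function and the events dictionary are hypothetical.

```python
def repeat_count(repeat, events):
    """Interpret a repeat attribute.

    Returns an int for a definite loop, or None for an indefinite loop that
    runs until the named event happens.
    """
    if repeat.isdecimal():
        return int(repeat)  # literal count, e.g. repeat="2"
    if repeat not in events:
        raise ValueError("repeat names an undefined event: " + repeat)
    value = events[repeat]
    return int(value) if value.isdecimal() else None


events = {"select": "kbd;F1_up", "times": "3"}
print(repeat_count("2", events))       # 2
print(repeat_count("times", events))   # 3
print(repeat_count("select", events))  # None
```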

Behavioral Templates

In version 1.0, FML behavioral templates are similar to subroutines in programming languages. They define a set of parameterized activities to be recalled inside the Story using the behavior element. In later versions, they can be extended to include behavioral rules and knowledge bases, especially for interactive applications. A typical model element is illustrated in Figure 7, defining a behavioral template that is used later in the story.

<model>
    <template name="hi" >
        <seq begin="0">
            <talk>Hello</talk>
            <hdmv begin="0" end="5" type="0" value="param-1" />
        </seq>
    </template>
</model>
<story>
    <behavior name="hi" param-1="50" />
</story>

Figure 7. FML Behavioral Templates
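The subroutine analogy can be made concrete with a sketch of template expansion applied to the document of Figure 7. It assumes that every attribute of behavior other than name is a parameter, and that a template attribute whose value equals a parameter name is replaced by the parameter's value; the specification does not spell out the substitution rule, and expand_behavior is a hypothetical helper.

```python
import copy
import xml.etree.ElementTree as ET


def expand_behavior(model, behavior):
    """Return copies of the named template's children with parameters filled in."""
    name = behavior.get("name")
    template = next(
        (t for t in model.findall("template") if t.get("name") == name), None
    )
    if template is None:
        raise KeyError("no template named " + repr(name))
    # Assumption: every behavior attribute except name is a parameter.
    params = {key: value for key, value in behavior.attrib.items() if key != "name"}
    expanded = []
    for child in template:
        clone = copy.deepcopy(child)  # leave the template itself untouched
        for element in clone.iter():
            for key, value in list(element.attrib.items()):
                if value in params:
                    element.set(key, params[value])
        expanded.append(clone)
    return expanded


model = ET.fromstring(
    '<model><template name="hi"><seq begin="0"><talk>Hello</talk>'
    '<hdmv begin="0" end="5" type="0" value="param-1" /></seq></template></model>'
)
behavior = ET.fromstring('<behavior name="hi" param-1="50" />')
moves = expand_behavior(model, behavior)
print(moves[0].find("hdmv").get("value"))  # 50
```

Copying the template before substitution lets the same template be recalled again with different parameter values.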

FML Object Model

Elements

The FML object model consists of FML elements and their base classes. Figure 8 summarizes the hierarchy of object classes in FML documents and their attributes.

FMLElement ( id )
    FMLTimeContainer ( begin, duration, end, value, repeat ) //value for event
        seq
            story
            act
        par
        excl ( name ) //name for event
    FMLMove ( type, value , file )
        talk
        expr
        hdmv
        fap
        play
        capt
        save
        txtr
    FMLModelItem ( type, name, value, file )
        character
        img
        sound
        param
        data
        range
    FMLEtc ( name, value )
        template  
        event  
        behavior

Figure 8. FML Object Model

It should be noted that each FMLMove object is in fact a Time Container including only one move. Also worth noting is that some attributes may be left unused by the related objects. For example the elements in FMLEtc usually use either value or name.
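The hierarchy of Figure 8 can be written down as plain lookup tables, which makes it easy to list the attributes an element inherits. The sketch below follows the indentation of the figure; the PARENT and OWN_ATTRIBUTES tables and the two helper functions are illustrative names only.

```python
# Parent of each class or element, following the indentation of Figure 8.
PARENT = {
    "FMLTimeContainer": "FMLElement",
    "FMLMove": "FMLElement",
    "FMLModelItem": "FMLElement",
    "FMLEtc": "FMLElement",
    "seq": "FMLTimeContainer",
    "par": "FMLTimeContainer",
    "excl": "FMLTimeContainer",
    "story": "seq",
    "act": "seq",
}
for name in ("talk", "expr", "hdmv", "fap", "play", "capt", "save", "txtr"):
    PARENT[name] = "FMLMove"
for name in ("character", "img", "sound", "param", "data", "range"):
    PARENT[name] = "FMLModelItem"
for name in ("template", "event", "behavior"):
    PARENT[name] = "FMLEtc"

# Attributes introduced at each level of the hierarchy.
OWN_ATTRIBUTES = {
    "FMLElement": ["id"],
    "FMLTimeContainer": ["begin", "duration", "end", "value", "repeat"],
    "excl": ["name"],
    "FMLMove": ["type", "value", "file"],
    "FMLModelItem": ["type", "name", "value", "file"],
    "FMLEtc": ["name", "value"],
}


def ancestry(name):
    """Return the chain from name up to FMLElement."""
    chain = [name]
    while chain[-1] in PARENT:
        chain.append(PARENT[chain[-1]])
    return chain


def attributes(name):
    """Collect own and inherited attributes, base class first."""
    result = []
    for level in reversed(ancestry(name)):
        result.extend(OWN_ATTRIBUTES.get(level, []))
    return result


print(ancestry("act"))     # ['act', 'seq', 'FMLTimeContainer', 'FMLElement']
print(attributes("excl"))  # ['id', 'begin', 'duration', 'end', 'value', 'repeat', 'name']
```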

Attributes

Most of the attributes are self-explanatory. The following are some comments on those with special usage.

type

The type attribute is used in range, fap, expr, and hdmv, and can have the following numeric or string values:

hdmv (the two last values are related to movement in XY plane)
    yaw
    pitch
    roll
expr
    neutral
    joy
    sadness
    anger
    fear
    disgust
    surprise
    nod
    blink

fap
    standard MPEG-4 FAP numbers

excl
    "resident" if we want the event processing to remain active while allowing the script to end.

value

In FMLMove elements and also for range, value is numeric (adding a d at the end makes it relative, i.e. a delta). Otherwise, it is a string (name, address, ...).

range and hdmv: relative movement in degrees

expr: percent of full expression

fap: standard MPEG-4 FAP values
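A numeric value with the optional trailing d could be read as in the following sketch. The parse_value name and the (number, is_relative) return shape are illustrative choices, not part of FML.

```python
def parse_value(text):
    """Split a numeric value attribute into (number, is_relative)."""
    text = text.strip()
    relative = text.endswith("d")  # a trailing d marks a relative (delta) value
    if relative:
        text = text[:-1]
    return float(text), relative


print(parse_value("30"))    # (30.0, False)
print(parse_value("-15d"))  # (-15.0, True)
```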

begin, duration, end

All times are in milliseconds by default. An ending s means the value is in seconds, e.g. begin="2s".
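The unit rule can be sketched as a small parser. The second helper additionally assumes that end is measured from the same reference as begin and that duration is added to begin when end is absent; the specification only states that duration is an alternative to end. Both function names are hypothetical.

```python
def parse_time_ms(text):
    """Convert a begin, duration, or end attribute to milliseconds."""
    text = text.strip()
    if text.endswith("s"):
        return float(text[:-1]) * 1000.0  # value given in seconds
    return float(text)  # milliseconds by default


def resolve_interval(attrs):
    """Return (begin, end) in ms; end is None if neither end nor duration is set."""
    begin = parse_time_ms(attrs.get("begin", "0"))  # begin defaults to zero
    if "end" in attrs:
        return begin, parse_time_ms(attrs["end"])
    if "duration" in attrs:
        return begin, begin + parse_time_ms(attrs["duration"])
    return begin, None


print(parse_time_ms("2s"))  # 2000.0
print(resolve_interval({"begin": "1s", "duration": "500"}))  # (1000.0, 1500.0)
```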

 

FML and MPEG-4

FML is a high level abstraction on top of MPEG-4 Face Animation Parameters. FAPs can be grouped into the following categories:
* Visemes
* Expressions
* Low-level facial movements

In FML, visemes are handled implicitly through the talk element. The FML processor translates the input text to a set of phonemes and visemes compatible with those defined in the MPEG-4 standard. FML facial expressions are defined in direct correspondence to those in MPEG-4 FAPs. For other face animation parameters, the fap element can be used. This element works like other FML moves, and its type and value attributes are compatible with FAP numbers and values.

Considering this compatibility, FML documents can be easily translated into MPEG-4 streams, which makes FML a good candidate for the Extensible MPEG-4 Textual Format (XMT) framework.
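As a first step toward such a translation, a processor could collect the explicit fap elements of a document, as sketched below. The FAP numbers and values in the sample are arbitrary illustrations, fap_pairs is a hypothetical helper, and encoding the pairs into an actual MPEG-4 stream is outside the scope of the sketch.

```python
import xml.etree.ElementTree as ET


def fap_pairs(root):
    """List (FAP number, value) for every fap element, in document order."""
    return [
        (int(element.get("type")), float(element.get("value")))
        for element in root.iter("fap")
    ]


act = ET.fromstring(
    '<act><par>'
    '<fap type="3" value="40" />'
    '<fap type="4" value="-10" />'
    '</par></act>'
)
print(fap_pairs(act))  # [(3, 40.0), (4, -10.0)]
```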

References

[1] http://www.w3.org/xml
[2] http://www.opengl.org
[3] http://www.vrml.org
[4] Bulterman, D., "SMIL-2," IEEE Multimedia, October 2001.
[5] Ekman, P., and Friesen, W.V., Facial Action Coding System, Consulting Psychologists Press Inc., 1978.
[6] Battista, S., et al, "MPEG-4: A Multimedia Standard for the Third Millennium", IEEE Multimedia, October 1999.