GIP

A large dataset intended for studying and testing facial expressions (e.g. validation, recognition, training) is available in the V3R GIP format, along with scripts for reading these GIP-specific formats, e.g. for displaying a single frame from a video, parsing file headers, and more.

The expressions dataset comes with several ways to extract the raw 3-D data, remove improper signal using sanity checks/thresholds, etc. Each expression file comprises a few seconds of video, amounting to about 10-15 individual 3-D frames. The easiest way to use it is to pick the frames and extract just the faces from the video. After that, loading the files into MATLAB (using GIP-supplied scripts) is relatively easy.
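As a rough illustration of the kind of sanity check mentioned above, the following Python sketch masks depth samples that are non-finite or fall outside a plausible range, then reports how much of a frame survives. It is not the GIP-supplied code: the function names, the nested-list frame layout, the use of None for missing samples, and the example depth limits are all assumptions made for the sake of the example.

```python
import math


def mask_invalid_depth(frame, z_min, z_max):
    """Return a copy of `frame` with implausible depth samples set to None.

    `frame` is a list of rows of depth values (hypothetical layout).
    A sample is kept only if it is finite and lies within [z_min, z_max].
    """
    return [
        [
            z if (z is not None and math.isfinite(z) and z_min <= z <= z_max) else None
            for z in row
        ]
        for row in frame
    ]


def valid_fraction(frame):
    """Fraction of samples in `frame` that are not None (0.0 for an empty frame)."""
    total = sum(len(row) for row in frame)
    valid = sum(1 for row in frame for z in row if z is not None)
    return valid / total if total else 0.0


# Toy 2x2 frame; the limits 400 and 900 are made-up values in arbitrary units.
frame = [[500.0, float("nan")], [-1.0, 650.0]]
masked = mask_invalid_depth(frame, z_min=400.0, z_max=900.0)
print(masked, valid_fraction(masked))
```

A per-frame valid fraction of this sort could also serve as a crude criterion when picking which of the 10-15 frames in a sequence to keep, though suitable limits would have to be chosen for the actual acquisition setup.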

As for the imaged subject, it is laid out in consistent locations from frame to frame, which cross-frame analysis can help predict. Exploiting that consistency seems like the right approach. Building a model for each sequence could be interesting too, although the set might be too small given the density of the sample points (assuming the full surface is sampled, without increased spacing on a grid).

The nature of the data in the expressions dataset suggests that it was acquired differently from the FRGC dataset. The noise cancellation and cropping algorithms, as they were developed to handle the FRGC datasets, cannot cope well with the expressions set. Specifically, there are very many spikes everywhere, which smoothing alone cannot eliminate, and the outlier removal phase also needs considerable adjustment.

Regarding smoothing, and especially because of the spikes, we decided to try median-based filtering, which worked reasonably well. Any variant should work fine: separable on x and then y, or vice versa (or even just one of them). In practice, the median is computed along both dimensions and the mean of the two values is then taken as the spike-resistant value. The median is what is implemented at the moment, but the remaining problem is threshold setting, where the principal drawback is blindly deciding that one spike is noise (e.g. salt and pepper) and another is real signal that should be kept intact.

Exposing image-specific factors in the GUI would lead to confusion, as manually optimising for one image (or image type, based on acquisition parameters) rarely generalises to all (unseen) images in the same set; glasses, hair, and impairments are unpredictable. Spending many hours on pre-processing would probably detract from progress on the 'meat' of this work. Then again, without automatic, intervention-free pre-processing, the modelling is based on very poor training data, which in turn will yield unimpressive results (poor breakdown into modes) and raise exceptions due to edge cases. Ideally we would have polished images with the noise removed, but that would entirely miss the point of presenting an automatic method (or fully pipelined framework) which can handle data in its raw form.
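The median scheme described above (a median along x, a median along y, and the mean of the two) can be sketched in a few lines of Python. This is only a minimal illustration and not the implementation used in this work: the function names, the window radius, the nested-list depth grid, and the optional threshold parameter are all assumptions introduced for the example.

```python
from statistics import median


def median_1d(values, radius):
    """Sliding-window median over a 1-D sequence; the window is clipped at the ends."""
    n = len(values)
    return [
        median(values[max(0, i - radius):min(n, i + radius + 1)])
        for i in range(n)
    ]


def despike(depth, radius=1, threshold=None):
    """Spike-resistant estimate for a 2-D depth grid (list of equal-length rows).

    For each sample, take the median along its row and the median along its
    column, then average the two. If `threshold` is given, a sample is replaced
    only when it differs from that estimate by more than `threshold`; otherwise
    every sample is replaced by the estimate.
    """
    by_row = [median_1d(list(row), radius) for row in depth]        # medians along x
    by_col = [median_1d(list(col), radius) for col in zip(*depth)]  # medians along y (transposed)
    result = []
    for r, row in enumerate(depth):
        new_row = []
        for c, value in enumerate(row):
            robust = 0.5 * (by_row[r][c] + by_col[c][r])
            if threshold is None or abs(value - robust) > threshold:
                new_row.append(robust)
            else:
                new_row.append(value)
        result.append(new_row)
    return result


# Flat 5x5 patch with a single spike in the middle (toy values).
patch = [[10.0] * 5 for _ in range(5)]
patch[2][2] = 500.0
print(despike(patch)[2][2])
```

The threshold variant makes the drawback noted above concrete: whatever value is chosen, a deviation just below it is kept as signal and one just above it is discarded as noise, and a value tuned on one image need not suit another.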


Roy Schestowitz 2012-01-08