Data Deficiency

The codebase now branches to handle different data types without the user needing to make manual adjustments. Having constructed and tested binary masks for the GIP datasets, we needed to build and test similar masks for the data in the GC experiments, where scale variations are more common but can nonetheless be modelled. The figure below shows two examples of masks we can use (assuming roughly fixed scale).

**Figure:** The masks used to crop residuals in the FRGC dataset. The left-hand one is more restrictive, in the sense that it omits some of the face data near the edges.
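To make the masking step concrete, here is a minimal sketch of how a binary mask can be applied to a residual image. The elliptical shape, the function names and the `margin` parameter are assumptions made purely for illustration; the actual masks in the figure may be shaped differently, and NumPy is assumed to be available.

```python
import numpy as np


def elliptical_mask(height, width, margin=0.0):
    """Binary mask that is True inside an ellipse centred on the image.

    `margin` is a hypothetical parameter: it shrinks both semi-axes by that
    fraction, giving a more restrictive mask that drops data near the edges.
    """
    rows, cols = np.ogrid[:height, :width]
    cy, cx = (height - 1) / 2.0, (width - 1) / 2.0
    ry = (height / 2.0) * (1.0 - margin)
    rx = (width / 2.0) * (1.0 - margin)
    return ((rows - cy) / ry) ** 2 + ((cols - cx) / rx) ** 2 <= 1.0


def crop_residual(residual, mask):
    """Return a copy of the residual with everything outside the mask zeroed."""
    return np.where(mask, residual, 0.0)


# Two masks of the same size: a permissive one and a more restrictive one.
loose = elliptical_mask(64, 64)
tight = elliptical_mask(64, 64, margin=0.2)

# Synthetic stand-in for a residual image (not real scan data).
residual = np.ones((64, 64))
cropped_loose = crop_residual(residual, loose)
cropped_tight = crop_residual(residual, tight)
```

Because both masks share one image size, this only makes sense under the roughly fixed scale assumed above; handling scale variation would need the mask to be resized or refitted per scan.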

The datasets in the experiments' protocols are not sufficient because, as noted earlier, more data is required for training (it may come from fewer individuals; as few as three should be enough). To quote from IJCV (page 10, or 311 in the published book):

The FRGC dataset was augmented by 3006 scans that were acquired using a Minolta vivid scanner in our laboratory. The 3006 scans belong only to three subjects (1000 non-neutral scans and 2 neutral per subject). The non-neutral scans were acquired while the subject was talking or reading loudly and with the intention to produce as diverse facial expressions as possible. The FRGC dataset have scans for a large number of subjects but we also need a large number of non-neutral scans per subject for experiment no. 3 (Sect. 3.5). In addition, some of them are used in model training (Sect. 3.2) as it turned out that the expression deformation model requires large training data.

3.2 Model Training Data

The training partition of the FRGC was not sufficient for training our Deformation Expression Model. Insufficient training data can result in noisy eigenvectors (of the model, Sect. 2.3), especially those with lower eigenvalues. Also, a smaller training data may lack enough instances of facial expressions of different people. Consequently, the model may not perform optimally during face recognition. To test how the size of the training data affects the performance of the deformation model, three Expression Deformation Models were trained using training data of 400, 800 and 1700 scan pairs. Then, they were used in non-rigid face recognition of 400 unseen probes under non-neutral expressions with an appropriate subspace dimension of the deformation model (the dimension is 55, see Sect. 3.3). The identification rates in the three cases were 89%, 93%, and 95%, respectively. The rates have increased for larger training data sizes.

In the following experiments, the Expression Deformation Model which gave best results (the one which is trained by 1700 pairs) was used as the generic deformation model. The 1700 pairs were formed among the training partition of the FRGC dataset (943 scans), 597 non-neutral scans from the evaluation partition (leaving about 1000 non-neutral scans for testing) and 500 scans from our acquired data.

To reproduce the results we may need access to this data. The authors seem to rely on it, and the GIP datasets are somewhat different in nature, at least in the sense that they provide either many expressions from the same subject or many neutral scans from many different people. This makes the problem similar to that of running experiments with orphaned GC data and protocols (although not identical to it, which only complicates matters further).

It is clear from the explanation given in the text that very large (and augmented) training sets are needed to train a model from which to take the 55 leading dimensions. Using the Perl scripts provided in BEE to parse the XML files would not be sufficient to reproduce the models and replicate the results. If we failed, it could be argued that we had not followed the documented procedure.
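As a rough illustration of what "taking 55 dimensions" from a trained model means, the sketch below fits a PCA subspace to a set of deformation vectors and keeps the leading components. This is plain PCA via SVD on synthetic data, offered as an assumption about the general technique and not as the authors' actual training procedure; the function names, the vector length of 120 and the random stand-in data are all hypothetical.

```python
import numpy as np


def train_deformation_subspace(deformations, n_components=55):
    """Fit PCA to row-stacked deformation vectors.

    Returns (mean, basis, eigenvalues), where the rows of `basis` are the
    leading principal directions, ordered by decreasing eigenvalue.
    """
    X = np.asarray(deformations, dtype=float)
    mean = X.mean(axis=0)
    # SVD of the centred data: rows of vt are the principal directions.
    _, s, vt = np.linalg.svd(X - mean, full_matrices=False)
    eigvals = s ** 2 / (X.shape[0] - 1)
    k = min(n_components, vt.shape[0])
    return mean, vt[:k], eigvals[:k]


def project(sample, mean, basis):
    """Coefficients of one sample in the truncated subspace."""
    return basis @ (np.asarray(sample, dtype=float) - mean)


# Synthetic stand-in for 400 training pairs (not real scan data).
rng = np.random.default_rng(0)
synthetic = rng.normal(size=(400, 120))

mean, basis, eigvals = train_deformation_subspace(synthetic, n_components=55)
coeffs = project(synthetic[0], mean, basis)
```

With too few training samples, the trailing directions in `basis` are estimated from very little variance, which is consistent with the quoted remark that insufficient data gives noisy eigenvectors, especially those with lower eigenvalues.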

In order to plan analogous experiments, we ought to check whether their results are reproducible from the details given in the paper. If reproducing these results proves difficult, the authors can be contacted for pointers, or the results can be contested.

Scepticism in this case may seem unfair, but there are points of weakness that are only found along the way, as all of these pieces are put together. They highlight issues which must be overcome: issues that are only mentioned or alluded to in the text but not properly addressed in a formal, technical sense (not even by reference to prior work). All this wiggle room is the reason mimicking the framework has been slower than expected. Mian's prior work (he received his Ph.D. under Bennamoun's supervision in 2007) is the foundation on which his first student's work is based; the latter is an extension of Mian's rigid, ICP-based implementation. This means that we need to decipher some of this prior work and build upon it the extensions that Osaimi put in place until his graduation last year. It is not a monumental task, especially if we aim only to disprove their claims about GMDS benchmarks, which seem unfair as they compare apples with oranges. We have contacted the first author regarding the availability of the dataset with which they augment FRGC 2.0.

Roy Schestowitz 2012-01-08