ROC Curves

At this stage we can produce benchmarks (ROC curves) on some of the data sets. We already get some numbers, but to get good numbers and organise them in ROC curves we need to finalise the protocols for dividing up the sets. To get ROC curves tested as soon as possible, only X data was used, which is not terribly useful as X generally contains little signal (and has low entropy). All the preliminary results should therefore be treated as a proof of concept.

Careful arrangement of the sets will be necessary to ensure that many tests, without too much overlap or repetition, can be set up and used as our standard. The set of 86 distinct people should be partitioned sensibly. To test this and show that the results are reasonable for a test set of just 13 faces, Figure $[*]$ displays the ROC curves obtained using the mean of differences (one of several similarity measures). We will need better ones, preferably with comparators too (overlaid curves for human judgment).

**Figure:** ROC curves plotted for just 13 tests done on the FRGC datasets with expressions isolated
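As a rough illustration of how such a curve can be derived from pairwise scores, the sketch below computes a mean-of-differences score and sweeps a threshold over it. It assumes that "mean of differences" means the mean absolute difference between two aligned images on the same grid, and that lower scores indicate a better match. The function names (`mean_abs_difference`, `roc_points`, `auc`) are hypothetical and are not taken from the framework.

```python
def mean_abs_difference(image_a, image_b):
    """Mean absolute difference between two equally sized 2D grids (assumed measure)."""
    diffs = [abs(a - b)
             for row_a, row_b in zip(image_a, image_b)
             for a, b in zip(row_a, row_b)]
    return sum(diffs) / len(diffs)


def roc_points(genuine, impostor):
    """Return (false positive rate, true positive rate) pairs.

    Scores are dissimilarities: a pair is accepted as a match
    when its score is less than or equal to the threshold.
    """
    thresholds = sorted(set(genuine) | set(impostor))
    points = [(0.0, 0.0)]  # threshold below every score: nothing accepted
    for t in thresholds:
        tpr = sum(s <= t for s in genuine) / len(genuine)
        fpr = sum(s <= t for s in impostor) / len(impostor)
        points.append((fpr, tpr))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    pts = sorted(points)
    return sum((x1 - x0) * (y0 + y1) / 2
               for (x0, y0), (x1, y1) in zip(pts, pts[1:]))
```

With so few pairs the curve is necessarily a coarse staircase, since each pair contributes one step.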

It is usually worth checking whether other groups, or even people who work in the same lab, have pre-partitioned and classified data (per individual). That would save the researcher the hassle of doing it manually. Just picking out expressions took nearly 5 hours. The larger the dataset, the smoother the ROC curves will be.
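When no pre-partitioned data is available, one plausible way of dividing the set is by individual, so that no person appears in more than one partition. The sketch below assumes this is what a sensible partition means here; the function names, the 25% test fraction, and the fixed seed are all illustrative assumptions, not the protocol used in this work.

```python
import random


def partition_subjects(subject_ids, test_fraction=0.25, seed=0):
    """Split distinct subject IDs into disjoint train and test sets."""
    subjects = sorted(set(subject_ids))
    rng = random.Random(seed)  # fixed seed so the split is reproducible
    rng.shuffle(subjects)
    n_test = max(1, round(len(subjects) * test_fraction))
    return set(subjects[n_test:]), set(subjects[:n_test])


def split_images(images, train_subjects, test_subjects):
    """Assign (subject_id, image_name) records to the partition of their subject."""
    train = [rec for rec in images if rec[0] in train_subjects]
    test = [rec for rec in images if rec[0] in test_subjects]
    return train, test
```

Splitting by subject rather than by image avoids the same face appearing on both sides of the split, which would otherwise inflate the results.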

Classification by hand takes a while, but it is crucial for results. The work done for NIST should have it for the training. The training partition contains nothing with expression variation, however, so we classify the images already isolated for the task of expression removal, comparing an approach that does not annul expression against a similar one that does.

Basic experiments soon followed. In this very preliminary test we are dealing with a rather difficult set, using different acquisition conditions and different expressions from many people. We focus on dealing with just rigid registration (the latest GIP ICP implementation) and simple metrics. ROC curves are plotted from the data gathered from 50 examples (see Figure $[*]$ ).

Next, we intend to improve the results with more cunning registration, annulment of facial expressions (e.g. the EDM approach), and most importantly improved algorithms for masking and aligning image parts, then measuring more meaningful properties in them.

**Figure:** Preliminary test where several images (not complete set) are used to get a rough idea of what the ROC curves will look like

With a much larger sample set, which includes all the neutral-to-non-neutral pairings, I ran the same experiment, this time using an older ICP, which uses PCA, to plot the ROC curves (see Figure $[*]$ ). ICP is only used for translation in this case. There is plenty of room for improvement and it should not be hard to achieve shortly. This has been an exercise in testing the foundations of the framework, which is now much more streamlined.

**Figure:** A somewhat larger test on non-neutral sets, where ICP based on PCA is used for alignment, then the mean of residuals is used as a similarity measure
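To make the idea of translation-only alignment followed by a mean-of-residuals score concrete, here is a minimal sketch. It is not the PCA-based ICP used in the experiment: it is a plain nearest-neighbour ICP restricted to translation, with brute-force matching, and all names (`nearest`, `translation_icp`, `mean_residual`) are hypothetical.

```python
import math


def nearest(point, cloud):
    """Brute-force nearest neighbour of a 3D point in a point cloud."""
    return min(cloud, key=lambda q: math.dist(point, q))


def translation_icp(probe, reference, iterations=20, tol=1e-9):
    """Iteratively translate `probe` towards `reference` (no rotation)."""
    moved = [tuple(p) for p in probe]
    shift = [0.0, 0.0, 0.0]
    for _ in range(iterations):
        matches = [nearest(p, reference) for p in moved]
        # The best translation for fixed matches is the mean offset.
        step = [sum(m[k] - p[k] for p, m in zip(moved, matches)) / len(moved)
                for k in range(3)]
        moved = [tuple(p[k] + step[k] for k in range(3)) for p in moved]
        shift = [s + d for s, d in zip(shift, step)]
        if math.sqrt(sum(d * d for d in step)) < tol:
            break
    return moved, tuple(shift)


def mean_residual(probe, reference):
    """Mean distance from each probe point to its nearest reference point."""
    return sum(math.dist(p, nearest(p, reference)) for p in probe) / len(probe)
```

The brute-force matching is quadratic in the number of points, so this form is only suitable for small clouds or for checking a faster implementation.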

Figure $[*]$ shows the same type of curve for a method that was made more robust to noise and more sensitive to differences.

**Figure:** The same type of comparison with the same type of set (as in Figure $[*]$ ) but with a more cunning similarity measure and an example of the X data at the bottom right

Comparisons have so far involved just the X/horizontal axis data (see Figure $[*]$ for X, Y, and Z data overlaid), which was not particularly useful for telling people apart. It was intended to test and explore some new code. A median-based method that takes squared differences into account is now in place, and it uses actual depth (Z alone used as signal/data) to perform tests on neutral and non-neutral images, as before. The results are, as expected, far better than before. Figure $[*]$ shows the first 5 matches that are correct and Figure $[*]$ shows the first 12 that are not correct (belonging to different people). Figure $[*]$ shows the classification of those 17 images, which are simply the first ones in the test set (no selection bias). The small scale of this experiment is intended to help track, on an image-by-image basis, what is going on. Larger experiments will follow.
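The appeal of a median over squared depth differences can be seen in a small sketch. The exact formulation used in the experiments is not spelled out here, so the code below assumes the score is simply the median of the per-pixel squared Z differences between two aligned depth maps; both function names are hypothetical.

```python
from statistics import median


def squared_differences(z_a, z_b):
    """Per-pixel squared differences between two aligned depth (Z) grids."""
    return [(a - b) ** 2
            for row_a, row_b in zip(z_a, z_b)
            for a, b in zip(row_a, row_b)]


def median_squared_difference(z_a, z_b):
    """Assumed dissimilarity: median of the squared depth differences."""
    return median(squared_differences(z_a, z_b))


# Two identical flat patches, except for a single spike (e.g. a sensor artefact).
flat = [[1.0, 1.0, 1.0], [1.0, 1.0, 1.0]]
spiked = [[1.0, 1.0, 1.0], [1.0, 100.0, 1.0]]

sq = squared_differences(flat, spiked)
mean_score = sum(sq) / len(sq)                        # dominated by the spike
median_score = median_squared_difference(flat, spiked)  # unaffected by it
```

A single outlier dominates the mean but leaves the median untouched, which is one plausible reason a median-based score behaves better on noisy range data.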

**Figure:** A combined view of X, Y, Z, the image before ICP, after ICP, and the reference image

**Figure:** Difference images of the first 5 pairs taken from the same people

**Figure:** Difference images of the first 12 pairs taken from different people

**Figure:** ROC curve of the 17 images from figures $[*]$ and $[*]$

Next, model-based approaches will be incorporated and then benchmarked against others, notably counterparts that do not take advantage of statistical expression annulment.

Figure $[*]$ shows the same ROC curve extended to account for a lot more image pairs (for which there is no accompanying matrix representing the contribution of each, as before). Comparative curves should be trivial to produce.

**Figure:** ROC curve showing the performance for 83 false-match pairs and 37 true-match pairs