
Wavelets (University of Texas)

An interesting strand of work comes from a text summarising key work from the University of Texas, work which explores the use of wavelets, an approach described in New Approaches to Automatic 3-D and 2-D+3-D Face Recognition, a thesis available in the university's repository at https://repositories.lib.utexas.edu/bitstream/handle/2152/ETD-UT-2011-05-2990/JAHANBIN-DISSERTATION.pdf?sequence=1. The work uses an interesting dataset, which in this case seems limited to a recently-acquired set matching particular requirements and protocol. One description notes that T3FRD is the largest free and publicly available database of co-registered 2-D and 3-D face images suitable for separate evaluation of the recognition task. For its construction, 1196 pairs of high-resolution range and colored portrait images were captured from 116 adult subjects at the former Advanced Digital Imaging Research (ADIR) LLC (Friendswood, TX) using a MU-2 stereo imaging system made by 3Q Technologies Ltd. (Atlanta, GA), under contract to NIST.

The work seems novel enough, as it combines the data embedded and encoded in the form of texture with the underlying geometry. The core, or the 'engine', is quite simple to grasp since it mostly relies on the compressibility of a finite set of features, or rather their mutual entropy. It is similar to work I did with NRR (non-rigid registration), where a set of wavelets is used to assess misregistration through inherent correlation, with compressibility serving as a measure of similarity, or as an entropy estimate (somewhat related to mutual information). Quite a few people have tried this before, with limited success. It is a 'cheap' way to get a similarity measure working, using libraries of the kind employed by JPEG, for example.
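To make the idea concrete, below is a minimal sketch of compressibility used as a similarity measure, in the style of the normalised compression distance; the choice of zlib as the compressor and the byte quantisation of the features are my own assumptions, not details taken from the thesis.

```python
import zlib
import numpy as np

def ncd(x: bytes, y: bytes) -> float:
    """Normalised compression distance: small when x and y share structure."""
    cx = len(zlib.compress(x))
    cy = len(zlib.compress(y))
    cxy = len(zlib.compress(x + y))
    return (cxy - min(cx, cy)) / max(cx, cy)

def features_to_bytes(f: np.ndarray) -> bytes:
    """Quantise a feature vector to 8 bits so the compressor can find patterns."""
    f = (f - f.min()) / (f.max() - f.min() + 1e-12)
    return (f * 255).astype(np.uint8).tobytes()

# Two similar signals compress well together; an unrelated one does not.
rng = np.random.default_rng(0)
a = rng.standard_normal(512)
b = a + 0.05 * rng.standard_normal(512)   # near-duplicate of a
c = rng.standard_normal(512)              # unrelated signal
print(ncd(features_to_bytes(a), features_to_bytes(b)))  # lower (more similar)
print(ncd(features_to_bytes(a), features_to_bytes(c)))  # higher (less similar)
```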

The explanations are well organised and the opening parts of the thesis are clear. They provide a good, broad overview of existing methodologies. The latter sections show the author using their own data, matching the requirements from NIST, preprocessed and aligned using ICP. This also corrects pose and scale, which makes the T3FRD database dependent upon other methods. Although it is smaller than the FRGC database, it is said to be a lot simpler to deal with (for instance, the poses in it are more limited). The texture gets exploited as well, and it too is refined in order to remove the effects of noise and to avoid comparative analyses in which rigid or affine stages act as distinguishers at the expense of the later stages, which are typically the more interesting ones and the more actively researched.

The author is looking at rigid areas of the face, so recognition rates are understandably high. Comparisons, performance-wise, should preferably involve the T3FRD as the only dataset to work on. The results reported when methods are applied to the older but commonly used Face Recognition Grand Challenge (FRGC) should be treated as an altogether separate benchmark with different levels of difficulty (increasing based on the semester in which the images were acquired). This perhaps limits the number of methods actually compared in this paper, as not many studies were done with the same database. An important point that gets raised is that researchers can make fair comparisons between competing algorithms based on recognition capability alone, without biases introduced by pre-processing, which makes T3FRD an attractive alternative to the older FRGC database. It does, however, limit the scope of benchmarks based on the literature. There is a short description of the hole-filling and median filter used (with a 3x3 kernel). A lot of space is dedicated to justifying the use of T3FRD rather than FRGC, even though it does not have many subjects in it, judging by the Results section (116 subjects enrolled for the experiments, only 18 used for training). This can make the recognition problem simpler. Gupta et al. published similar results in IJCV last year, proving, or at least validating to a degree, the quality of these results. Perhaps the text could be made more concise by tightening the description of the dataset, which seems slightly promotional at times and not exceedingly relevant to the methods presented in the paper.
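As an aside, here is a minimal sketch of the sort of range-image clean-up described, assuming holes are marked as NaN and a 3x3 median filter is applied afterwards; the nearest-neighbour hole filling is my own assumption, since the interpolation scheme is not spelled out here.

```python
import numpy as np
from scipy import ndimage

def preprocess_range_image(z: np.ndarray) -> np.ndarray:
    """Fill holes in a range image, then smooth spikes with a 3x3 median filter."""
    holes = np.isnan(z)
    if holes.any():
        # Replace each hole with the value of the nearest valid pixel.
        _, (iy, ix) = ndimage.distance_transform_edt(holes, return_indices=True)
        z = z[iy, ix]
    # A 3x3 median filter suppresses impulsive sensor noise.
    return ndimage.median_filter(z, size=3)

# Toy example: a flat surface with one hole and one spike.
z = np.ones((8, 8))
z[3, 3] = np.nan   # missing data (hole)
z[5, 5] = 9.0      # impulsive noise (spike)
print(preprocess_range_image(z))
```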

Addressing the methods at hand, there is merit in the adopted approach, which takes wavelets (barycentric in this case) and multi-scale (coarse-to-fine) ideas to find locations, extract wavelet coefficients, and sometimes use PCA to model the variation as characterised by those succinct, localised descriptors. It actually pools together three recognition sub-systems, aggregated with LDA to make a decision at several levels. For landmark detection, Gabor-based methods are utilised, identifying the eye corners and the nose tip. A region ensemble, where the face is treated as a set of regions, would work well provided the division into regions is accurate. The multitude of regions provides extra robustness in case one discriminant is overly weak or misleading.
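For illustration, a minimal sketch of score-level fusion with LDA follows, assuming three sub-systems that each emit a match score per comparison; the synthetic scores and the sklearn API are my own choices, not details taken from the thesis.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis

rng = np.random.default_rng(1)

# Each row: match scores from three sub-systems for one face comparison.
# Label 1 = genuine (same subject), 0 = impostor (different subjects).
genuine = rng.normal(loc=[0.8, 0.7, 0.75], scale=0.1, size=(200, 3))
impostor = rng.normal(loc=[0.4, 0.5, 0.45], scale=0.1, size=(200, 3))
X = np.vstack([genuine, impostor])
y = np.array([1] * 200 + [0] * 200)

# LDA learns a linear combination of the three scores that best
# separates genuine from impostor comparisons.
fusion = LinearDiscriminantAnalysis().fit(X, y)
probe = np.array([[0.75, 0.65, 0.7]])   # scores for a new comparison
print(fusion.predict(probe))            # fused accept/reject decision
print(fusion.decision_function(probe))  # fused similarity score
```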

The use of Gabor jets is not so new, but it does provide compelling starting points for fiducials (within 1 millimetre of the manual landmark for the nose tip and about two for the eye corners). The descriptions in the remainder of the text are rich in words and light on equations that would otherwise help formalise the process or methodology before leaping to results. It is not entirely clear by this stage (until the later block diagram and composition of methods) what the novelty of this work is. A figure showing this block diagram helps visualise the proposed framework, and had it been shown earlier it would have helped the reader foresee the structural aspects of the problem domain, as several IJCV papers in this area do. The separation between methods and experiments is not clear because the results section keeps introducing new methods, or variants of the basic, pertinent methods.
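For readers unfamiliar with the term, below is a minimal sketch of a Gabor jet: the vector of multi-scale, multi-orientation Gabor filter responses at a single pixel, compared with a normalised dot product. The filter-bank parameters here are my own assumptions, not those of the thesis.

```python
import numpy as np
from scipy.signal import fftconvolve
from skimage.filters import gabor_kernel

def gabor_jet(image: np.ndarray, y: int, x: int) -> np.ndarray:
    """Magnitudes of Gabor responses at (y, x) over scales and orientations."""
    jet = []
    for frequency in (0.1, 0.2, 0.3):            # assumed scales
        for theta in np.arange(4) * np.pi / 4:   # assumed orientations
            k = gabor_kernel(frequency, theta=theta)
            resp = fftconvolve(image, k, mode="same")
            jet.append(np.abs(resp[y, x]))
    return np.asarray(jet)

def jet_similarity(j1: np.ndarray, j2: np.ndarray) -> float:
    """Normalised dot product; near 1 for similar local appearance."""
    return float(j1 @ j2 / (np.linalg.norm(j1) * np.linalg.norm(j2) + 1e-12))

# A candidate landmark is scored by comparing its jet against a model jet
# learned from manually annotated examples (e.g. eye corners, nose tip).
img = np.random.default_rng(2).standard_normal((64, 64))
print(jet_similarity(gabor_jet(img, 32, 32), gabor_jet(img, 33, 32)))
```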

ROC and CMC curves show impressive performance, but one must bear in mind the quality of the data, the size of the set, and the multi-modality of sorts (not just range images). Some of the compared-to studies (including one from Le Zou et al.) are from the same institute; Mahoor et al. from the University of Miami is perhaps a better one to compare to, although there may be a bias due to using another university's database. McKeon et al. (Robert McKeon and Trina Russ of Digital Signal) train on the FRGC database and get competitive results, very much comparable to those from the group at the University of Texas.

Overall, the performance shown in this work is high; the degree of difficulty is hard to assess, as not many other studies were done with this dataset (none outside the University of Texas); and there is novelty in the way measures are combined to attain a powerful discriminant, rooted primarily in three limited regions of the face (rigidly registered to begin with). The work would benefit from a set of results from experiments applied to the FRGC dataset, in order to demonstrate performance from another frame of reference.

Addressing more specific points: while the paper has real importance, there are concerns, and it comes with caveats because an adequately comprehensive benchmark is lacking. The work is original; somewhat dated and a bit dependent on similar work, but original nonetheless. The notable absence of more tests definitely stands out. A lot of space is dedicated to advocating the use of this data, whereas it would be more useful to explain the methods; the abstract does highlight this limitation. The methods could be described better, especially in terms of the order of presentation; it is mostly acceptable, but there is room for improvement. The grammatical quality of the text is high, with a few exceptions such as typographical errors, many formatting inconsistencies in the references (e.g. page numbers, commas, et cetera), and some issues that require ironing out. Among the drawbacks of some of the covered methods is that they slow down, with performance degrading non-linearly; nothing is said in the paper about performance, e.g. the time taken to run the experiments. Normalisation is well defended in the text, which explains the composition of regions.

Related to this, there is work in the International Journal of Computer Vision [12] (online version: http://live.ece.utexas.edu/publications/2010/sg_ijcv_june10.pdf). It is prior work on geodesic distances for recognition, by S. Gupta, M. Markey, and A. Bovik. Referring specifically to their use of geodesic distances, they write:

"Lastly, we develop a completely automatic face recognition algorithm that employs facial 3D Euclidean and geodesic distances between these 10 automatically located anthropometric facial fiducial points and a linear discriminant classifier. [...] We develop a successful 3D face recognition algorithm that employs Euclidean and geodesic facial anthropometric distance features and a linear discriminant analysis (LDA) classifier. [...] As features for our proposed Anthroface 3D algorithm, we employed 300 3D Euclidean distances and 300 geodesic distances between all of the possible pairs [...] We computed geodesic distances along the facial surface using Dijkstra's shortest path algorithm (Dijkstra 1959; Tenenbaum et al. 2000). Besides 3D Euclidean distances, the motivation for employing geodesic distances was that previous studies have shown that geodesic distances are better at representing 'free-form' 3D objects than 3D Euclidean distances (Hamza and Krim 2006). Furthermore, a recent study suggested that changes in facial expressions (except for when the mouth is open) may be modeled as isometric deformations of the facial surface (Bronstein et al. 2005). When a surface is deformed isometrically, intrinsic properties of the surface, including Gaussian and mean curvature and geodesic distances, are preserved (Do Carmo 1976). Hence, algorithms based on geodesic distances are likely to be robust to changes in facial expressions. From among the 300 Euclidean and 300 geodesic distances, we selected subsets of the most discriminatory distance features, using the stepwise linear discriminant analysis [...] Using this procedure we identified the 106 and 117 most discriminatory Euclidean and geodesic distance features from among the 300 Euclidean and 300 geodesic distances, respectively. We pooled these 106 Euclidean and 117 geodesic anthropometric distances together, and using a second stage stepwise linear discriminant analysis procedure, we identified the final combined set of 123 most discriminatory anthropometric facial distance features."
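Since the quoted passage leans on Dijkstra's algorithm for geodesic distances, here is a minimal sketch of that computation on a triangulated facial surface, assuming the mesh is given as vertex coordinates and triangle indices; the toy mesh and the scipy-based graph search are my own choices, not theirs.

```python
import numpy as np
from scipy.sparse import coo_matrix
from scipy.sparse.csgraph import dijkstra

def geodesic_distances(vertices: np.ndarray, triangles: np.ndarray,
                       sources: np.ndarray) -> np.ndarray:
    """Approximate geodesic distances from source vertices along mesh edges."""
    # Collect the three edges of every triangle and deduplicate shared edges.
    e = np.vstack([triangles[:, [0, 1]], triangles[:, [1, 2]], triangles[:, [2, 0]]])
    e = np.unique(np.sort(e, axis=1), axis=0)
    w = np.linalg.norm(vertices[e[:, 0]] - vertices[e[:, 1]], axis=1)
    n = len(vertices)
    # Symmetric sparse adjacency matrix weighted by Euclidean edge length.
    g = coo_matrix((np.r_[w, w],
                    (np.r_[e[:, 0], e[:, 1]], np.r_[e[:, 1], e[:, 0]])),
                   shape=(n, n)).tocsr()
    # Shortest graph paths approximate surface geodesics on a dense mesh.
    return dijkstra(g, directed=False, indices=sources)

# Toy mesh: two triangles sharing an edge.
v = np.array([[0.0, 0, 0], [1, 0, 0], [0, 1, 0], [1, 1, 1]])
t = np.array([[0, 1, 2], [1, 3, 2]])
print(geodesic_distances(v, t, sources=np.array([0])))
```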

Figure 2 of that paper shows an example of a simpler implementation, as described below.

This is rationalised by evaluating "the Anthroface 3D recognition algorithm (Sect. 3.3), with Euclidean and geodesic distances between 25 arbitrary facial points (Fig. 2) instead of the 25 anthropometric fiducial points (Fig. 1). These points were located in the form of a 5x5 rectangular grid positioned over the primary facial features of each face (Fig. 2). We chose these particular facial points as they measure distances between the significant facial landmarks, including the eyes, nose and the mouth regions, without requiring localization of specific fiducial points. A similar set of facial points was also employed in a previous 3D face recognition algorithm for aligning 3D facial surfaces using the ICP algorithm (Lu et al. 2006)."

For all faces in the test data set, the 123 most discriminatory anthropometric Euclidean and geodesic distance features x were first computed. They were projected onto the 11-D LDA space as y = Wx, where the projection W was learned using the training data set.
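A rough sketch of that pipeline follows, with forward sequential selection standing in for their stepwise LDA procedure (sklearn has no exact stepwise LDA); the synthetic data and the dimensions are illustrative assumptions only.

```python
import numpy as np
from sklearn.discriminant_analysis import LinearDiscriminantAnalysis
from sklearn.feature_selection import SequentialFeatureSelector

rng = np.random.default_rng(3)
n_subjects, n_features = 12, 60   # toy stand-ins for the subjects / 600 distances
X = rng.standard_normal((n_subjects * 10, n_features))
y = np.repeat(np.arange(n_subjects), 10)
# Make a handful of distance features genuinely subject-dependent.
X[:, :8] += y[:, None] * 0.5

# Forward selection approximates the paper's stepwise discriminant analysis.
selector = SequentialFeatureSelector(
    LinearDiscriminantAnalysis(), n_features_to_select=20, direction="forward")
X_sel = selector.fit_transform(X, y)

# Project the selected distances into a low-dimensional LDA space, y = Wx.
lda = LinearDiscriminantAnalysis(n_components=n_subjects - 1).fit(X_sel, y)
Y = lda.transform(X_sel)   # analogous to the paper's 11-D LDA space
print(Y.shape)
```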

They are using the T3FRD database, which makes things a lot simpler due to the aforementioned factors.

Roy Schestowitz 2012-01-08