Interpretation by synthesis has become a popular approach to image interpretation, because it provides a systematic framework for applying rich knowledge of the problem domain. Active Appearance Models (AAMs) [1,2] are typical of this approach. There are two essential components: a generative model of appearance, and a method for searching the model space for the instance that best matches a given target image. In this paper we concentrate on the first of these.
Many generative models of appearance are statistical in nature, derived from sets of training images. AAMs use models that are linear in both shape and texture. Their construction relies on finding a dense correspondence between images in the training set, which can be based on manual annotation or on an automated approach (see below). Other approaches to constructing appearance models include methods based on non-linear manifolds in appearance space and on kernel PCA. In the remainder of the paper we restrict our attention to AAMs, but the methods presented could be applied to any generative appearance model.
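The linear model described above can be sketched as follows. This is an illustrative reconstruction, not the authors' implementation: the training matrix, number of modes, and function names are assumptions, and the texture vectors stand in for images already brought into dense correspondence.

```python
import numpy as np

# Hypothetical training set: each row is one image, vectorised after dense
# correspondence has brought all images into alignment (sizes are illustrative).
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 200))  # 50 training images, 200 appearance parameters each

mean = X.mean(axis=0)
# PCA via SVD of the centred data yields the model's linear modes of variation.
U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
n_modes = 5
P = Vt[:n_modes]                            # principal modes, one per row
sd = s[:n_modes] / np.sqrt(X.shape[0] - 1)  # per-mode standard deviations

def generate(b):
    """Synthesise a new appearance instance from model parameters b."""
    return mean + b @ P

# Sampling b from N(0, diag(sd^2)) draws instances from the model distribution.
sample = generate(rng.normal(size=n_modes) * sd)
```

Sampling the parameters `b` in this way is what produces the "distribution of model-generated images" that the evaluation measures below compare against the training distribution.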
There has been relatively little previous work on model evaluation. One approach is to test a complete interpretation-by-synthesis framework, providing an implicit evaluation of the models themselves. This requires access to ground truth, allowing interpretation errors to be quantified [1,8]. The most serious weakness of this approach is that it confounds the effects of model quality and the behaviour of the search algorithm. The need for ground truth data is also undesirable, because it is labour intensive to provide and can introduce subjective error.
We propose a method for evaluating appearance models that uses only the training set and the model to be evaluated. This builds on the work of Davies et al., who tackled the simpler problem of evaluating shape models. Our approach is to measure directly the similarity between the distribution of images generated by the model and the distribution of training images. We define two measures: specificity, the overlap of the distribution of model-generated images with the distribution of training images; and generalisation ability, the overlap of the distribution of training images with the distribution of model-generated images. We validate the approach by generating progressively degraded models, demonstrating that both specificity and generalisation also degrade monotonically. We compute the sensitivity of the two measures, showing that specificity is the more sensitive measure of model quality, and then apply the method to a real model evaluation problem.
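The two measures can be sketched with a simple distance-based proxy for distribution overlap: specificity as the mean distance from each model-generated image to its nearest training image, and generalisation as the mean distance from each training image to its nearest model-generated image (smaller is better in both cases). This is a minimal sketch under that nearest-neighbour assumption; the function names and image representation are illustrative, not the paper's definitions.

```python
import numpy as np

def specificity(model_samples, training):
    """Mean distance from each model-generated image (row) to its nearest
    training image: large values mean the model generates implausible images."""
    d = np.linalg.norm(model_samples[:, None, :] - training[None, :, :], axis=2)
    return d.min(axis=1).mean()

def generalisation(model_samples, training):
    """Mean distance from each training image to its nearest model-generated
    image: large values mean the model fails to cover the training data."""
    d = np.linalg.norm(training[:, None, :] - model_samples[None, :, :], axis=2)
    return d.min(axis=1).mean()

# Illustrative check: samples drawn near the training distribution should score
# better (lower) on both measures than samples from a deliberately shifted model.
rng = np.random.default_rng(1)
train = rng.normal(size=(100, 10))
good_model = rng.normal(size=(100, 10))
bad_model = rng.normal(size=(100, 10)) + 2.0
```

A progressively degraded model, as in the validation experiment described above, would show both measures worsening as the degradation increases.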