The next few sections explain in some depth the notion of statistical models, and especially that of (statistical) appearance models. They then move on to describe active appearance models. Since these extend active shape models, a brief introduction to shape models is worthwhile to begin with.
Given a collection of images depicting an object which possesses some innate properties, it is possible to express the visual appearance or shape of that object in a way that discards subtle changes in viewpoint, object position, object size et cetera, and is robust to some level of object deformation. The object that appears in the group of images need not even be the exact same one; it can be any object belonging to one common class. Some of the variation that is typical for that class can be handled (essentially, understood) reliably with the help of elementary transformations (to be described later), but their functionality is inevitably very limited and constrained.
There are statistical methods which allow the encoding of variability that was learned during a so-called training process. That training process requires little more than an exhaustive inspection of the set of images where objects (or shapes) appear. However, in order to interpret a large set of objects, some simplification steps are required. This is because most images in which objects lie are expected to be relatively large in practice, certainly large enough to result in an exponential blow-up.
A method is sought which reduces the amount of information that is required to describe an object of interest and the different forms it can take. This is done by selecting points of interest which lie in the image, ones which form a representative subset of the image contents. Points must be picked so that they jointly preserve knowledge regarding the object of interest, which is often well hidden in the pool of image pixels. Such points are termed landmarks. Landmarks are positions in the image which effectively distinguish one object from another in the set of images (see the figure on landmark identification). They also have useful spatial properties: joined together, they form near-optimal curves (or contours) which make up genuine shapes. The concatenation of the coordinates of these landmarks can then describe an image (or rather the object being focused on) in a concise and useful representation. In 2-D, for n landmarks, a vector of size 2n can infer the shape of the object present in an image.
This inference is lossy: the vector is simply a discrete reconstruction of the shape in the image. It is not the actual image.
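As a minimal sketch of this representation, the Python snippet below flattens n hypothetical 2-D landmarks into a single vector of 2n numbers and recovers them again. The interleaved ordering (x1, y1, ..., xn, yn) and the function names are assumptions made purely for illustration; any ordering would serve, provided it is applied consistently across the whole set.

```python
from typing import List, Tuple

Point = Tuple[float, float]


def landmarks_to_vector(landmarks: List[Point]) -> List[float]:
    """Concatenate (x, y) landmark coordinates into one flat shape vector."""
    vector: List[float] = []
    for x, y in landmarks:
        vector.extend([float(x), float(y)])
    return vector


def vector_to_landmarks(vector: List[float]) -> List[Point]:
    """Invert landmarks_to_vector, assuming the same interleaved ordering."""
    if len(vector) % 2 != 0:
        raise ValueError("a 2-D shape vector must hold an even number of values")
    return [(vector[i], vector[i + 1]) for i in range(0, len(vector), 2)]


# Hypothetical landmarks placed on the corners of a unit square.
square = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
shape_vector = landmarks_to_vector(square)
```

Only the landmark positions survive this step; everything between them is discarded, which is why the vector is a discrete reconstruction and not the image itself.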
It is worth pointing out that landmark points can be chosen arbitrarily. This turns out to be a serious issue, as will be seen later along with possible solutions. Identification of objects is in most cases done by drawing lines or selecting surfaces which surround these objects. Given continuous elements such as lines or surfaces, no criterion makes it obvious how to sample them suitably using points. The choice of points affects the quality of reconstruction, as measured by the associated errors.
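One simple, and admittedly arbitrary, way of placing points on a continuous contour is to space them evenly by arc length. The sketch below does so for an open polyline. The function name `resample_polyline` and the equal-spacing criterion are assumptions chosen for illustration; nothing here suggests that this spacing is a good choice of landmarks.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def resample_polyline(points: List[Point], count: int) -> List[Point]:
    """Return `count` points spaced evenly by arc length along an open polyline."""
    if count < 2 or len(points) < 2:
        raise ValueError("need at least two input points and two output points")
    # Cumulative arc length at every input vertex.
    cumulative = [0.0]
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        cumulative.append(cumulative[-1] + math.hypot(x1 - x0, y1 - y0))
    total = cumulative[-1]
    if total == 0.0:
        raise ValueError("the polyline has zero length")
    samples: List[Point] = []
    segment = 0
    for k in range(count):
        target = total * k / (count - 1)
        # Advance to the segment that contains the target arc length.
        while segment < len(points) - 2 and cumulative[segment + 1] < target:
            segment += 1
        span = cumulative[segment + 1] - cumulative[segment]
        t = 0.0 if span == 0.0 else (target - cumulative[segment]) / span
        (x0, y0), (x1, y1) = points[segment], points[segment + 1]
        samples.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    return samples


# Hypothetical L-shaped contour sampled with five points.
contour = [(0.0, 0.0), (2.0, 0.0), (2.0, 2.0)]
sampled = resample_polyline(contour, 5)
```

Equal spacing treats a sharp corner and a flat stretch alike, which illustrates the difficulty: a different, equally defensible rule would yield different points and hence a different reconstruction error.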
With the concise landmark-based representation (described above) set to be the convention, and a collection of fair-sized vectors in place of a massive collection of images, it should be possible to express (in a feasible way) the legal range of each one of the vector components. This in essence establishes the model. It is an entity that can be manipulated to reconstruct all the shapes (or, as later explained, images) it originated from, and far beyond that. The model encapsulates the variation which was learned from the data, and its performance usually improves as more legal examples are viewed and 'fed' to support further training. Varying the parameters of the model can generate new (unseen) examples as long as that variation is restricted to the legal range, as learned from the training examples. The vector representation mentioned beforehand can also be looked at as a description of a fixed location in a space that comprises 2n dimensions, where n is the number of 2-D landmarks (see the illustrative scatter figure). This turns out to be a useful demonstrative idea, as will be seen later when dimensionality reduction is applied.
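The idea of a legal range per vector component can be sketched as follows. The bounds used here, the mean plus or minus three standard deviations, are an assumption made for illustration, as are the names `legal_ranges` and `clamp_to_ranges`. The sketch also treats each component independently, which is a simplification of the model described in this chapter.

```python
import statistics
from typing import List, Tuple

Range = Tuple[float, float]


def legal_ranges(training: List[List[float]], width: float = 3.0) -> List[Range]:
    """Per-component (low, high) bounds: mean +/- width * standard deviation."""
    ranges: List[Range] = []
    # zip(*training) walks the training vectors one component at a time.
    for component in zip(*training):
        mean = statistics.fmean(component)
        spread = statistics.pstdev(component)
        ranges.append((mean - width * spread, mean + width * spread))
    return ranges


def clamp_to_ranges(vector: List[float], ranges: List[Range]) -> List[float]:
    """Pull every component of a candidate vector back inside its legal range."""
    return [min(max(value, low), high) for value, (low, high) in zip(vector, ranges)]


# Hypothetical training vectors: the first component varies, the second does not.
training_vectors = [[0.0, 10.0], [2.0, 10.0], [4.0, 10.0]]
ranges = legal_ranges(training_vectors)
candidate = clamp_to_ranges([100.0, 9.0], ranges)
```

A candidate that strays outside what the training set supports is pulled back to the nearest legal value, so any vector produced this way remains plausible with respect to the examples seen.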
Shape models are "statistical information containers" which can be built from images once the overlaid landmark points have been identified and recorded. In order to make such a mechanism possible, it is vital first to achieve consistency amongst the coordinates of all landmarks. This means that all points need to be projected onto a common space, a process whose purpose is to ease collective analysis. That process can also be thought of as an alignment step, which relates to the next chapter. Further issues concerning normalisation, projection and the like are described in slightly more detail later in this document.
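A minimal sketch of such a projection onto a common space is given below. It removes only translation and scale, by moving the centroid to the origin and scaling the shape to unit size; rotation is deliberately left untouched, so this is a partial stand-in for a full alignment and not the procedure used in this work. The function name `normalise_shape` is hypothetical.

```python
import math
from typing import List, Tuple

Point = Tuple[float, float]


def normalise_shape(landmarks: List[Point]) -> List[Point]:
    """Translate the centroid to the origin and scale the shape to unit norm."""
    n = len(landmarks)
    if n == 0:
        raise ValueError("at least one landmark is required")
    cx = sum(x for x, _ in landmarks) / n
    cy = sum(y for _, y in landmarks) / n
    centred = [(x - cx, y - cy) for x, y in landmarks]
    size = math.sqrt(sum(x * x + y * y for x, y in centred))
    if size == 0.0:
        raise ValueError("all landmarks coincide, so the scale is undefined")
    return [(x / size, y / size) for x, y in centred]


# Two hypothetical squares that differ only in position and size.
small = [(0.0, 0.0), (1.0, 0.0), (1.0, 1.0), (0.0, 1.0)]
large = [(10.0, 10.0), (14.0, 10.0), (14.0, 14.0), (10.0, 14.0)]
aligned_small = normalise_shape(small)
aligned_large = normalise_shape(large)
```

After normalisation the two squares coincide, so their landmark coordinates can be analysed collectively without position or size masquerading as shape variation.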
A human expert usually performs annotation, or landmarking, of the images with the aid of special-purpose software tools. In recent years, automatic alternatives have shown great promise, and these extend to 3-D too. A later chapter is dedicated purely to that one piece of work, which is so fundamental to the current research.
Appearance models were later developed by Edwards et al. Their greatest advantage, or essence, was that they were able to sample grey-level data from images rather than just points (incorporation of full colour has since been made possible, e.g. Stegmann et al., [WWW-5]). Appearance models therefore retain information about what an image looks like, rather than just its form as visualised by contours (or surfaces in 3-D). Just as points in the image were chosen earlier, grey-level values (also referred to as intensity or texture) can be systematically extracted from a normalised image and preserved in an intensity vector for later analysis. This normalisation process and the representation of the intensity vector will be outlined later in this chapter.
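The extraction of an intensity vector can be sketched as below. The image is assumed to have been normalised already, and the sampling positions are assumed to be given; both are hypothetical inputs. The zero-mean, unit-variance scaling is a common choice used here as a stand-in, since the normalisation adopted in this work is only outlined later.

```python
import statistics
from typing import List, Tuple


def sample_intensities(image: List[List[float]],
                       positions: List[Tuple[int, int]]) -> List[float]:
    """Read grey-level values at (row, column) positions into a flat vector."""
    return [float(image[row][col]) for row, col in positions]


def normalise_intensities(values: List[float]) -> List[float]:
    """Shift to zero mean and scale to unit variance (an assumed convention)."""
    mean = statistics.fmean(values)
    spread = statistics.pstdev(values)
    if spread == 0.0:
        # A perfectly flat patch carries no contrast to normalise.
        return [0.0 for _ in values]
    return [(value - mean) / spread for value in values]


# Hypothetical 2x2 grey-level image, sampled in row-major order.
image = [[10.0, 20.0], [30.0, 40.0]]
positions = [(0, 0), (0, 1), (1, 0), (1, 1)]
intensity_vector = normalise_intensities(sample_intensities(image, positions))
```

As with landmarks, the sampling positions must be the same for every image in the set; otherwise corresponding entries of two intensity vectors would not describe the same part of the object.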
What enables appearance models to exhibit quite an astonishing graphical resemblance to reality is that, at the later stages of the process, a combined vector is made available. It incorporates both shape and intensity while capturing how change in one affects the other (e.g. how expansion results in darkening and vice versa). Hence it has a notion of the correlation between the two, a notion that depends on the training data and on Principal Component Analysis. Although appearance models are usually not as quick and accurate as shape models, they contain all the information that is held in the shape models and in that sense are a superset of shape models. Also, some techniques have been developed and employed to speed up the matching of appearance models to image targets (described later in this document). Tasks such as the matching of an appearance model to some target image are described and illustrated later in this chapter.
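To illustrate how a combined vector can expose the correlation between shape and intensity, the sketch below concatenates a weighted shape vector with an intensity vector and finds the direction of greatest variance by power iteration. This is a deliberately simplified stand-in for Principal Component Analysis; the weight, the toy data and the function names are all assumptions, and the sketch is not the construction used in this work.

```python
import math
from typing import List


def combine(shape: List[float], intensity: List[float], weight: float = 1.0) -> List[float]:
    """Concatenate a weighted shape vector with an intensity vector."""
    return [weight * s for s in shape] + list(intensity)


def first_principal_component(samples: List[List[float]], iterations: int = 200) -> List[float]:
    """Unit vector of greatest variance, found by power iteration on the covariance."""
    count, dim = len(samples), len(samples[0])
    mean = [sum(column) / count for column in zip(*samples)]
    centred = [[v - m for v, m in zip(row, mean)] for row in samples]
    covariance = [[sum(r[i] * r[j] for r in centred) / count for j in range(dim)]
                  for i in range(dim)]
    # Arbitrary start; power iteration only needs it not to be orthogonal
    # to the component being sought.
    direction = [float(i + 1) for i in range(dim)]
    for _ in range(iterations):
        product = [sum(covariance[i][j] * direction[j] for j in range(dim))
                   for i in range(dim)]
        norm = math.sqrt(sum(p * p for p in product))
        if norm == 0.0:
            break
        direction = [p / norm for p in product]
    return direction


# Toy data: as the (one-number) shape expands, the intensity darkens.
shapes = [[1.0], [2.0], [3.0]]
intensities = [[30.0], [20.0], [10.0]]
# The weight of 10 is an assumed factor that brings both parts to a similar scale.
combined = [combine(s, g, weight=10.0) for s, g in zip(shapes, intensities)]
component = first_principal_component(combined)
```

The shape and intensity entries of the resulting component carry opposite signs, which encodes the observation that expansion goes together with darkening. The overall sign of the component is arbitrary; only the relation between its entries is meaningful.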