Introduction

Non-rigid registration (NRR) of both pairs and groups of images is widely used as a basis for medical image analysis. Applications include structural analysis, atlas matching and change analysis [#!Crum_BJR!#]. The problem is highly under-constrained and many different algorithms have been proposed.

The aim of non-rigid registration is to
automatically find a meaningful, dense
correspondence across a pair (hence *pairwise* registration), or group (hence *groupwise*) of images. A typical algorithm
consists of a representation of the deformation
fields that encode the spatial variation between
images, an objective function that quantifies the
degree of mis-registration, and a method of
optimising the objective function. Because different
algorithms tend to produce slightly different
results when applied to the same set of
images [#!Zitova_2003!#], there is a need for
methods to evaluate the results of such
registrations.
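
The three components named above can be illustrated with a deliberately small 1D sketch. It is not any of the algorithms evaluated in this paper: the piecewise-linear displacement field, the sum-of-squared-differences objective with a smoothness penalty, the finite-difference gradient descent, and all function and parameter names are assumptions chosen only to keep the example short.

```python
import numpy as np

def dense_displacement(params, x):
    """Representation: displacements at evenly spaced control points,
    linearly interpolated to a dense 1D deformation field."""
    knots = np.linspace(x[0], x[-1], len(params))
    return np.interp(x, knots, params)

def warp(moving, params, x):
    """Resample the moving signal at the displaced positions x + u(x)."""
    return np.interp(x + dense_displacement(params, x), x, moving)

def objective(params, fixed, moving, x, smooth_weight=1e-3):
    """Objective: sum of squared differences plus a small smoothness penalty."""
    ssd = np.sum((warp(moving, params, x) - fixed) ** 2)
    return ssd + smooth_weight * np.sum(np.diff(params) ** 2)

def numerical_gradient(f, params, eps=1e-4):
    """Central finite-difference gradient (an assumption made for brevity)."""
    grad = np.zeros_like(params)
    for i in range(len(params)):
        step = np.zeros_like(params)
        step[i] = eps
        grad[i] = (f(params + step) - f(params - step)) / (2 * eps)
    return grad

def register(fixed, moving, x, n_knots=5, n_iter=50):
    """Optimiser: gradient descent with a backtracking step size."""
    f = lambda p: objective(p, fixed, moving, x)
    params = np.zeros(n_knots)
    for _ in range(n_iter):
        grad = numerical_gradient(f, params)
        step = 1.0
        # Halve the step until the objective actually decreases.
        while step > 1e-8 and f(params - step * grad) >= f(params):
            step *= 0.5
        if step <= 1e-8:
            break  # no improving step found
        params = params - step * grad
    return params

# Synthetic pair: the same bump at two slightly different positions.
x = np.linspace(0.0, 1.0, 200)
fixed = np.exp(-((x - 0.50) / 0.08) ** 2)
moving = np.exp(-((x - 0.56) / 0.08) ** 2)

params = register(fixed, moving, x)
before = objective(np.zeros_like(params), fixed, moving, x)
after = objective(params, fixed, moving, x)
print(f"objective before: {before:.4f}, after: {after:.4f}")
```

Changing any one of the three components (for example the number of control points or the weight of the smoothness term) generally changes the recovered displacement, which is the sense in which different algorithms give different results on the same images.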

Various methods have been proposed for assessing the results of NRR [#!Fitzpatrick_TMI_2001!#,#!Hellier!#,#!Validation-NRR!#,#!Schnabel!#]. One obvious approach is to compare the results of the registration to anatomical ground truth. However, such ground truth is often difficult to obtain. For instance, expert annotation is time-consuming, subjective, and very difficult in 3D. Other evaluation approaches involve the construction of artificial test data, which limits application to 'off-line' evaluation. Furthermore, such artificially generated and manipulated correspondence does not necessarily capture the type of deformation seen in real data. These problems motivate the search for a method of evaluation that does not depend on the existence of ground-truth data, or on making possibly unrealistic assumptions about the nature of the actual correspondence.

The method we present here is based on the idea of constructing statistical models of sets of images, models which capture both the shape and texture variation of the objects imaged (appearance models). Such models have been extensively used as the basis for image interpretation by synthesis. The link between registration and modelling is that the output of registration is a dense correspondence across the set of images, and such a set of correspondences is exactly what is required to construct the shape and texture models [#!Cootes_ECCV_1998!#,#!Edwards!#]. Varying the correspondence across a set varies the appearance model built upon this correspondence. If the NRR results in a poor estimate of correspondence across a set, then the appearance model constructed from the data will be unsatisfactory. A better correspondence ought to produce a better appearance model. This allows us to map the problem of evaluating registration to that of evaluating the model generated from the output of the registration.

The structure of this paper is as follows. We
first give a brief description of the background
to both the assessment of registration, and of
the construction of appearance models, and
explain in more detail the link between the two.
We present quantitative measures which can be
used to assess the quality of such models, and hence
of the registration upon which such models are
built. The behaviour of these measures is
investigated, in particular how they compare
to an assessment based on overlaps of manually annotated
ground-truth data.
The results demonstrate that our method gives measures
closely correlated with such ground-truth overlap estimates,
and that our measures are actually *more* sensitive to mis-registration
than the overlap measures.
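
The particular overlap measure is not specified at this point in the text. As a point of reference, a minimal sketch of one common choice, the Jaccard (Tanimoto) overlap between two label images, is given below; the function name and the toy label maps are hypothetical and are not taken from the experiments.

```python
import numpy as np

def jaccard_overlap(labels_a, labels_b, label):
    """Jaccard overlap |A n B| / |A u B| for one label in two label images."""
    a = np.asarray(labels_a) == label
    b = np.asarray(labels_b) == label
    union = np.logical_or(a, b).sum()
    if union == 0:
        # Convention assumed here: a label absent from both images counts
        # as perfect agreement.
        return 1.0
    return np.logical_and(a, b).sum() / union

# Toy 2D label maps: label 1 covers 4 pixels in one map and 3 in the other,
# with 3 pixels in common.
annotation_a = np.array([[0, 1, 1],
                         [0, 1, 1]])
annotation_b = np.array([[0, 0, 1],
                         [0, 1, 1]])

overlap = jaccard_overlap(annotation_a, annotation_b, label=1)
print(overlap)  # 3 shared pixels / 4 pixels in the union = 0.75
```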

Finally, we use the measures we have developed to compare various registration algorithms when applied to the registration of sets of 2D MR images of human brains. In particular, we are able to show the quantitative superiority of groupwise registration over a pairwise method.