Non-rigid registration (NRR) of both pairs and groups of images has been used increasingly in recent years as a basis for medical image analysis. Applications include structural analysis, atlas matching and change analysis. The problem is highly under-constrained, and the many different algorithms that have been proposed generally produce different results for a given set of images. We present two methods for assessing the performance of non-rigid registration algorithms applied to groups of images; one requires ground truth to be provided a priori, whereas the other does not. We compare the two approaches by systematically varying the quality of registration of a set of MR images of the brain.
The first of the proposed methods for assessing registration
quality uses a generalisation of Tanimoto's spatial overlap measure.
We start with a manual mark-up of each image, providing an anatomical/tissue
label for each voxel, and measure the overlap of corresponding labels
following registration. Each label is represented using a binary image,
but after warping and interpolation into a common reference frame,
based on the results of NRR, we obtain a set of fuzzy label images.
These are combined in a generalised overlap score [1]:

$$ \mathcal{O} \;=\; \frac{\sum_{\text{pairs } k} \sum_{\text{labels } l} \sum_{\text{voxels } i} \min(A_{kli},\, B_{kli})}{\sum_{\text{pairs } k} \sum_{\text{labels } l} \sum_{\text{voxels } i} \max(A_{kli},\, B_{kli})} \qquad (1) $$

where $A_{kli}$ and $B_{kli}$ are the fuzzy membership values of voxel $i$ for label $l$ in the two images of pair $k$, and the sum over pairs runs over all distinct pairs of images in the group.
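As an illustration, the group-wise fuzzy overlap described above can be sketched as follows. This is a minimal sketch, assuming the warped fuzzy label images are stored as a single array of shape (images, labels, voxels); the function name `generalised_overlap` is ours, not from [1]:

```python
import numpy as np

def generalised_overlap(labels):
    """Generalised Tanimoto overlap for a group of fuzzy label images.

    labels: array of shape (n_images, n_labels, n_voxels) holding the fuzzy
    membership of each voxel in each label after warping into the common
    reference frame.  Accumulates voxel-wise MIN (intersection) over MAX
    (union) across all labels and all distinct image pairs; 1.0 means
    perfect agreement, 0.0 means no overlap at all.
    """
    n = labels.shape[0]
    num = den = 0.0
    for a in range(n):
        for b in range(a + 1, n):
            num += np.minimum(labels[a], labels[b]).sum()
            den += np.maximum(labels[a], labels[b]).sum()
    return num / den
```

Because numerator and denominator are pooled over all pairs, labels and voxels before division, large well-registered structures and small ones contribute in proportion to their size, rather than each label's ratio being averaged separately.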
The second method assesses registration in terms of the quality of a generative statistical appearance model, constructed from the registered images; for all the experiments reported here, this was an active appearance model (AAM). The idea is that a correct registration produces an anatomically meaningful dense correspondence between the set of images, resulting in a better appearance model. We define model quality using two measures: generalisation and specificity. Both are measures of overlap between the distribution of original images and a distribution of images sampled from the model. If we use the generative property of the model to synthesise a large set of $m$ images, $\{I^{\mathrm{s}}_j\}$, we can define Generalisation, $G$:
$$ G \;=\; \frac{1}{n} \sum_{i=1}^{n} \min_{j} \, d(I_i,\, I^{\mathrm{s}}_j) \qquad (2) $$
where $d(\cdot,\cdot)$ is a measure of distance between images, $I_i$ is the $i$th of the $n$ training images, and $\min_j$ denotes the minimum over the set of synthetic images $\{I^{\mathrm{s}}_j\}$. That is, Generalisation is the average distance from each training image to its nearest neighbour in the synthetic image set. A good model exhibits a low value of $G$, indicating that the model can generate images that cover the full range of appearances present in the original image set. Similarly, we can define Specificity, $S$:
$$ S \;=\; \frac{1}{m} \sum_{j=1}^{m} \min_{i} \, d(I^{\mathrm{s}}_j,\, I_i) \qquad (3) $$
That is, Specificity is the average distance of each synthetic image from its nearest neighbour in the original image set. A good model exhibits a low value of $S$, indicating that the model only generates synthetic images that are similar to those in the original image set. The uncertainty in estimating $G$ and $S$ can also be computed. In our experiments we have defined $d$ as the shuffle distance between two images. Shuffle distance is the mean of the minimum absolute difference between each pixel/voxel in one image and the pixels/voxels in a shuffle neighbourhood of radius $r$ around the corresponding pixel/voxel in a second image. When $r = 1$, this is equivalent to the mean absolute difference between corresponding pixels/voxels, but for larger values of $r$ the distance increases more smoothly as the misalignment of structures in the two images increases.
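The shuffle distance and the two model-quality measures can be sketched as follows. This is a minimal 2D sketch, assuming images are NumPy arrays of equal shape; a wrap-around shift is used at image borders for brevity, and the function names and signatures are ours:

```python
import numpy as np

def shuffle_distance(img_a, img_b, radius=1.5):
    """Mean over pixels of the minimum absolute difference between each
    pixel of img_a and the pixels of img_b strictly within `radius` of the
    corresponding position.  radius = 1 keeps only the corresponding pixel,
    so it reduces to the mean absolute difference."""
    best = np.full(img_a.shape, np.inf)
    r = int(np.floor(radius))
    for dy in range(-r, r + 1):
        for dx in range(-r, r + 1):
            if dy * dy + dx * dx >= radius * radius:
                continue  # keep only offsets strictly inside the radius
            # wrap-around shift of img_b; adequate away from the borders
            shifted = np.roll(np.roll(img_b, dy, axis=0), dx, axis=1)
            best = np.minimum(best, np.abs(img_a - shifted))
    return float(best.mean())

def generalisation(training, synthetic, radius=1.5):
    """G: average distance from each training image to its nearest
    neighbour in the synthetic image set (low is good)."""
    return float(np.mean([min(shuffle_distance(t, s, radius) for s in synthetic)
                          for t in training]))

def specificity(training, synthetic, radius=1.5):
    """S: average distance from each synthetic image to its nearest
    neighbour in the training image set (low is good)."""
    return float(np.mean([min(shuffle_distance(s, t, radius) for t in training)
                          for s in synthetic]))
```

With `radius=1.5` the neighbourhood is the full 3x3 block, so a one-pixel misalignment of a structure contributes nothing, while `radius=1` penalises it at full strength; this is the smoothing behaviour described above.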
The overlap-based and model-based approaches were validated and compared, using a dataset consisting of 36 transaxial mid-brain slices, extracted at equivalent levels from a set of T1-weighted 3D MR scans of different subjects. Eight manually annotated anatomical labels were used as the basis for the overlap method: L/R white matter, L/R grey matter, L/R lateral ventricle, and L/R caudate. The images were brought into alignment using an NRR algorithm based on MDL optimisation [2]. A test set of different mis-registrations was then created by applying smooth pseudo-random spatial warps (based on biharmonic Clamped Plate Splines) to the registered images. Each warp was controlled by 25 randomly placed knot-points, each displaced in a random direction by a distance drawn from a Gaussian distribution whose mean controlled the average magnitude of pixel displacement over the whole image. Ten different warp instantiations were generated for each image for each of seven progressively increasing values of average pixel displacement. Registration quality was measured, for each level of registration degradation, using several variants of each of the proposed assessment methods.
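The construction of the perturbed test set can be illustrated with a simplified stand-in. This sketch replaces the biharmonic Clamped Plate Splines of the experiment with a Gaussian-weighted interpolation of random knot displacements, and uses nearest-neighbour resampling, so it reproduces only the spirit of the perturbation scheme; all parameter names are ours:

```python
import numpy as np

def random_smooth_warp(image, n_knots=25, mean_disp=2.0, sigma=20.0, seed=0):
    """Smooth pseudo-random warp of a 2D image (simplified stand-in for
    the biharmonic Clamped Plate Spline warps used in the experiment).

    Each knot point is displaced in a random direction by a distance drawn
    from a Gaussian whose mean (mean_disp) controls the average pixel
    displacement; the dense displacement field is a Gaussian-weighted
    interpolation of the knot displacements.
    """
    rng = np.random.default_rng(seed)
    h, w = image.shape
    knots = rng.uniform(0, [h, w], size=(n_knots, 2))        # random knot positions
    angles = rng.uniform(0, 2 * np.pi, n_knots)              # random directions
    mags = rng.normal(mean_disp, 0.25 * mean_disp, n_knots)  # Gaussian magnitudes
    disp = mags[:, None] * np.stack([np.sin(angles), np.cos(angles)], axis=1)

    yy, xx = np.mgrid[0:h, 0:w].astype(float)
    grid = np.stack([yy, xx], axis=-1).reshape(-1, 2)
    d2 = ((grid[:, None, :] - knots[None, :, :]) ** 2).sum(-1)
    wts = np.exp(-d2 / (2 * sigma ** 2))
    wts /= wts.sum(axis=1, keepdims=True) + 1e-12            # normalised RBF weights
    field = (wts @ disp).reshape(h, w, 2)

    # nearest-neighbour resampling at the displaced positions
    ys = np.clip(np.rint(yy + field[..., 0]).astype(int), 0, h - 1)
    xs = np.clip(np.rint(xx + field[..., 1]).astype(int), 0, w - 1)
    return image[ys, xs]
```

Sweeping `mean_disp` over progressively larger values and drawing several seeds per image mirrors the ten instantiations per image at each of the seven displacement levels.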
The results of the validation experiment are shown in Figure 1. Note that the overlap score is expected to decrease with increasing perturbation of the registration, whilst Generalisation and Specificity are expected to increase. All three metrics are generally well-behaved and show a monotonic response to increasing perturbation. This validates the model-based measures of registration quality, which are shown both to change monotonically with increasing perturbation of the registration and to correlate with the gold-standard approach based on manually annotated ground truth.
The results for the different values of the shuffle radius $r$, for both Generalisation and Specificity, and for the overlap score all demonstrate monotonic behaviour with increasing perturbation, but the slopes and errors vary systematically. This affects the size of perturbation that can be detected. To make a quantitative comparison of the different methods, we define the sensitivity as a function of perturbation as

$$ s(d) \;=\; \frac{|\,Q_d - Q_0\,|}{d\,\bar{\sigma}} $$

where $Q_d$ is the quality measured for a given value of displacement $d$, $Q_0$ is the measured quality at registration, $d$ is the degree of deformation and $\bar{\sigma}$ is the mean error in the estimate of $Q$ over the range.
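Under this definition, sensitivity is the change in the quality measure from its value at registration, per unit deformation, in units of the mean estimation error; the computation is a one-liner. A sketch with hypothetical variable names:

```python
def sensitivity(q_d, q_0, d, sigma_bar):
    """Sensitivity of a registration-quality measure at perturbation d:
    the absolute change from the value at registration (q_0) to the value
    at displacement d (q_d), per unit deformation, in units of the mean
    error sigma_bar of the estimate of the measure."""
    return abs(q_d - q_0) / (d * sigma_bar)
```

Dividing by $\bar{\sigma}$ puts measures with very different natural scales (overlap scores versus shuffle distances) on a common footing, which is what allows the direct comparison in Figure 2.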
Sensitivity, averaged over the range of perturbations shown in Figure 1, is plotted in Figure 2 for all the methods of assessment. This shows that the Specificity measure with shuffle radius 1.5 or 2.1 is the most sensitive of the measures studied, and that this difference is statistically significant.