The results of the validation experiment reported in Section are the most important outcome of the work presented here. They demonstrate a causal relationship between our Specificity and Generalisation measures and a mean pixel displacement that is known up to an additive constant. A strong correlation between these model-based measures and a Generalised Overlap measure, based on ground truth, adds further weight to this interpretation. The fact that the relationship with mean pixel displacement held over many different instantiations of a very general class of perturbing warps makes it unlikely (though not impossible) that there is any significant pattern dependence.
The results obtained with added noise are also encouraging, since it is a reasonable concern that the use of an intensity-based distance measure might make the model-based measures sensitive to noise. In the event, the approach seems robust to quite significant levels of noise. The fact that the absolute values of Specificity and Generalisation change when noise is added means that they would not be useful for comparing registration results across different image sets. Their ability to compare the performance of different registration algorithms applied to the same set of images, which is the main intended use, is, however, unaffected.
Our results comparing the performance of different registration algorithms demonstrate that the model-based measures, and Specificity in particular, are sufficiently sensitive to misregistration to provide useful discrimination in a practical setting. There is, however, a potential concern that it is important to address. It might be argued that using a model-based approach to assessing registration favours methods which use a model-based objective function for registration (as in the experiments reported here). In practice, we do not believe that this is a problem.
First, as we have argued above, our validation results show that there is a causal relationship between the mean pixel displacement and Specificity/Generalisation. It is thus irrelevant how a registration (or misregistration) has been obtained. Second, the MDL objective function we optimise in our model-based registration method measures a quite different property of the model from those we use in evaluation, so there is no element of 'self-fulfilling prophecy'. In an ideal world it would, of course, be preferable to avoid even the possibility of bias, though it seems unlikely that one could devise a strategy for evaluation that had no relevance to achieving a good registration in the first place. We hope that, in due course, other ground-truth-free methods of evaluation will be developed, allowing a multi-perspective assessment of performance.
One obvious limitation of our approach to evaluation is that it can only be applied to groups of images. This could be considered an important restriction, since many practical applications involve registration of pairs or very short temporal sequences of images. We would argue that, in fact, this is a necessary restriction, because it is only possible to arrive at a meaningful assessment of registration in the context of a population of images.
The experiments we have reported were performed in 2D to limit the computational cost of running the large-scale evaluation for a range of parameter values and with repeated measurements. The extension to 3D is, however, trivial, though the calculation of shuffle distance for 3D images increases the computational cost significantly. We have implemented the method in 3D and the time taken to evaluate the registration of 100 190x190x50 images using a shuffle radius of 2.1 and 104#104 is around 62.5 hours on a modern PC, which is short compared to most registration algorithms.
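To make the growth in cost from 2D to 3D concrete, the following is a minimal sketch. It assumes that the shuffle distance takes, for each pixel of one image, the minimum absolute intensity difference to any pixel of the other image lying within the shuffle radius of the same position, and then averages over pixels. The function names and the dictionary-based image representation are illustrative only; this is not the implementation used for the timing quoted above, which would in practice rely on an array library.

```python
from itertools import product


def neighbourhood_offsets(radius, ndim):
    """Integer offsets whose Euclidean length is at most `radius`."""
    reach = int(radius)
    return [
        off
        for off in product(range(-reach, reach + 1), repeat=ndim)
        if sum(c * c for c in off) <= radius * radius
    ]


def shuffle_distance(image_a, image_b, radius):
    """Assumed definition: mean over pixels of A of the smallest absolute
    intensity difference to any pixel of B within `radius` of that position.

    Images are dicts mapping coordinate tuples to intensities, so the same
    code handles 2D and 3D. Neighbours outside the image are skipped.
    """
    ndim = len(next(iter(image_a)))
    offsets = neighbourhood_offsets(radius, ndim)
    total = 0.0
    for pos, value in image_a.items():
        total += min(
            abs(value - image_b[nb])
            for off in offsets
            for nb in [tuple(p + o for p, o in zip(pos, off))]
            if nb in image_b
        )
    return total / len(image_a)


def from_rows(rows):
    """Hypothetical helper: turn a 2D nested list into a coordinate dict."""
    return {(r, c): v for r, row in enumerate(rows) for c, v in enumerate(row)}


# With a radius of 2.1 each pixel is compared against 13 positions in 2D
# but 33 positions in 3D, before accounting for the larger voxel count.
offsets_2d = len(neighbourhood_offsets(2.1, 2))
offsets_3d = len(neighbourhood_offsets(2.1, 3))

# A bright pixel shifted by one position: plain pixelwise difference is
# penalised, whereas a shuffle radius of 1 absorbs the shift completely.
a = from_rows([[0, 0, 0], [0, 9, 0], [0, 0, 0]])
b = from_rows([[0, 0, 0], [0, 0, 9], [0, 0, 0]])
no_shuffle = shuffle_distance(a, b, 0)    # 2.0
with_shuffle = shuffle_distance(a, b, 1)  # 0.0
```

The per-pixel work scales with the number of offsets inside the shuffle neighbourhood, which is why the same radius is noticeably more expensive in 3D even for a fixed number of voxels.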
There are a number of issues that merit further investigation. We have studied a particular method of measuring image separation, but others, such as local correlation, would be worth exploring. Another interesting issue is whether it is possible within this framework to localise registration errors. We have performed some initial experiments, summing the shuffle difference maps between all pairs of images in the registered set. This gives some interesting results, highlighting areas of common misregistration, but it is not clear what quantitative interpretation could be placed on such maps. Finally, it is clear that our current measures of Specificity and Generalisation are not normalised: their values depend on the size of the set of registered images, the number of synthetic images generated, and so on. We are currently exploring the possibility of measuring more fundamental properties of the relationship between the real and synthetic image distributions, with a view to achieving a 'natural' normalisation.
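The dependence on set size can be illustrated with a small sketch. It assumes the nearest-neighbour form of the two measures: Specificity as the mean distance from each synthetic image to its closest registered image, and Generalisation as the mean distance from each registered image to its closest synthetic image. The function names, the stand-in distance (mean absolute difference rather than shuffle distance) and the toy two-pixel "images" are all hypothetical and chosen purely for illustration.

```python
def mean_abs_distance(a, b):
    """Stand-in image distance: mean absolute intensity difference."""
    return sum(abs(x - y) for x, y in zip(a, b)) / len(a)


def specificity(training, synthetic, distance=mean_abs_distance):
    """Mean distance from each synthetic image to its nearest training image."""
    return sum(
        min(distance(s, t) for t in training) for s in synthetic
    ) / len(synthetic)


def generalisation(training, synthetic, distance=mean_abs_distance):
    """Mean distance from each training image to its nearest synthetic image."""
    return sum(
        min(distance(t, s) for s in synthetic) for t in training
    ) / len(training)


# Toy flattened images; the values are arbitrary, not experimental data.
training = [[0, 0], [10, 10]]
few_synthetic = [[4, 4]]
more_synthetic = [[4, 4], [9, 9]]

g_few = generalisation(training, few_synthetic)    # 5.0
g_more = generalisation(training, more_synthetic)  # 2.5
s_few = specificity(training, few_synthetic)       # 4.0
s_more = specificity(training, more_synthetic)     # 2.5
```

Under these assumed definitions, Generalisation can only stay the same or decrease as more synthetic images are drawn, because each minimum is taken over a larger set. The value therefore reflects the sample size as well as the quality of the registration, which is the sense in which the measures are not normalised.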