We have described a model-based approach to evaluating the results of NRR of a group of images. The most important advantage of the new method is that it does not require any ground truth, but depends solely on the registered images themselves.
We have validated the approach by progressively perturbing the registration of an initially registered set of images and comparing the results with those obtained using a 'gold standard' measure based on the overlap of ground-truth anatomical labels. We have shown that our new method provides measures of registration accuracy that are monotonic functions of the known misregistration, and that one of them, specificity, provides a more sensitive measure of misregistration than the approach based on ground truth. The model-based approach requires a distance measure in image space, and we have also demonstrated that the use of shuffle distance, rather than Euclidean distance, improves the sensitivity of the approach.
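To make the role of the distance measure concrete, the following is a minimal, unoptimised sketch of a shuffle distance between two equally sized 2D images. It assumes the commonly used form: for each pixel in the first image, take the minimum absolute intensity difference over a neighbourhood of the corresponding pixel in the second image, then average over all pixels. The function name `shuffle_distance`, the square (rather than circular) window, and the list-of-lists image representation are illustrative assumptions, not the implementation used in our experiments.

```python
def shuffle_distance(a, b, radius=1):
    """Mean, over pixels of `a`, of the minimum absolute difference to any
    pixel of `b` inside a square window of half-width `radius` centred on
    the corresponding position. Assumes `a` and `b` have the same shape.
    Note that this form is not symmetric in `a` and `b`."""
    rows, cols = len(a), len(a[0])
    total = 0.0
    for y in range(rows):
        for x in range(cols):
            best = float("inf")
            # Window is clipped at the image border (an assumption).
            for yy in range(max(0, y - radius), min(rows, y + radius + 1)):
                for xx in range(max(0, x - radius), min(cols, x + radius + 1)):
                    best = min(best, abs(a[y][x] - b[yy][xx]))
            total += best
    return total / (rows * cols)


# A single bright pixel, shifted one position to the right in `b`.
a = [[0, 0, 0], [0, 9, 0], [0, 0, 0]]
b = [[0, 0, 0], [0, 0, 9], [0, 0, 0]]

d0 = shuffle_distance(a, b, radius=0)  # plain mean absolute difference
d1 = shuffle_distance(a, b, radius=1)  # tolerant of a one-pixel shift
```

With `radius=0` the sketch reduces to a pixel-wise mean absolute difference, which penalises the one-pixel shift at two locations. With `radius=1` the shift is absorbed by the neighbourhood search and the distance falls to zero.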
We have further validated the approach and illustrated its application by performing a comparative evaluation of the results obtained using three different NRR algorithms, demonstrating the superiority of a fully-groupwise algorithm over a repeated pairwise approach.
The experiments were performed in 2D to limit the computational cost of running a large-scale evaluation for a range of parameter values and with repeated measurements. The extension to 3D is, however, trivial, although calculating shuffle distance for 3D images increases the computational cost significantly.
In the experiments we have reported we used linear appearance models in the evaluation, but any generative model-building approach could, in principle, be used. It is important to emphasise that the method is not restricted to evaluating model-based NRR algorithms, though we presented results for one such approach; our model-based measures of registration accuracy can be applied to any set of non-rigidly registered images, however they were obtained.
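The independence of the measures from the choice of model can be illustrated with a short sketch. It assumes the usual definitions from the statistical model literature: specificity is the mean distance from each model-generated sample to its nearest training image, and generalisation is the mean distance from each training image to its nearest sample. The function names, the flattened-list image representation, and the default Euclidean distance are illustrative assumptions; normalisation details may differ from those used in our experiments.

```python
import math


def euclidean(a, b):
    """Euclidean distance between two images given as flat sequences."""
    return math.sqrt(sum((p - q) ** 2 for p, q in zip(a, b)))


def specificity(samples, training, dist=euclidean):
    """Mean distance from each synthetic sample to its nearest training image."""
    return sum(min(dist(s, t) for t in training) for s in samples) / len(samples)


def generalisation(samples, training, dist=euclidean):
    """Mean distance from each training image to its nearest synthetic sample."""
    return sum(min(dist(t, s) for s in samples) for t in training) / len(training)


# Toy two-pixel "images": `training` stands in for the registered set and
# `samples` for images drawn from any generative model built from that set.
training = [[0, 0], [3, 4]]
samples = [[0, 0], [3, 0]]

spec = specificity(samples, training)
gen = generalisation(samples, training)
```

Neither function refers to how the samples were produced or how the training images were registered, so any generative model and any image-space distance (passed via `dist`) can be substituted. Under this convention, smaller values indicate a better model and hence a better registration.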
At first sight, the result that one of the model-based measures is more sensitive than the method based on the overlap of ground-truth labels seems counter-intuitive. On further reflection, however, this is not so surprising, since the model-based approach uses the full intensity image, which provides a far richer description of local alignment than the relatively featureless label images.
Overall, we believe that our method provides a powerful approach to evaluating NRR methods, allowing subtle differences to be detected without the need for any additional information. This should prove valuable both in helping to guide the development of new NRR methodology and in providing quality control in routine applications of NRR.