To illustrate the application of model-based evaluation in practice, we compared the NRR results obtained using three different methods for registering a group of images, as described in more detail below. We wished to establish whether it was possible, in a practical setting, to detect significant differences in performance between different NRR algorithms. All three registration methods used the same piecewise affine representation of image warps and the same multi-resolution optimisation framework. The same number of iterations (function evaluations) was used in each case.
We applied the three registration algorithms to two datasets. The MGH Dataset was used because it allowed the evaluation results obtained using Specificity and Generalisation to be compared with an evaluation based on the Generalised Overlap measure (using ground truth). For these experiments, 500 synthetic images were used to estimate Specificity and Generalisation. The Dementia Dataset was used because it was more representative of a typical clinical study, and we wished to demonstrate that our results were not dataset-specific. For these experiments we used 1000 synthetic images.
The three registration methods we used were as follows.