Assessing the Accuracy of Non-Rigid Registration With and Without Ground Truth

R. S. Schestowitz^{1}, W. R. Crum^{2}, V. S. Petrovic^{1}, C. J. Twining^{1}, T. F. Cootes^{1} and C. J. Taylor^{1}

^{1} University of Manchester, Stopford Building, Oxford Road, Manchester M13 9PT, United Kingdom

^{2} Computer Science, University College London, Gower Street, London WC1E 6BT, United Kingdom

Non-rigid registration (NRR) has, over the past few years, been used increasingly as a basis for medical image analysis. It has been proposed for registering both pairs and groups of images. The problem is highly under-constrained, and the many algorithms now available will, in general, produce different results for the same set of images. We present two methods for assessing the performance of non-rigid registration algorithms. The first measures the overlap of ground-truth labels between the registered images using Tanimoto's formulation. The second assesses registration via the quality of a generative statistical appearance model constructed from the registered images, using the concepts of model specificity and generalisation, and requires no ground truth. We compare the two methods and show them to be in agreement, with the latter being the more sensitive.

The first method relies on the existence of ground-truth data, such as the boundaries of image structures produced by manual markup of distinguishable points. Once an image set has been registered, the method measures the overlap between the annotated structures, which provides a direct measure of registration quality.

Our second method assesses registration without ground truth of any form. The approach constructs an appearance model automatically from the registered data, then evaluates the quality of that model using the images it synthesises. The quality of a registration is tightly related to the quality of the resulting model, because model construction and image registration are at heart the same task: both involve identifying corresponding points, known as landmarks in the context of model-building. Expressed differently, a registration produces a dense set of correspondences, and an appearance model is built from the images together with exactly those correspondences.

To test the validity of both methods, we assembled a set of 38 2-D MR images of the brain. Each image was carefully annotated to identify different compartments within the brain. These anatomical compartments serve as simplified labels that faithfully define the structure of the brain. Our first method of assessment uses the Tanimoto overlap measure to calculate the degree to which labels agree across the image set. In that respect, it exploits ground truth, identified by an expert, to reason about registration quality.
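As a minimal sketch, assuming one binary mask per anatomical label for each image, all resampled into the common reference frame, the crisp Tanimoto overlap and its average over image pairs can be computed as below (Python/NumPy; the names and the pairwise-averaging scheme are illustrative, and a fuzzy multi-label generalisation is equally possible):

<code python>
import numpy as np

def tanimoto_overlap(label_a, label_b):
    """Tanimoto (Jaccard) overlap between two binary label masks."""
    a = label_a.astype(bool)
    b = label_b.astype(bool)
    union = np.logical_or(a, b).sum()
    if union == 0:          # neither image contains this label
        return 1.0
    return np.logical_and(a, b).sum() / union

def mean_pairwise_overlap(masks):
    """Average Tanimoto overlap over all image pairs for one label.

    masks: one binary mask per registered image, all resampled into
    the common reference frame produced by the registration.
    """
    scores = [tanimoto_overlap(masks[i], masks[j])
              for i in range(len(masks))
              for j in range(i + 1, len(masks))]
    return float(np.mean(scores))
</code>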

The second method takes an entirely different approach. It takes as input the results of a registration algorithm, in which correspondences have been established, and builds an appearance model from the images and their correspondences. From that model, many synthetic brain images are generated. Vectorising these images allows us to embed them in a high-dimensional space. We can then compare the cloud that the synthetic images form with the cloud formed by the original image set – the set from which the model was built. The overlap between these clouds gives insight into the quality of the registration; simply put, this is a model-fit evaluation paradigm. The better the registration, the greater the overlap between the clouds.
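Our appearance models encode both shape and texture through the correspondences; the sketch below shows only the simplest linear (PCA) part over vectorised intensities, with illustrative names, to make the "sample from the model, then compare clouds" idea concrete:

<code python>
import numpy as np

def build_linear_model(images, n_modes=10):
    """PCA model over vectorised, registration-normalised images.

    images: (n_samples, n_pixels) array; each row is one registered
    image flattened into a vector.
    """
    mean = images.mean(axis=0)
    centred = images - mean
    # SVD yields the principal modes and per-mode standard deviations.
    _, s, vt = np.linalg.svd(centred, full_matrices=False)
    std = s[:n_modes] / np.sqrt(len(images) - 1)
    return mean, vt[:n_modes], std

def sample_model(mean, modes, std, n_synthetic=1000, seed=None):
    """Draw synthetic images from the model's Gaussian latent space."""
    rng = np.random.default_rng(seed)
    b = rng.standard_normal((n_synthetic, len(std))) * std
    return mean + b @ modes
</code>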

To compute the overlap between two clouds of data, we have devised measures that we refer to as Specificity and Generalisation. The former measures how well the model fits the data it was built from, whereas the latter measures how well that data fits the derived model. It is a reciprocal relationship that 'locks' data to its model and vice versa. We calculate Specificity and Generalisation by measuring distances in the embedding space. As we seek a distance measure that is tolerant of slight anatomical differences, we use the shuffle distance, comparing it against the Euclidean distance as a baseline. The shuffle distance compares each pixel in one image with a larger corresponding region in the other image, keeping the best match, i.e. the pair of pixels whose difference is minimal.
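A minimal sketch of both measures, assuming images are 2-D float arrays (vectors sampled from the model would be reshaped back to image form); the exact normalisation used in our experiments may differ, and with radius=0 the shuffle distance reduces to a plain per-pixel comparison:

<code python>
import numpy as np

def shuffle_distance(img_a, img_b, radius=2):
    """Mean, over pixels of img_a, of the best absolute intensity match
    within a (2*radius+1)^2 neighbourhood of the same pixel in img_b."""
    best = np.full(img_a.shape, np.inf)
    for dy in range(-radius, radius + 1):
        for dx in range(-radius, radius + 1):
            # np.roll wraps at the borders; acceptable for a sketch.
            shifted = np.roll(np.roll(img_b, dy, axis=0), dx, axis=1)
            best = np.minimum(best, np.abs(img_a - shifted))
    return float(best.mean())

def specificity(train, synth, radius=2):
    """How well the model fits its seminal data: mean distance from
    each synthetic image to its nearest real (training) image."""
    return float(np.mean([min(shuffle_distance(s, t, radius) for t in train)
                          for s in synth]))

def generalisation(train, synth, radius=2):
    """How well the data fits the model: mean distance from each real
    image to its nearest synthetic image."""
    return float(np.mean([min(shuffle_distance(t, s, radius) for s in synth)
                          for t in train]))
</code>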

Our assessment framework, with which we test both methods, uses non-rigid registration, in which image transformations have many degrees of freedom. To generate data over which our hypotheses can be tested systematically, we perturb the brain data using clamped-plate splines, which are diffeomorphic. In the brain data we use, the correspondences among images are correct by construction, so perturbation can only degrade them. We wish to show that, as the degree of perturbation increases, the measures produced by our registration assessment methods degrade accordingly.
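The clamped-plate-spline machinery itself is beyond a short sketch; as a stand-in, the code below perturbs an image with a heavily smoothed random displacement field rescaled to a target mean pixel displacement. For small magnitudes this is smooth and invertible in practice, but it is not a true clamped-plate spline, and the smoothing scale is illustrative:

<code python>
import numpy as np
from scipy.ndimage import gaussian_filter, map_coordinates

def perturb(image, magnitude, rng=None):
    """Warp image with a smooth random displacement field whose mean
    absolute displacement is `magnitude` pixels (a stand-in for the
    clamped-plate-spline perturbations used in the experiments)."""
    rng = np.random.default_rng(rng)
    field = rng.standard_normal((2, *image.shape))
    field = gaussian_filter(field, sigma=(0, 8, 8))      # smooth spatially only
    field *= magnitude / (np.abs(field).mean() + 1e-12)  # set mean displacement
    ys, xs = np.mgrid[0:image.shape[0], 0:image.shape[1]]
    coords = np.array([ys + field[0], xs + field[1]])
    return map_coordinates(image, coords, order=1, mode='nearest')
</code>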

In an extensive batch of experiments, we perturbed the dataset at progressively increasing levels, producing controlled mis-registration of the data. We repeated each experiment 10 times to demonstrate that both approaches to assessment are consistent and that the results are unbiased. Plotting the measured overlap against the extent of perturbation, we observe an approximately linear decrease in overlap (Figure X). This means that when registration is degraded, the overlap-based measure detects it, and its response is well-behaved, hence meaningful and reliable.
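Tying the sketches together, the protocol looks roughly as follows; label_masks is a hypothetical stand-in for the expert masks of one label (stored as float arrays), and in the real experiments the same warp is applied to each image and its labels:

<code python>
import numpy as np

# Stand-in for the 38 expert masks; real data would be loaded here.
label_masks = [np.zeros((128, 128)) for _ in range(38)]

levels = np.linspace(0.0, 4.0, 9)     # mean pixel displacements (illustrative)
n_repeats = 10
overlaps = np.zeros((len(levels), n_repeats))
for i, d in enumerate(levels):
    for rep in range(n_repeats):
        # Warp every mask with an independent smooth field of magnitude d.
        warped = [perturb(m, d, rng=rep * 1000 + k)
                  for k, m in enumerate(label_masks)]
        overlaps[i, rep] = mean_pairwise_overlap(warped)
mean_curve = overlaps.mean(axis=1)    # the curve plotted in Figure X
error_bars = overlaps.std(axis=1)
</code>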

<Graphics file: ./Graphics/1.eps>

  <Graphics file: ./Graphics/2.eps>
        Figures X&Y. The measured quality of registration as perceived
        by the overlap-based evaluation (left) and the model-based
        evaluation (right).

We then undertake the same assessment task using the method that does not require ground truth. We observe very similar behaviour (Figure Y), which is evidence that this method, too, is a powerful and reliable means of assessing the degree of mis-registration – or, conversely, the quality of registration.

As a last step, we compare the two assessment methods, identifying sensitivity as the most important factor. Sensitivity reflects our ability to distinguish a good registration from a worse one with confidence: the smaller the difference that can be detected reliably, the more sensitive the method. To calculate sensitivity, we compute the amount of change in terms of mean pixel displacement, that is, deviation from the correct solution, and relate it to the change in the assessor's value, be it overlap, Specificity or Generalisation. We must also take account of the error bars: there is both an inter-instantiation error and a measure-specific error, and the two must be composed carefully. The derivation of sensitivity can be expressed as follows:

S = |Δm / Δd̄| / σ_m

where m is the assessor's value (overlap, Specificity or Generalisation), d̄ is the mean pixel displacement from the correct solution, Δm/Δd̄ is the rate at which the measure changes as displacement grows, and σ_m is the composed error on m.
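A minimal sketch of this computation, assuming the per-level assessor values and their composed errors are already in hand (for instance mean_curve and error_bars from the loop above); composing the two error sources as a root mean square is an illustrative choice:

<code python>
import numpy as np

def sensitivity(displacements, values, errors):
    """Magnitude of the slope of the assessor's value against mean
    pixel displacement, divided by the composed error on that value."""
    slope = np.polyfit(displacements, values, 1)[0]     # change per unit displacement
    sigma = float(np.sqrt(np.mean(np.square(errors))))  # composed error (RMS; illustrative)
    return abs(slope) / sigma

# e.g. sensitivity(levels, mean_curve, error_bars) for the overlap measure
</code>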

<Graphics file: ./Graphics/3.eps>

        Figure Z. The sensitivity of registration assessment methods.

Figure Z suggests that, for essentially any choice of shuffle-distance neighbourhood, the method that does not require ground truth is more sensitive than the method that depends on it. Inspecting the trends of the curves closely, we observe that they are approximately parallel, which implies that the two methods are closely correlated.

In summary, we have presented two valid methods for assessing non-rigid registration. The methods are correlated in practice, but the principles they build upon are quite distinct, as are their prerequisites. Registration can be evaluated with or without ground-truth annotation, and the behaviour of our measures is consistent across distinct datasets, well-behaved, and sensitive. Both methods have been applied successfully to the assessment of non-rigid registration algorithms, and both led to the expected conclusions; that aspect of the work, however, is beyond the scope of this paper.
