

Measuring Model Quality

Correspondence-by-optimisation requires an objective function that provides a measure of model quality for a given set of correspondences. In early work, Kotcheff and Taylor [] proposed the use of the determinant of the shape covariance matrix ( $\det({\bf D}) = \prod\limits_{i=1}^{n} \lambda_{i}$) as a measure of model quality, arguing that a good correspondence should result in a model with a small distribution in shape space. They showed encouraging results but did not provide a sound theoretical justification for the approach.

There are other issues with this approach. The behaviour of the determinant is easy to illustrate with something as simple as a 2-D distribution of points: generate data from a 2-D distribution and observe the determinant as noise is added to the point positions. As the distribution spreads, the determinant grows accordingly, but a key issue is the need to prevent very small modes from driving the determinant towards zero. A solution to this is discussed later in this chapter.
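To illustrate this behaviour concretely, the following is a minimal sketch (in Python, assuming only NumPy; all names are illustrative rather than taken from any published implementation) that draws points from a 2-D Gaussian, perturbs them with increasing noise, and prints the determinant of the sample covariance matrix:

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(0)

# An anisotropic 2-D point distribution.
base = rng.multivariate_normal(mean=[0.0, 0.0],
                               cov=[[4.0, 0.0], [0.0, 0.25]],
                               size=500)

for noise_sd in [0.0, 0.1, 0.5, 1.0, 2.0]:
    noisy = base + rng.normal(scale=noise_sd, size=base.shape)
    cov = np.cov(noisy, rowvar=False)    # 2x2 sample covariance
    det = np.linalg.det(cov)             # product of its eigenvalues
    print(f"noise sd = {noise_sd:4.1f}   det(D) = {det:8.3f}")
\end{verbatim}

If one of the underlying modes has near-zero variance, the determinant itself is driven towards zero, which is exactly the problem mentioned above.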

The key deficiency of the determinant as an objective is that its use is somewhat arbitrary, so Davies et al. [] built on this work and proposed the MDL objective function. The basic idea behind MDL is that a good model ought to allow a simple description of the training set, in an information-theoretic sense. The MDL objective function estimates the information (in bits) required to encode two parts of a coding scheme: the parameters of the model, and the training set of shapes encoded using that model. The first part is related to the number of model modes (degrees of variation), while the second scales with the number of examples in the training set. When the training set is large, the latter component dominates the former, in which case using the determinant of the model covariance as an approximation can be justified, at least for a multivariate Gaussian [].

For one-dimensional models, the shapes and the model were encoded using an expression built from two parts: the description length needed to encode the value and accuracy of each model parameter, and the description length needed to encode the data itself using the model. To transmit a one-dimensional model one needs, following the Shannon codeword length [], a statistical model that assigns probabilities to discrete events. This was obtained by quantising the centred data set, yielding a set of discrete events. To code the parameters, their accuracy must also be transmitted so that a receiver of the message can decode it. The total code length was the sum of the lengths needed for the parameter accuracies and for the quantised parameter values; it was also necessary to code the data itself using the model. For the model itself, the code accumulates the following components:

  1. The accuracy of each parameter;
  2. The parameters (i.e. the quantised values being encoded).
The objective function considered a number of directions of the normal distribution, with the data aligned with respect to the mean shape. The directions were ordered by the magnitude of their eigenvalue (variance), and the objective function was the sum of the description lengths of each such direction in turn.
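As a sketch of that step (illustrative names only; each row of shapes is assumed to be one shape vector), the data can be centred on the mean shape and the directions ordered by variance as follows:

\begin{verbatim}
import numpy as np

def ordered_directions(shapes):
    """Centre the shape vectors on the mean shape and return the
    eigenvalues (variances) and eigenvectors of the covariance matrix,
    ordered by decreasing variance."""
    shapes = np.asarray(shapes, dtype=float)
    centred = shapes - shapes.mean(axis=0)      # align to the mean shape
    cov = np.cov(centred, rowvar=False)         # shape covariance matrix D
    eigvals, eigvecs = np.linalg.eigh(cov)      # symmetric eigen-decomposition
    order = np.argsort(eigvals)[::-1]           # largest variance first
    return eigvals[order], eigvecs[:, order]

# Example: 10 shapes, each described by 6 coordinates.
rng = np.random.default_rng(1)
variances, directions = ordered_directions(rng.normal(size=(10, 6)))
print(variances)    # per-direction variances, in decreasing order
\end{verbatim}

The per-direction description lengths derived below are then summed over these ordered directions.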

The MDL objective function has been described in abstract form so far, along with the main steps in its derivation; to formulate it symbolically (where $L$ denotes description length):

\begin{displaymath}
L_{total}=L_{model}+L_{data}\end{displaymath} (4.1)

Breaking it down further, let us suppose a full description of our model means specifying a set of $n_{p}$ parameters $\{\alpha_{n}:n=1,\ldots n_{p}\}$. To get a finite message length, we obviously can only transmit parameters to some finite number of decimal places (or to some finite accuracy), hence we also have to consider the accuracy of each parameter, call it $\delta_{n}$. The description length for the entire model can then be written as:

\begin{displaymath}L_{model} = \sum\limits_{n=1}^{n_{p}} (L_{parameter~accuracy}(\delta_{n}) + L_{parameter}(\alpha_{n})). \end{displaymath} (4.2)

This is the first of the two components, and its derivation is given here for completeness. The parameters in $L_{model}$ are derived and coded as follows.

It is possible, without loss of generality, to consider the case of just a single parameter and drop the subscript for clarity. What actually gets transmitted is not the raw parameter value $\alpha$, but the quantised value $\hat{\alpha}$, described to an accuracy $\delta$, that is:

\begin{displaymath}\hat{\alpha} = n\delta, \:\: n \in \mathbb{N}.\end{displaymath} (4.3)

We also consider that the possible values of the parameter are bounded:
\begin{displaymath}\alpha_{min} \le \hat{\alpha} \le \alpha_{max}. \end{displaymath} (4.4)

Since there is no prior knowledge in this case, the distribution of $\hat{\alpha}$ over this range is assumed to be flat, which gives the codeword length:


\begin{displaymath}
L_{\hat{\alpha}}=\ln(\frac{\alpha_{max}-\alpha_{min}}{\delta}).\end{displaymath} (4.5)

The receiver cannot decode the value of $\hat{\alpha}$ without also knowing the accuracy $\delta$, so the quantisation parameter $\delta$ is restricted to take the following form:


\begin{displaymath}
\delta=2^{\pm k}\end{displaymath} (4.6)

where $k\in\mathbb{N}$. This can be coded directly, with codeword length:


\begin{displaymath}
L_{\delta}=1+\vert\log_{2}\delta\vert\,\mathbf{bits}\approx1+\vert\ln\delta\vert\,\mathbf{nats}.\end{displaymath} (4.7)

$\vert\log_{2}(\delta)\vert$ is just the length of the integer $\vert k\vert$, while the additional bit encodes the sign of $k$. The total length for transmitting the parameters is then the sum of $L_{\hat{\alpha}}$ and $L_{\delta}$.
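Putting equations (4.3)-(4.7) together, a minimal sketch of the parameter code length (illustrative names; lengths in nats, using the same approximation for $L_{\delta}$ as equation (4.7)) is:

\begin{verbatim}
import numpy as np

def parameter_code_length(alpha, delta, alpha_min, alpha_max):
    """Code length, in nats, for transmitting one quantised parameter."""
    # (4.3): what is transmitted is the value quantised to accuracy delta.
    alpha_hat = np.round(alpha / delta) * delta

    # (4.5): flat distribution over [alpha_min, alpha_max], quantised to delta.
    L_alpha_hat = np.log((alpha_max - alpha_min) / delta)

    # (4.7): delta = 2^(+/-k), coded as |k| plus one sign bit
    # (approximated as 1 + |ln delta| nats, as in the text).
    L_delta = 1.0 + abs(np.log(delta))

    return alpha_hat, L_alpha_hat + L_delta

alpha_hat, L_param = parameter_code_length(alpha=1.37, delta=2.0 ** -4,
                                           alpha_min=0.5, alpha_max=10.0)
print(alpha_hat, L_param)
\end{verbatim}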

There is also the sum over the description lengths of the training examples (the second component of the coding formulation). Although this does not yield the absolute minimum length, it takes a form that is suitable for optimisation. The total description length, $L_{total}$, is the sum of the two parts, where for $N_{s}$ shapes, which I denote $\mathbf{S}$, the second term is:


\begin{displaymath}
L_{data}=\sum_{N=1}^{N_{s}}(L_{coding\, of\, data\, using\, model}(S_{N})).\end{displaymath} (4.8)

The training examples are encoded as follows. We will consider the case of 1-D centred and quantised data (with quantisation parameter $\Delta$), $\{\hat{y}_{i}:i=1,\ldots N_{s}\}$, which is encoded using a 1-D Gaussian $\rho(k,\hat{\alpha}) \propto \exp(-k^{2}/2\hat{\alpha}^{2})$. For a bin centred at $\hat{y}$, the associated probability is:


\begin{displaymath}
P(\hat{y})=\intop_{\hat{y}-\Delta/2}^{\hat{y}+\Delta/2}dk\,\rho(k,\hat{\alpha})\approx\frac{\Delta}{\hat{\alpha}\sqrt{2\pi}}\exp(-\hat{y}^{2}/2\hat{\alpha}^{2}).\end{displaymath} (4.9)

Numerically, this can be shown to be a good approximation for all values $\hat{\alpha}\geqslant2\Delta$, so $\alpha_{min}$ is henceforth taken to be $2\Delta$ and the code length of the data is


\begin{displaymath}
L_{data}=-\sum_{i=1}^{N_{s}}\ln P(\hat{y}_{i})=-N_{s}\ln\Delta+\frac{N_{s}}{2}\ln(2\pi\hat{\alpha}^{2})+\frac{1}{2\hat{\alpha}^{2}}\sum_{i=1}^{N_{s}}\hat{y}_{i}^{2}.\end{displaymath} (4.10)
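The approximation in (4.9) and the resulting code length (4.10) can be checked numerically; the following sketch (illustrative names, standard library plus NumPy only) compares the exact bin probability with the approximation and implements (4.10):

\begin{verbatim}
import numpy as np
from math import erf, exp, log, pi, sqrt

def bin_probability_exact(y_hat, alpha_hat, Delta):
    """The bin integral in (4.9): zero-mean Gaussian with s.d. alpha_hat
    integrated over a bin of width Delta centred at y_hat."""
    a = (y_hat - Delta / 2.0) / (alpha_hat * sqrt(2.0))
    b = (y_hat + Delta / 2.0) / (alpha_hat * sqrt(2.0))
    return 0.5 * (erf(b) - erf(a))

def bin_probability_approx(y_hat, alpha_hat, Delta):
    """The approximation in (4.9): density at the bin centre times bin width."""
    return (Delta / (alpha_hat * sqrt(2.0 * pi))) * exp(-y_hat**2 / (2.0 * alpha_hat**2))

Delta = 0.1
for alpha_hat in [0.2, 0.5, 1.0]:      # 0.2 corresponds to alpha_hat = 2 * Delta
    print(alpha_hat,
          bin_probability_exact(0.3, alpha_hat, Delta),
          bin_probability_approx(0.3, alpha_hat, Delta))

def data_code_length(y_hat, alpha_hat, Delta):
    """Code length of the quantised data under the model, following (4.10)."""
    y_hat = np.asarray(y_hat, dtype=float)
    N_s = y_hat.size
    return (-N_s * log(Delta)
            + 0.5 * N_s * log(2.0 * pi * alpha_hat**2)
            + np.sum(y_hat**2) / (2.0 * alpha_hat**2))
\end{verbatim}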

Then, it is possible to obtain the variance of the quantised data:


\begin{displaymath}
\alpha^{2}=\frac{1}{N_{s}}\sum_{i=1}^{N_{s}}\hat{y}_{i}^{2}\end{displaymath} (4.11)

and additionally, $\alpha_{max}=R$. In general, $\alpha$ differs from the nearest quantised value $\hat{\alpha}$, so $\hat{\alpha}=\alpha+d_{\alpha},\,\vert d_{\alpha}\vert\leq\frac{\delta}{2}$.

By averaging over an ensemble of different sets and assuming the distribution for $d_{\alpha}$ is flat over this range, it is found that:


\begin{displaymath}
\left\langle d_{\alpha}^{2}\right\rangle =\frac{\delta^{2}}{12},\qquad\frac{1}{\hat{\alpha}^{2}}\approx\frac{1}{\alpha^{2}}(1+\frac{\delta^{2}}{4\alpha^{2}})\end{displaymath} (4.12)

and


\begin{displaymath}
\ln\hat{\alpha}^{2}\approx\ln\alpha^{2}-\frac{\delta^{2}}{12\alpha^{2}}.\end{displaymath} (4.13)
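Both averaged approximations can be verified with a quick Monte Carlo check (a sketch with illustrative names, drawing the quantisation error $d_{\alpha}$ from the flat distribution assumed above):

\begin{verbatim}
import numpy as np

rng = np.random.default_rng(2)
alpha, delta = 1.5, 0.2

# Flat quantisation error over [-delta/2, delta/2].
d_alpha = rng.uniform(-delta / 2.0, delta / 2.0, size=1_000_000)
alpha_hat = alpha + d_alpha

# (4.12): <d_alpha^2> = delta^2 / 12, and the 1/alpha_hat^2 approximation.
print(np.mean(d_alpha**2), delta**2 / 12.0)
print(np.mean(1.0 / alpha_hat**2),
      (1.0 / alpha**2) * (1.0 + delta**2 / (4.0 * alpha**2)))

# (4.13): the ln(alpha_hat^2) approximation.
print(np.mean(np.log(alpha_hat**2)),
      np.log(alpha**2) - delta**2 / (12.0 * alpha**2))
\end{verbatim}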

Finally, by combining these expressions, the description length of the data becomes


\begin{displaymath}
L_{data}=-N_{s}\ln\Delta+\frac{N_{s}}{2}\ln(2\pi\alpha^{2})+\frac{N_{s}}{2}+\frac{N_{s}\delta^{2}}{12\alpha^{2}}.\end{displaymath} (4.14)

and so


\begin{displaymath}
L_{total}=1+\ln(\frac{\alpha_{max}-\alpha_{min}}{\delta})+\vert\ln\delta\vert-N_{s}\ln\Delta+\frac{N_{s}}{2}\ln(2\pi\alpha^{2})+\frac{N_{s}}{2}+\frac{N_{s}\delta^{2}}{12\alpha^{2}}.\end{displaymath} (4.15)
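Collecting the terms of (4.15), a sketch of the full one-direction objective (illustrative names; $N_{s}$ examples of variance $\alpha^{2}$, data quantised to $\Delta$ and the parameter to $\delta$) is:

\begin{verbatim}
import numpy as np

def total_description_length(alpha, delta, Delta, N_s, alpha_min, alpha_max):
    """Total description length for one direction, following (4.15), in nats."""
    L_delta     = 1.0 + abs(np.log(delta))                    # (4.7)
    L_alpha_hat = np.log((alpha_max - alpha_min) / delta)     # (4.5)
    L_data      = (-N_s * np.log(Delta)
                   + 0.5 * N_s * np.log(2.0 * np.pi * alpha**2)
                   + 0.5 * N_s
                   + N_s * delta**2 / (12.0 * alpha**2))      # (4.14)
    return L_delta + L_alpha_hat + L_data

# alpha_min = 2 * Delta, as assumed above.
print(total_description_length(alpha=1.0, delta=2.0 ** -5, Delta=0.01,
                               N_s=50, alpha_min=0.02, alpha_max=5.0))
\end{verbatim}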

It is worth noting that the MDL objective function includes the term $\alpha^{2}$ (the variance), which is closely related to Kotcheff and Taylor's [] determinant of the model covariance matrix (the product of eigenvalues, $\det({\bf D}) = \prod\limits_{i=1}^{n} \lambda_{i}$). This variance-dependent part often dominates, and the determinant is much cheaper to compute. In essence, $\frac{N_{s}}{2}\ln(\alpha^{2})\rightarrow\frac{N_{s}}{2}\ln(\lambda)$ and, since $\lambda=\alpha^{2}$, for more directions (a multivariate rather than a 1-D Gaussian) the term would be


\begin{displaymath}
\sum_{i}\ln(\lambda_{i})\approx\ln(\det(\mathbf{D})).\end{displaymath} (4.16)

There are more details about this relationship in []. Since the determinant roughly measures the volume that the training set occupies in shape space, minimising it favours compact models. The description length associated with this determinant is


\begin{displaymath}
L_{det}=\ln(\det(\mathbf{D}))=\ln(\prod_{i=1}^{n}\lambda_{i})=\sum_{i=1}^{n}\ln(\lambda_{i}).\end{displaymath} (4.17)

The definition of the covariance matrix itself was given in Chapter 3 in the context of PCA, where $m$ was defined as the number of modes; here, $n$ is the number of modes entering the expression. The MDL objective function can be conveniently approximated by this determinant-based objective function, so my methods will be inspired by it. One of its deficiencies, as Chapter 5 will explain, is that the determinant, being a product, can be badly affected by tiny eigenvalues.
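A sketch of such a determinant-based objective is given below (illustrative names; the eigenvalue floor lambda_min is my own assumption here, standing in for the treatment of tiny eigenvalues discussed in Chapter 5):

\begin{verbatim}
import numpy as np

def log_det_objective(shapes, lambda_min=1e-6):
    """Sum of the log eigenvalues of the shape covariance matrix, as in (4.17).
    Eigenvalues below lambda_min are floored so that near-zero modes do not
    drive the logarithm of the determinant towards minus infinity."""
    shapes = np.asarray(shapes, dtype=float)
    centred = shapes - shapes.mean(axis=0)
    cov = np.cov(centred, rowvar=False)          # covariance matrix D
    eigvals = np.linalg.eigvalsh(cov)            # eigenvalues of D
    eigvals = np.maximum(eigvals, lambda_min)    # guard against tiny modes
    return np.sum(np.log(eigvals))               # ln det(D), approximately

rng = np.random.default_rng(3)
print(log_det_objective(rng.normal(size=(20, 8))))
\end{verbatim}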
