
Improving GMDS

While we don't have readily available code for GMDS with on-demand distance computation, the plan is to implement this. We do have code for efficient fast marching and a software cache that, combined and plugged into the old code (or better, the new C++ code), can do the job. At present, the way we use GMDS is a hybrid of the C++-based stress minimiser, the C implementation of FMM, and gmds.m as the wrapper that ties together those fast implementations of the key parts (substituting analogous MATLAB implementations with slightly different interfaces). Currently, the bottleneck is associated only with the number of sample points that make up the two surfaces. The MATLAB profiler might provide insight into which part exactly consumes the RAM/CPU and to what extent. Another concern, verified by others too, is that as the number of vertices increases, the program becomes more prone to freezes/crashes that are neither consistent nor easily reproducible. Carmi and I never looked too deeply into it, but overcoming these limitations would definitely make GMDS more capable and robust, not to mention suitable for a plethora of new applications where resolution is high (detail at the micro level, not the macro/topology level).
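To make the caching idea concrete, here is a minimal sketch in Python (not our MATLAB/C/C++ code), where Dijkstra on a toy graph stands in for fast marching on a mesh; all names and the graph itself are made up for illustration:

```python
import heapq
from functools import lru_cache

# Toy mesh as an undirected weighted graph: vertex -> [(neighbour, edge_length)].
# Dijkstra stands in for FMM here; the on-demand caching idea is the same.
GRAPH = {
    0: [(1, 1.0), (2, 2.0)],
    1: [(0, 1.0), (2, 1.0), (3, 3.0)],
    2: [(0, 2.0), (1, 1.0), (3, 1.0)],
    3: [(1, 3.0), (2, 1.0)],
}

@lru_cache(maxsize=1024)  # software cache: each source's distance field is computed once
def distance_field(source):
    dist = {v: float("inf") for v in GRAPH}
    dist[source] = 0.0
    heap = [(0.0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if d > dist[u]:
            continue
        for v, w in GRAPH[u]:
            nd = d + w
            if nd < dist[v]:
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return dist

def geodesic(a, b):
    # On-demand: only sources actually queried by the optimiser ever trigger a marching pass.
    return distance_field(a)[b]
```

The point of the sketch is only the access pattern: the stress minimiser asks for distances one pair at a time, and the cache ensures a full distance field is propagated at most once per queried source.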

It remains to be decided which language we want this implemented in. MATLAB would obviously not be as fast, but at present, considering the scale of the experiments and our access to two computational servers with 8 cores each, it would probably be simpler, and it would also work seamlessly across platforms since it is interpreted. Modifications have already been made to some GMDS code, e.g. to highlight stress in a Voronoi-style visualisation, so the task of handling MATLAB code is less daunting than educating oneself about the object/class system in the C++ case.

To summarise the situation, hitherto we have investigated the potency of GMDS as a fine-level similarity measure for pairs of surfaces that are carved around a central reference point with geodesic circles around it, or even several such points (the union of those). There are two unique aspects to this approach. One is the implicit assignment of weights to distance variation based on PCA, applied by studying local variation in a model-building stage. The other is the hybridisation of measurements, e.g. Euclidean and geodesic, using principal component analysis. In GMDS, the main drawbacks are scale and speed. To an extent, speed was improved significantly owing to the C++ implementation; however, it remained unclear how to handle a very dense tessellation of the surfaces - dense enough to preserve the amount of information inherent in the original range images. One caveat was the repeated computation of, and optimisation over, geodesic distances. By reducing the complexity of this process, e.g. by stochastic selection of points or by caching certain distances (perhaps propagating one onto another more efficiently, where appropriate), the process will, we hope, scale better and perhaps cope with even 50,000 triangles. At the moment, trying more than a few thousand leads to technical problems; this is reaffirmed by others who encountered the same bottlenecks. The limitation means that general topology can be determined very well, but at a very fine scale, where differences are minor and not topological, the problem needs to be downsized, i.e. the triangulation made considerably coarser. While we get promising recognition (pairing) results, for them to be close to perfect it is probably necessary to use the full information and not a sub-sampling of it. Right now we use about a tenth of the available entropy which tells one surface from the other.
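As a rough illustration of the PCA-weighting aspect, the following toy Python sketch (entirely made-up data and names, not our actual code) shows the gist: hybrid measurements (Euclidean, geodesic) are collected in a model-building stage, and the principal directions of their variation implicitly fix the weights when scoring a new measurement:

```python
import numpy as np

# Toy stand-in: each row holds a paired (Euclidean, geodesic) measurement
# between corresponding landmarks on training surfaces.
rng = np.random.default_rng(0)
euclidean = rng.normal(10.0, 1.0, size=(50, 1))
geodesic = euclidean + rng.normal(0.0, 0.1, size=(50, 1))  # highly correlated, as on real meshes
X = np.hstack([euclidean, geodesic])

# Model-building stage: PCA of the observed variation fixes the weights implicitly.
mean = X.mean(axis=0)
cov = np.cov(X - mean, rowvar=False)
eigvals, eigvecs = np.linalg.eigh(cov)  # eigenvalues in ascending order

def model_distance(sample, eps=1e-9):
    """Mahalanobis-style score of one hybrid measurement: each principal
    direction is weighted by the inverse of the variation seen in training."""
    z = eigvecs.T @ (np.asarray(sample) - mean)
    return float(np.sqrt(np.sum(z**2 / (eigvals + eps))))
```

A deviation along a direction that barely varied during training (here, the difference between the two distance types) is penalised heavily, while deviation along the main axis of natural variation is tolerated.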

Alex B. ran a few experiments with the ideas of software caching and on-demand computation of distance fields and was able to cope with around 10K vertices in MATLAB without particular difficulty. He thinks we ought to explore this direction. If we have some of the modified/hacked code around, even in very rudimentary form, then this is a reasonable path ahead. Maybe we can re-plug that in, tidy it up a bit, and then submit it for consideration for upstream inclusion.

According to institutional regulations, the end of October is the final date for your temporary employment. There are two options for continuing this project, one of which is to continue making gradual progress. We need outside support for the resolution issue, which is partly implementational. We have been speaking to Alex about resolution and are quite dependent on knowing whether we can get it done, because other paths of exploration - while viable - are not ideal given the trajectory we are after, namely PCA and GMDS combined, with performance decent enough to bring about something publishable.

The only limitation at the moment is the resolution, and it is not just a performance/duration issue if RAM constraints stand in the way (which they do). Had it been possible to run full-resolution experiments very slowly as a proof of concept, the practical benefits could be assured, i.e. it would be possible to show that, given the necessary optimisations, the required outcome will be good and also rapidly reachable. As attempts have already been made to code the necessary improvements, it would be unwise to code them from scratch again, which is why we waited for Alex and gently reminded him about it.

It ought to be added that all prior results and ROC curves also used low-resolution images because, in the case of PCA for instance, RAM was exhausted if the sample of observations, $I$, was too large for principal component analysis (covariance matrices get vast as their size grows quadratically with the dimensionality). Values of $I$ were selected based on a down-sampled and at times averaged representation of the surfaces. This served as a baseline and was consistent with the emulated method.
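For what it is worth, one standard way around the quadratic covariance growth is the so-called snapshot trick: when there are far fewer observations than dimensions, the eigendecomposition can be done on the small Gram matrix of the observations instead of the huge covariance matrix. A sketch (Python, illustrative only, with made-up data):

```python
import numpy as np

rng = np.random.default_rng(1)
n, d = 20, 5000          # few observations, very high-dimensional surface samples
X = rng.normal(size=(n, d))
Xc = X - X.mean(axis=0)  # centred data matrix

# The direct route needs the d x d covariance (d**2 entries, ~25M here).
# Snapshot trick: eigendecompose the small n x n Gram matrix instead.
G = (Xc @ Xc.T) / (n - 1)               # n x n
lam, U = np.linalg.eigh(G)              # ascending eigenvalues
lam, U = lam[::-1], U[:, ::-1]          # reorder to descending

# Recover the principal axes in the d-dimensional space: if G u = lam u,
# then Xc.T @ u is an eigenvector of the covariance with the same eigenvalue.
keep = lam > lam[0] * 1e-8              # drop the numerically-zero mode
axes = (Xc.T @ U[:, keep]) / np.sqrt(lam[keep] * (n - 1))  # d x k, unit columns
```

The non-zero eigenvalues of the Gram matrix and the covariance matrix coincide, so nothing is lost while memory drops from $O(d^2)$ to $O(n^2)$ plus the data itself.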

Dealing with multi-dimensional problems at such a high scale while still capturing all the fine details may require a multi-grid approach, and it is easy to envision how this would be implemented.

At an imminent point (the end of October) we have to change your status from temporary to research fellow. The cost is a bit higher and there are some local social benefits associated with it. Progress needs to be made and seen.

Main ROCs, or something of that sort, should have been our baseline. All ROC curves seen so far are produced from a recognition task on the order of magnitude of icons, i.e. 32x32 = ~1,000 sample points, where those sample points are either triangles or point clouds (i.e. 3-D coordinates, not just pixel values). The way we approach this problem is by adopting the supposition that we can encode surfaces compactly and then use these compact descriptions (e.g. geodesic distances) to tell surfaces apart. The problem in this case is clear: while it should be possible to sample a distance on the surface, without high resolution at hand the distances are sampled on a coarse grid and therefore become imprecise. To overcome this, density needs to come into play, such that the high-resolution image is used and then yields a shorter description. Moving from micro to macro would help here, not topology-wise but measurement-wise. Multi-grid for GMDS is something we would definitely like to explore. One possibility that would be easy to consider is as follows: we take an overview of the problem to get topological information. This can be done reliably in many different ways, even with GMDS. Then, by taking smaller chunks of the problem and applying GMDS to them, we can potentially perform measurements with the full resolution (more triangles) intact.
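The coarse-to-fine idea can be caricatured as follows (a toy Python sketch with plain nearest-neighbour matching standing in for GMDS; everything here is hypothetical): a cheap pass over a decimated set fixes approximate correspondences, and the full-resolution work is then confined to small chunks around each coarse anchor:

```python
import numpy as np

def coarse_to_fine_match(A, B, stride=10, radius=10.0):
    """Match every point of A to a point of B without a full N x M search.
    A coarse pass over B[::stride] gets the overview; the fine pass uses
    the full-resolution B but only inside each anchor's neighbourhood.
    The radius must be large enough to cover the coarse spacing."""
    coarse_B = B[::stride]                      # macro level: overview/topology
    matches = []
    for a in A:
        # Coarse pass: cheap search over the decimated set.
        anchor = coarse_B[np.argmin(np.linalg.norm(coarse_B - a, axis=1))]
        # Fine pass: full resolution, restricted to the anchor's chunk.
        near = B[np.linalg.norm(B - anchor, axis=1) < radius]
        matches.append(near[np.argmin(np.linalg.norm(near - a, axis=1))])
    return np.array(matches)
```

In the real setting the coarse pass would be GMDS on a decimated mesh and the fine pass GMDS on local patches, but the division of labour - topology at the macro level, measurement at the micro level - is the same.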

There are some issues with the current way we interpolate distances within GMDS. Distance interpolation may violate the metric properties of a distance. This could happen at a scale much smaller than is significant, yet for delicate comparisons it could be a pain. There are some remedies we are aware of (like replacing geodesic distances by diffusion distances, in which case interpolation is done on the eigenfunctions of the Laplace-Beltrami operator rather than the actual distance, so the metric property is always preserved).
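For reference, a minimal sketch of the diffusion-distance remedy (Python; a toy graph Laplacian stands in for the Laplace-Beltrami operator of a real mesh):

```python
import numpy as np

# Toy 4-vertex mesh graph; the adjacency stands in for mesh connectivity.
W = np.array([[0, 1, 1, 0],
              [1, 0, 1, 1],
              [1, 1, 0, 1],
              [0, 1, 1, 0]], dtype=float)
L = np.diag(W.sum(axis=1)) - W          # combinatorial graph Laplacian
lam, phi = np.linalg.eigh(L)            # eigenvalues and eigenfunctions

def diffusion_distance(i, j, t=0.5):
    """d_t(i, j)^2 = sum_k exp(-2 * lam_k * t) * (phi_k(i) - phi_k(j))^2.
    This is a Euclidean distance in the embedding x -> exp(-lam * t) * phi(x),
    so it remains a true metric even when the eigenfunctions are interpolated,
    which is exactly why interpolating them is safer than interpolating the
    geodesic distances themselves."""
    diff = phi[i] - phi[j]
    return float(np.sqrt(np.sum(np.exp(-2 * lam * t) * diff**2)))
```

Because each point maps to a point in a Euclidean embedding space, symmetry, non-negativity and the triangle inequality hold by construction, at any interpolated location.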

While diffusion distances would be more robust, if they take as their starting point something that is spaced too sparsely, they will fail to discern anatomical differences that are not just at sub-pixel scale but also at multi-pixel scale. The curse of dimensionality has always been our greatest enemy here, and perhaps it was de-emphasised all along. The way GMDS works quite strictly requires that this limitation be taken into account, at least in its current implementation. The fact that others suffered from, or at least encountered, this limitation (with implementation attempts ongoing) means that it is a real caveat which, if properly addressed, widens the applicability of GMDS to a plethora of new tasks.

There is still work to be done even with pure geodesics. The geodesic measures may be fantastic, but it is not their fault that they operate on a poor gridding system or a coarse mesh when handling a task where topology is almost always the same and the real changes are very minor.

Roy Schestowitz 2012-01-08