PCA and Projection

The pairing/match assessment phase is implemented with a standard principal component analysis (PCA), with no particular scaling applied, so the approach remains problem-domain-neutral. The program requires just over 1.8 GB to save a small model and about 3 GB of RAM/swap to run, depending on the size of the training set (the footprint is dominated by the amount of data held in memory rather than offloaded to disk). A redesign will be needed for better performance and for handling larger experiments because, as it stands, the computational servers already spill into the swap partition, which slows them down considerably.
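The kind of unscaled, domain-neutral PCA described above can be sketched as follows. This is a minimal NumPy illustration, not the program's actual code; all function names here are hypothetical.

```python
import numpy as np

def fit_pca(X, n_components):
    """Plain PCA: centre the data but apply no per-feature scaling,
    keeping the analysis problem-domain-neutral."""
    mean = X.mean(axis=0)
    Xc = X - mean
    # SVD of the centred data yields the principal axes directly
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)
    components = Vt[:n_components]                 # principal axes (rows)
    eigenvalues = (s[:n_components] ** 2) / (len(X) - 1)
    return mean, components, eigenvalues

def project(X, mean, components):
    """Coefficients of samples in the truncated PCA basis."""
    return (X - mean) @ components.T

# usage: 100 samples of 10-dimensional data, keep 3 components
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 10))
mean, comps, evals = fit_pca(X, 3)
coeffs = project(X, mean, comps)
```

Memory-wise, the SVD route requires the full centred data matrix in RAM, which is consistent with the footprint being dominated by in-memory data rather than what is offloaded to disk.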

Looking at the reference description of the implemented method (from which our work is conceptually derived), I found a few small errors in their paper, and given enough time I can think of approaches that would achieve better results. The use of PCA, for example, could be improved by undertaking a proper model-fitting task: a stage that treats the whole problem as one of optimisation, where the varied parameters are the high-ranked eigen-coefficients. The original work uses the projection, from which squared errors are extracted (Hotelling's T-squared statistic could be used instead). It is an overly optimistic approach that resorts to hacking around the problem with truncation, binary masks, or wholesale zero assignments (removal of whatever fits poorly, with no clear description of how, only why). It evidently gives some decent results, but there is reason to believe it is far from optimal.

They do mention the artefacts we too get at the borders, and they name a threshold in millimetres, but there are no formal, specific details about the method being used (this is not the only example of missing detail in the more opaque parts of the said methods). We can, in due time, reverse-engineer (so to speak) their pertinent set of algorithms, but for the time being there is some guessing and a generally incomplete pipeline, especially the part associated with polishing the residuals. It is similar to the problem we used to encounter, and then tackled, when pre-processing images; that problem is largely resolved now but still needs more polish (time-consuming and counter to measurable progress).
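The contrast drawn above, between the squared error extracted from the projection and Hotelling's T-squared statistic, can be made concrete. The sketch below assumes a generic truncated PCA basis and standard definitions; it is not the reference method's code, and the helper names are hypothetical.

```python
import numpy as np

def pca_basis(X, n_components):
    """Fit plain PCA; return the mean, truncated basis and eigenvalues."""
    mean = X.mean(axis=0)
    U, s, Vt = np.linalg.svd(X - mean, full_matrices=False)
    evals = (s ** 2) / (len(X) - 1)
    return mean, Vt[:n_components], evals[:n_components]

def squared_residual(x, mean, comps):
    """Squared error between a sample and its PCA reconstruction --
    the out-of-model distance extracted from the projection."""
    c = (x - mean) @ comps.T
    recon = mean + c @ comps
    return float(np.sum((x - recon) ** 2))

def hotelling_t2(x, mean, comps, evals):
    """Hotelling's T^2: in-model distance, with each coefficient
    weighted by the inverse of its eigenvalue."""
    c = (x - mean) @ comps.T
    return float(np.sum(c ** 2 / evals))

rng = np.random.default_rng(1)
X = rng.normal(size=(200, 8))
mean, comps, evals = pca_basis(X, 3)
x = X[0]
sr = squared_residual(x, mean, comps)
t2 = hotelling_t2(x, mean, comps, evals)
```

The squared residual ignores how a sample sits inside the model subspace, whereas T-squared measures exactly that; the two are complementary, which is one reason the purely residual-based approach may be "too optimistic".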

The code necessary to generate ROC curves is almost complete (projection as an objective function needs further development), but at that point it becomes just a brute-force routine, which would go to waste if there is still a lot of false signal (or noise) in the processed data. From a paradigmatic point of view, the important pieces are nearly in place: they are packaged in a user-friendly GUI and easy-to-follow functions, which are properly interfaced too (needed for script-based looping and automation). The syntactic side of things has room for improvement, e.g. vectorisation as a means of replacing loops that MATLAB will not optimise away.
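The brute-force ROC routine mentioned above amounts to sweeping every observed score as a threshold, and it also illustrates the vectorisation point: the threshold loop can be replaced by sorting and cumulative sums. This is a generic hedged sketch in NumPy, not the project's MATLAB code.

```python
import numpy as np

def roc_curve(scores, labels):
    """Brute-force ROC: treat each observed score as a threshold and
    record the (FPR, TPR) pair at each cut. Fully vectorised --
    no explicit loop over thresholds."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels, dtype=bool)
    order = np.argsort(-scores)          # descending score
    labels = labels[order]
    tp = np.cumsum(labels)               # true positives at each cut
    fp = np.cumsum(~labels)              # false positives at each cut
    tpr = tp / labels.sum()
    fpr = fp / (~labels).sum()
    return fpr, tpr

# usage: higher score should indicate the positive class
scores = np.array([0.9, 0.8, 0.7, 0.6, 0.55, 0.5, 0.4, 0.3])
labels = np.array([1, 1, 0, 1, 0, 0, 1, 0])
fpr, tpr = roc_curve(scores, labels)
```

The same cumulative-sum idiom (`sort` plus `cumsum` in MATLAB) is the usual way to remove the per-threshold loop that the interpreter would otherwise execute explicitly.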

Roy Schestowitz 2012-01-08