
C++ FMM Debugging

We have just managed to stick together the GIP code, the TAU FMM, and the C++ stress minimiser, with the imminent goal of adding the cache so that things scale more gracefully. There is still a bug, however, that sometimes leaves distances between points at their initial value of infinity, which causes GMDS to report an error (though at least not to crash). This requires more debugging.
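One way to catch the infinity problem before GMDS does is to scan the distances right after marching. The following is only a minimal sketch: it assumes the distances are available as a flat array of doubles, and the names unreachedIndices and requireAllReached are hypothetical helpers, not part of the FMM or GMDS code.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Hypothetical helper: returns the indices of all entries that are still
// at their initial value of infinity, i.e. were never updated by marching.
std::vector<std::size_t> unreachedIndices(const std::vector<double>& dist) {
    std::vector<std::size_t> bad;
    for (std::size_t i = 0; i < dist.size(); ++i) {
        if (std::isinf(dist[i])) {
            bad.push_back(i);
        }
    }
    return bad;
}

// Debug-build precondition check to run before handing distances to GMDS.
// It compiles to nothing when NDEBUG is defined.
void requireAllReached(const std::vector<double>& dist) {
    assert(unreachedIndices(dist).empty());
}
```

Printing the returned indices alongside the source vertex would show whether the unreached points cluster in one part of the mesh or are scattered, which may help to tell a marching bug from an initialisation bug.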

The causes of the crashes are now known, so the remaining task involves modifying the GMDS code to satisfy other preconditions (for instance in farthest point sampling). Moreover, since we wish to recalculate distances as few times as possible (so that everything scales properly), the separation between initialisation and marching needs to be thought through. The ultimate goal is a much faster and more scalable GMDS implementation, preferably backward-compatible with older interfaces. So far this has been a debugging problem that delves deep into the guts of GMDS.

A stability problem persists in the newly-compiled version of FMM, which causes segmentation faults at seemingly random points, during init, deinit, or march operations. When everything runs without crashes, the results are as expected. The crashes can also be reproduced using toy data, e.g. by running the program many times until a segmentation fault occurs. The changes made to get the code to compile are minor and should be quite neutral.

About 5 hours were spent pinpointing the cause of the crash on the GIP servers (both were tried in case one had an unlikely malfunction). Memory allocation on Linux systems (the GIP servers run Ubuntu) requires the casting to be changed in Ronnie Kimmel's adapted FMM files, which had been made more modular. The most mysterious thing about the crashes is that they are seemingly random: malloc'ing and pointing to a direct buffer works fine for a given size/number of vertices and triangles, but on some subsequent allocation (these cases occur at random, with no obvious pattern found so far) a segmentation fault is reported and the program stops. The error occurs when accessing an FMM object (accessing a private parameter through public functions, for example, but not necessarily just that). This is being investigated by displaying the addresses, as there is a strong suspicion that particular memory allocations misbehave with the old trick of casting pointers (to unsigned integer, not intptr_t), making particular addresses not addressable (or occupied by another program).

The whole process has required studying Ronnie's implementation, which was altered quite significantly and should probably be made compatible with the computational servers at the lab (unless another binary exists somewhere). It does generally yield the correct distances when it works, but the crashes prevent it from being testable in programs that recalculate distances (as they ought to). This slows down development of other parts of the programs, so it needs to be resolved first, not brushed under the rug (as tempting as that may be).
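The suspicion about casting can be illustrated in isolation. The sketch below is not taken from the FMM files; the function names are hypothetical, and it assumes a platform where pointers are 64 bits wide and unsigned int is 32 bits wide (as on 64-bit Linux). It contrasts passing a pointer through a 32-bit unsigned integer with passing it through std::uintptr_t from <cstdint>.

```cpp
#include <cassert>
#include <cstdint>

// The old trick: squeeze a pointer through a 32-bit unsigned integer.
// With 64-bit pointers this keeps only the low 32 bits of the address.
std::uint32_t toHandle32(const void* p) {
    return static_cast<std::uint32_t>(reinterpret_cast<std::uintptr_t>(p));
}

// Alternative: std::uintptr_t is wide enough to hold any object pointer.
std::uintptr_t toHandle(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p);
}

void* fromHandle(std::uintptr_t h) {
    return reinterpret_cast<void*>(h);
}

// True if the address would come back unchanged after the 32-bit cast.
bool survives32(const void* p) {
    return toHandle(p) == static_cast<std::uintptr_t>(toHandle32(p));
}

// Round trip through std::uintptr_t; the assert documents that the
// pointer is expected to come back identical.
void* roundTrip(void* p) {
    void* q = fromHandle(toHandle(p));
    assert(q == p);
    return q;
}
```

If the adapted files do store object pointers in a 32-bit integer, an object that malloc happens to place at a low address would survive the cast while one at a higher address would come back pointing somewhere else, which would look exactly like a random crash.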

We checked whether the program had been compiled for a GNU/Linux system before. The memory problem is being debugged by running several identical datasets in series, and the crashes are never predictably repeatable, because random parameters and new memory addresses are instantiated on every run. It might therefore make sense simply to rewrite this portion of the program, although that could inadvertently break other parts, with which we would need to become intimately familiar (knowledge of the code has improved vastly after many hours of debugging).

On a 64-bit Ubuntu server, creating an FMM object (via malloc) sometimes places it at an address that is 28 bits long (0x0*******), but when the address is fully 32 bits (e.g. 0xc3637a456) there is always a segfault. The problem has been narrowed down to this. It looks like an architectural issue and it limits all other work; the code needs to be runnable on the server.
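To correlate crashes with address width more systematically, each allocation could be logged together with a note on whether its address fits in 32 bits. This is a rough sketch only; fitsIn32Bits and logAddress are hypothetical names, and where to call them in the FMM code is left open.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <limits>

// True if the numeric value of the address can be represented in 32 bits.
bool fitsIn32Bits(const void* p) {
    return reinterpret_cast<std::uintptr_t>(p) <=
           std::numeric_limits<std::uint32_t>::max();
}

// Print the address to stderr with a note on its width, so that a crash
// log can be matched against the most recent allocation.
void logAddress(const char* label, const void* p) {
    assert(label != nullptr);
    std::fprintf(stderr, "%s: %p (%s)\n", label, const_cast<void*>(p),
                 fitsIn32Bits(p) ? "fits in 32 bits"
                                 : "needs more than 32 bits");
}
```

Calling logAddress immediately after each malloc of an FMM object, and again just before the first access, would show whether every segfault coincides with an address above the 32-bit range.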

Roy Schestowitz 2012-01-08