Friday, May 29, 2009

CFFI-Grovel Rocks!

This weekend, I was working on CL-MPI, my Common Lisp bindings for MPI, a standard interface for distributed-memory parallel programming based on message passing. I had previously developed and tested CL-MPI with MPICH 1.2, and I wanted to see how it worked with the latest version of MPICH2. Since the bindings are implemented with CFFI, I didn't expect any problems. However, the test suite started behaving in a very bizarre manner, crashing on some tests and freezing on others. I was stumped for a while, but the investigation was an eye-opener. I needed to use a struct defined as follows in MPICH 1.2:
typedef struct {
 int count;
 int MPI_SOURCE;
 int MPI_TAG;
 int MPI_ERROR;
#if (MPI_STATUS_SIZE > 4)
 int extra[MPI_STATUS_SIZE - 4];
#endif
} MPI_Status;
so I had declared the foreign struct in my binding source (line 65):
(cffi:defcstruct MPI_Status
 (count :int)
 (MPI_SOURCE :int)
 (MPI_TAG :int)
 (MPI_ERROR :int))
This specifies that MPI_Status is a C struct with int slots: the first int (4 bytes) of the struct is the count field, the next int is MPI_SOURCE, and so on. It is a straightforward translation of the C struct definition. This worked great -- until I tried to use the bindings with MPICH2, where the test suite produced the mysterious crashes and freezes mentioned above. Not a happy transition from MPICH 1.2 to MPICH2! After thrashing around trying to isolate the problem, I found that in MPICH2 the MPI_Status struct definition had been changed to:
typedef struct MPI_Status {
   int count;
   int cancelled;
   int MPI_SOURCE;
   int MPI_TAG;
   int MPI_ERROR;
  
} MPI_Status;
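Oops! No wonder everything got messed up -- note the new 'cancelled' field inserted after the count field. The fields of the struct no longer lined up with the defcstruct definition above: the binding was reading cancelled where it expected MPI_SOURCE, and every later field was shifted by one int.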
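To make the mismatch concrete, here is a minimal C sketch of the two layouts shown above. The names status_v1, status_v2, read_int_at and source_seen_by_old_binding are hypothetical and exist only for this illustration; the optional extra array from the MPICH 1.2 definition is left out.

```c
#include <stddef.h>
#include <string.h>

/* Layout from the MPICH 1.2 definition above (optional extra[] omitted).
   status_v1 is a hypothetical name used only in this sketch. */
typedef struct {
    int count;
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
} status_v1;

/* Layout from the MPICH2 definition above; status_v2 is also hypothetical. */
typedef struct {
    int count;
    int cancelled;
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
} status_v2;

/* Read an int at a fixed byte offset, which is essentially what a
   hand-written foreign struct declaration makes the binding do. */
int read_int_at(const void *base, size_t offset) {
    int value;
    memcpy(&value, (const char *)base + offset, sizeof value);
    return value;
}

/* What a binding written against the old layout takes to be MPI_SOURCE
   when the library actually filled in the new layout. */
int source_seen_by_old_binding(const status_v2 *s) {
    return read_int_at(s, offsetof(status_v1, MPI_SOURCE));
}
```

Given a status_v2 with cancelled set to 0 and MPI_SOURCE set to 7, source_seen_by_old_binding returns 0 rather than 7, because the old offset for MPI_SOURCE now lands on cancelled.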

OK, so what to do? I could create two different versions of the defcstruct, one which works with MPICH 1.2 and the other for MPICH2, and then use conditional compilation (e.g., #+MPICH1 ... #+MPICH2 ...). The problem with that solution is that if the C definition of MPI_Status changes yet again in a future MPICH release, then we are back to square one.

In this case, the solution is to not directly use the defcstruct to create a foreign struct definition. Instead, CFFI-Grovel can be used to automagically generate the correct defcstruct. This is the file I created to use CFFI-Grovel for CL-MPI. Here (Line 47), I add the CFFI-Grovel declaration:

(cstruct MPI_Status "MPI_Status"
 (count "count" :type :int)
 (MPI_SOURCE "MPI_SOURCE" :type :int)
 (MPI_TAG "MPI_TAG" :type :int)
 (MPI_ERROR "MPI_ERROR" :type :int))
This states that I want to use a foreign (C) struct named "MPI_Status", giving it the Lisp name MPI_Status, and that I am interested in 4 integer fields: count, MPI_SOURCE, MPI_TAG, MPI_ERROR. This declaration does not specify anything about the ordering of the slots in the C struct. It also does not say anything about the completeness of this mapping. In other words, the C MPI_Status struct can contain other fields which are not mentioned in this CFFI-Grovel cstruct definition. In contrast, the original CFFI defcstruct used above is a more concrete declaration, which completely specifies the memory layout of the C struct we are mapping.

The CFFI-Grovel based code does the right thing for both MPICH 1.x and the latest MPICH2, and if future versions of MPICH2 change the MPI_Status struct definition by changing field order or adding fields, no problem.

Lesson learned: before directly declaring a foreign object with CFFI, consider whether CFFI-Grovel might be more appropriate. Using CFFI-Grovel is more robust to changes in the library to which we're binding, and if I had done this to begin with, it would have saved me several hours of painful debugging.
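As a rough picture of why groveling is robust, the idea is to ask the C compiler for the real layout instead of writing it down by hand. The sketch below is an assumption about the general technique, not the code CFFI-Grovel actually generates: emit_defcstruct is a hypothetical helper, the struct is a stand-in for the one a real run would get by including the installed mpi.h, and the exact form of the emitted Lisp may well differ from what the tool produces.

```c
#include <stddef.h>
#include <stdio.h>

/* Stand-in for the struct from the installed MPI header (MPICH2 layout here).
   A real grovel run would #include the header instead of repeating this. */
typedef struct MPI_Status {
    int count;
    int cancelled;
    int MPI_SOURCE;
    int MPI_TAG;
    int MPI_ERROR;
} MPI_Status;

/* Hypothetical helper: write a defcstruct form for the four fields of
   interest, using the size and offsets the C compiler reports.
   Returns the number of characters written, as snprintf does. */
int emit_defcstruct(char *buf, size_t buflen) {
    return snprintf(buf, buflen,
        "(cffi:defcstruct (MPI_Status :size %zu)\n"
        " (count :int :offset %zu)\n"
        " (MPI_SOURCE :int :offset %zu)\n"
        " (MPI_TAG :int :offset %zu)\n"
        " (MPI_ERROR :int :offset %zu))\n",
        sizeof(MPI_Status),
        offsetof(MPI_Status, count),
        offsetof(MPI_Status, MPI_SOURCE),
        offsetof(MPI_Status, MPI_TAG),
        offsetof(MPI_Status, MPI_ERROR));
}
```

Because the offsets come from offsetof on whatever header is installed, the unmentioned cancelled field simply shifts the numbers in the emitted form, and nothing in the hand-written declaration needs to change.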

3 comments:

  1. Interesting illustration with cstruct; I didn't realize that it would always pull out the right variables. That is useful.

    One drawback of CFFI-grovel is that the documentation and examples are quite thin. I had to use IOLib for a sample file to copy.

  2. As one of the original authors of CFFI-grovel, though I've since given up maintainership and the project has moved on, I'd like to say - thanks for the kind words!

    Probably someday I should get around to writing and donating the documentation to go with the donated code...

  3. lhealy - I ended up referring to IOLib for examples too. I found that once I saw some of the grovel files in IOLib and got going, the sparseness of docs wasn't a problem, but it would have saved a little time if the docs included more samples.

    dankna - Thanks for making CFFI-grovel available! It's a great tool.
