Sunday, May 31, 2009

Parallel Programming in Common Lisp

To kick off a series of posts about parallel programming in Common Lisp, let's first review what's available.

Two models for parallel programming

First, there are basically two models for parallel programming:

  • Parallelism based on threads (a.k.a. shared-memory model). Multiple threads are spawned by a single process. This is the most common approach in a multi-core machine. For example, we can spawn one computational thread per core. Memory can be shared among threads, so communications between computations is usually implemented by threads writing to shared memory spaces.

  • Parallelism based on message passing (a.k.a. distributed memory model). This is the most common and natural approach in a cluster of machines, because in a cluster of commodity machines, there's no hardware support for sharing RAM among multiple machines. Communication among computations is done by explicitly sending message structures according to some messaging protocol.

It is possible to mix these two models. That is, we can have message passing among multi-core machines, and within each machine, use thread-based parallelism.

Threads in Common Lisp

When discussing threads, we first have to distinguish between native threads and non-native threads. Native threads are handled by the operating system. Native threads allow us to take advantage of multiple cores. For example, in a 4-core machine, we can spawn 4 computational threads that are executed in parallel. Non-native threads are handled by the Lisp implementation, and do not allow true parallelism. If we spawn multiple non-native threads, they are executed sequentially (the threads yield control, usually cooperatively, to each other). So, when talking about parallel processing, we are really interested in native threads.

Some languages such as Java provide a standardized threading library. The Common Lisp standard does not include threads, so it is up to the implementations to provide threading facilities. Although there is no standard thread API for Common Lisp, Bordeaux-Threads is an attempt to provide a portable layer on top of the implementation-specific APIs.

Here's a table of threading support for most of the current Common Lisp implementations.

Native Non-native
ABCL all platforms -
Allegro Windows other platforms
CLISP - -
CMUCL - x86 (Linux,BSD)
Clozure CL all platforms -
Corman CL Windows -
GCL - -
Scieneer all platforms -
SBCL X86 (Linux,BSD) -
So, native thread support is not universal. If we're interested in parallelizing computations in Common Lisp, this is unfortunate, if a particular implementation of interest happens to not support native threading on the selected platform. You might think that since SBCL offers native threading on x86, and Clozure CL offers native threading everywhere, we're in pretty good shape. However, suppose that we want to use CLISP because the computational bottleneck in our application is bignum arithmetic, and CLISP has much faster bignum arithmetic than the other implementations -- no dice. Also, a particular application may require the use of non-portable code or libraries which require a particular implementation that doesn't support native threads for a required platform.

Message-Passing in Common Lisp

Message passing is less commonly used than threading in most languages (with the notable exception of Erlang). As far as I know, there is only one implementation which supports parallelism based on message passing, which is GCL, which includes an interface to MPI, originally by Gene Cooperman.

Coming up: A New, Portable Message-Passing Library for Common Lisp

I've been working on CL-MPI, a portable, CFFI-based binding for the MPI message passing library. CL-MPI enables portable, parallel programming in Common Lisp using a message-passing model on either a cluster of machines, or a single multicore machine. I have just made the most recent version (0.2.1) available for download. So far, I've tested CL-MPI with SBCL and CMUCL on Linux. I initially developed CL-MPI to do massively parallel computations on a large cluster at my university. But since we can run MPI on a single multi-core machine, one interesting use for CL-MPI is to provide a "poor man's multiprocessing solution" for some Common Lisp implementations which don't have native threading capabilities, such as CMUCL and CLISP. Unfortunately, CL-MPI is not really well documented yet, and as a prerequisite, you'll need to have basic understanding of the MPI programming model (here is an excellent tutorial). I'll write a series of posts describing CL-MPI and some example applications, and hopefully, this will be the basis of a tutorial. Stay tuned...

Friday, May 29, 2009

CFFI vs UFFI performance

The current, de facto portable foreign function interface (FFI) for Common Lisp is CFFI. An alternative to CFFI is UFFI, a slightly older FFI created by Kevin Rosenberg.

The CFFI package includes a compatibility layer for UFFI called CFFI-UFFI-COMPAT, which can be used as a drop-in replacement for UFFI -- in other words, if CFFI is installed on our machine, we can use systems that depend on UFFI simply by changing the dependency in the ASDF system definition from "UFFI" to "CFFI-UFFI-COMPAT". The obvious benefit is that installing UFFI becomes unnecessary.

A surprising side benefit can be improved performance.

I recently tried using CL-GD, a FFI wrapper for the GD graphics library. CL-GD uses UFFI. I implemented a simple function which compares the RGB components of both images.

(defun square (x)
  (the fixnum (* (the fixnum x) (the fixnum x))))

(defun compare-images (image1 image2)
 "Compare two images -- uses API exported by CL-GD"
 (declare (optimize (debug 0) (speed 3)(safety 0)))
 (let ((err 0)
       (height (cl-gd:image-height image1))
       (width (cl-gd:image-width image1)))
   (declare (fixnum err height width))
   (loop for y fixnum from 0 below height do
         (loop for x fixnum from 0 below width do
               (let ((img-color (cl-gd:get-pixel y x :image image2))
                     (target-color (cl-gd:get-pixel y x :image image1)))
                 (incf err (square (- (the fixnum (cl-gd:color-component 
                                                        :red img-color :image image2))
                                      (the fixnum (cl-gd:color-component 
                                                        :red target-color :image image1)))))
                 (incf err (square (- (the fixnum (cl-gd:color-component 
                                                        :blue img-color :image image2))
                                      (the fixnum (cl-gd:color-component 
                                                        :blue target-color :image image1)))))
                 (incf err (square (- (the fixnum (cl-gd:color-component 
                                                        :green img-color :image image2))
                                      (the fixnum (cl-gd:color-component 
                                                        :green target-color :image image1))))))))
   err))

First, the timing with the standard CL-GD: (SBCL 1.0.28, CL-GD 0.5.6, UFFI-1.6.1, 3GHz Intel Core2 Duo, Linux))

EVO-LISA> (progn
 (defparameter *image1* (cl-gd:create-image-from-file "023.JPG"))
 (defparameter *image2* (cl-gd:create-image-from-file "023.JPG"))
 (time (compare-images *image1* *image2*)))

height1=1024, height2=768
Evaluation took:
 1.684 seconds of real time
 1.688105 seconds of total run time (1.688105 user, 0.000000 system)
 100.24% CPU
 5,061,067,326 processor cycles
 141,296 bytes consed
0

1.68 seconds to compare a pair of 1024x768 images is extremely slow. Since iterating over a couple of 1024x768 arrays and computing their difference in Common Lisp doesn't take anything close to 1.7 seconds (see below), it seemed that the bottleneck must be in the FFI.

For many applications, the FFI overheads don't really matter that much, because the goal of using the FFI is to use a library written in another language without having to reimplement the library in Lisp. On the other hand, another reason for using foreign libraries is to use fast libraries implemented in C. Since CFFI is being used for low-level libraries such as IOLib, where performance matters, I thought I'd try using the UFFI compatibility layer for CFFI.

All I did was edit the cl-gd.asd file so that the dependency on UFFI was replaced by a dependency on CFFI-UFFI-COMPAT (CL-GD already depends on CFFI-UFFI-COMPAT for CLISP and OpenMCL, but by default, depends on UFFI for SBCL).

The improvement wasimpressive - 0.336 seconds using CFFI-UFFI-COMPAT, comapred to 1.68 seconds with UFFI, a 5x speedup.

EVO-LISA> (progn
 (defparameter *image1* (cl-gd:create-image-from-file "023.JPG"))
 (defparameter *image2* (cl-gd:create-image-from-file "023.JPG"))
 (time (compare-images *image1* *image2*)))

height1=1024, height2=768
Evaluation took:
 0.336 seconds of real time
 0.336021 seconds of total run time (0.336021 user, 0.000000 system)
 100.00% CPU
 1,010,111,976 processor cycles
 21,712 bytes consed

Of course, there is still a massive overhead due to accessing pixel information through the CL-GD API, which is partially due to the FFI layer, partially due to GD, and partially due to CL-GD. How large is this overhead? Here's a simple function which compares two color arrays written in pure Common Lisp: This is not even that optimized -- it doesn't even declare the types of image1 and image2.

(defstruct color
  (red 0 :type fixnum)
  (blue 0 :type fixnum)
  (green 0 :type fixnum))
(defun compare-2darrays (image1 image2 height width)
 "Compare two images -- uses API exported by CL-GD"
 (declare (fixnum  height width))
 (declare (optimize (debug 0) (speed 3)(safety 0)))
 (let ((err 0))
   (declare (fixnum err height width))
   (loop for y fixnum from 0 below height do
          (loop for x fixnum from 0 below width do
               (let ((color1 (aref image1 y x))
                     (color2 (aref image2 y x)))
                 (incf err (square (- (color-red color1) (color-red color2))))
                 (incf err (square (- (color-blue color1) (color-blue color2))))
                 (incf err (square (- (color-green color1) (color-green color2)))))))))
The results?
CL-USER> (let ((img1 (make-array '(1024 768) :initial-element (make-color)))
               (img2 (make-array '(1024 768) :initial-element (make-color))))
           (time (compare-2darrays img1 img2 1024 768)))
Evaluation took:
  0.048 seconds of real time
  0.048003 seconds of total run time (0.048003 user, 0.000000 system)
  100.00% CPU
  143,562,681 processor cycles
  0 bytes consed  

0.048 seconds to perform basically the same operations (iterate over both arrays, computing the square of their difference, compared to 0.336 seconds for the code which accesses CL-GD via CFFI.

I considered the overhead within CFFI/UFFI, but there's another source of inefficiency, which is the CL-GD binding code. As clearly stated in the docs, CL-GD wasn't designed for speed, and in order to provide a nice API, there is some overhead in the binding code. I'm abusing CL-GD by using it for repetitive pixel-level access because GD is a image creation library and not an image processing library, but I needed to load and create some GIF files, and the first library I ran into was CL-GD.

Here is another version of the CL-GD based code where I bypass the officially exported CL-GD API and use some internal functions (I know some people don't like it, but being able to access internal symbols with "::" is a great feature of Common Lisp, imho).

(defun compare-images-raw (image1 image2)
 "Compare two images -- use internal functions to eliminate overhead
added by CL-GD API"
 (declare (optimize (debug 0) (speed 3)(safety 0)))
 (let ((err 0)
       (height (cl-gd:image-height image1))
       (width (cl-gd:image-width image1))
       (img1 (cl-gd::img image1))
       (img2 (cl-gd::img image2)))
   (declare (fixnum err height width))
   (loop for y fixnum from 0 below height do
         (loop for x fixnum from 0 below width do
               (let ((img2-color (cl-gd:get-pixel y x :image image2))
                     (img1-color (cl-gd:get-pixel y x :image image1)))
                 (incf err (square (- (the fixnum(cl-gd::gd-image-get-red img2 img2-color))
                                      (the fixnum(cl-gd::gd-image-get-red img1 img1-color)))))
                 (incf err (square (- (the fixnum(cl-gd::gd-image-get-green img2 img2-color))
                                      (the fixnum(cl-gd::gd-image-get-green img1 img1-color)))))
                 (incf err (square (- (the fixnum(cl-gd::gd-image-get-blue img2 img2-color))
                                      (the fixnum(cl-gd::gd-image-get-blue img1 img1-color))))))))
   err))
The results:
EVO-LISA> (progn
           (defparameter *image1* (cl-gd:create-image-from-file "023.JPG"))
           (defparameter *image2* (cl-gd:create-image-from-file "023.JPG"))
           (time (compare-images-raw *image1* *image2*)))

height1=1024, height2=768
Evaluation took:
 0.282 seconds of real time
 0.284018 seconds of total run time (0.284018 user, 0.000000 system)
 100.71% CPU
 845,516,097 processor cycles
 21,936 bytes consed
So we're down to 0.282 seconds from 0.336 seconds. Noticeable, but still a long ways to go from the raw CL code. If this was an important enough example (it's not), I could try, for exmaple, eliminating the FFI overhead by digging around the GD API and possibly modifying it to implement a quicker way to grab an entire image at once, instead of pixel by pixel.

I haven't looked deeper into why CFFI turned out to be 5x faster than UFFI for this example, and there may be examples where UFFI is faster. But the lesson remains the same: (1) FFI can in the wrapper libraries we use can be a peformance bottleneck, (2) CFFI-UFFI-COMPAT offers a quick way to swap in CFFI for UFFI.

I sent a summary of the above to Edi Weitz, author of CL-GD (and many other great libraries!), and he suggested that I send a patch, after more testing -- unfortunately, CL-GD failed a couple of tests in the test suite and I don't have time to dig around right now. Maybe a little later...

CFFI-Grovel Rocks!

This weekend, I was working on CL-MPI, my Common Lisp bindings for MPI, a standard message passing interface which allows distributed memory parallel programming based on message passing.I had previously developed and tested CL-MPI with MPICH1.2, and I wanted to see how it worked with the latest version of MPICH2. Since the bindings are implemented with CFFI, I didn't expect any problems. However, the test suite started behaving in a very bizarre manner, crashing on some tests and freezing on others. I was stumped for a while, but the the investigation was an eye-opener: I needed to use a struct defined as follows in MPICH 1.2
typedef struct {
 int count;
 int MPI_SOURCE;
 int MPI_TAG;
 int MPI_ERROR;
#if (MPI_STATUS_SIZE > 4)
 int extra[MPI_STATUS_SIZE - 4];
#endif
} MPI_Status;
so I had declared the foreign struct in my binding source (line 65):
(cffi:defcstruct MPI_Status
 (count :int)
 (MPI_SOURCE :int)
 (MPI_TAG :int)
 (MPI_ERROR :int))
This basically specifies that MPI_Status is a C struct with int slots, and says that the first 4 bytes (int) of the struct is the count field, the next int is the MPI_SOURCE, and so on. Basically a straightforward translation of the C struct definition. This worked great -- until I tried to use the bindings with MPICH2. Running the CL-MPI test suite resulted in mysterious crashes and freezes -- not a happy transition to from MPICH to MPICH2! After thrashing around trying to isolate the problem, it turns out that in MPICH2, the MPI_Status struct definition had been changed to:
typedef struct MPI_Status {
   int count;
   int cancelled;
   int MPI_SOURCE;
   int MPI_TAG;
   int MPI_ERROR;
  
} MPI_Status;
Oops! No wonder everything got messed up -- Note the new 'cancelled' field inserted after the count field. The fields of the struct no longer aligned with the defcstruct defintion above!

OK, so what to do? I could create two different versions of the defcstruct, one which works with MPICH1.2 and the other for MPICH, and then use conditional compilation (e.g., #+MPICH1 #+MPICH2...). The problem with that solution is that if the C definition of MPI_Status changes yet again in a future MPICH release, then we are back to square one.

In this case, the solution is to not directly use the defcstruct to create a foreign struct definition. Instead, CFFI-Grovel can be used to automagically generate the correct defcstruct. This is the file I created to use CFFI-Grovel for CL-MPI. Here (Line 47), I add the CFFI-Grovel declaration:

(cstruct MPI_Status "MPI_Status"
 (count "count" :type :int)
 (MPI_SOURCE "MPI_SOURCE" :type :int)
 (MPI_TAG "MPI_TAG" :type :int)
 (MPI_ERROR "MPI_ERROR" :type :int))
This states that I want to use a foreign (C) struct named "MPI_Status", giving it the Lisp name MPI_Status, and that I am interested in 4 integer fields: count, MPI_SOURCE, MPI_TAG, MPI_ERROR. This declaration does not specify anything about the ordering of the slots in the C struct. It also does not say anything about the completeness of this mapping. In other words, the C MPI_Status struct can contain other fields which are not mentioned in this CFFI-grovel cstruct definition. In contrast, the original CFFI defcstruct used above is a more concrete declaration, which completely specifies the memory layout of the C struct we are mapping. The CFFI-Grovel based code does the right thing for both MPICH1.X and the latest MPICH2, and if future versions of MPICH2 change the MPI_Status struct definition by chaning field order or adding fields, no problem. Lesson learned: before directly declaring a foreign object with CFFI, consider whether CFFI-Grovel might be more appropriate. Using CFFI-Grovel is more robust to changes in the library to which we're binding, and if I had done this to begin with, it would have saved me several hours of painful debugging.

Wednesday, May 27, 2009

(format blog-stream "hello world")

This blog will focus on Lisp programming. I'm mostly using Common Lisp for my AI and algorithms research, so entries will be related to issues I run into while I use Lisp for work. Hopefully, these will be useful for others interested in using Lisp for algorithmic research.