Sunday, December 20, 2009

closing in on the SBCL thread mystery?

I have distilled yesterday's messy code down to a minimal chunk that triggers the bizarre multi-threaded behavior on SBCL 1.0.33:
(defun compute-thread (thread-num rows-per-proc min max mul)
  ;; a pretty straightforward computation
  (let* ((local-min (+ min (* thread-num rows-per-proc)))
         (local-max (1- (+ local-min rows-per-proc)))
         (local-count 0))
    (loop for i from local-min upto local-max
          do (loop for j from min upto max
                   do (incf local-count
                            (let ((xc (* mul i)))
                              (loop for count from 0 below 100
                                    do (when (>= xc 4)
                                         (return count))
                                       (incf xc)
                                    finally (return 100))))))
    #+nil(format *out* "Thread ~a local-min=~a, local-max=~a local-count=~d~%"
                 thread-num local-min local-max local-count)
    local-count))


(defun main (num-threads)
  ;; spawn some threads and sum the results
  (loop with rows-per-proc = (/ 100 num-threads)
        for thread in
        (loop for thread-num from 0 below num-threads
              collect
              (let ((thread-num thread-num)) ; establish private binding of thread-num for the closure
                (sb-thread:make-thread
                 (lambda ()
                   (compute-thread thread-num rows-per-proc -250 250 0.008d0)))))
        summing (sb-thread:join-thread thread)))

(defun test (num-threads num-iterations expected-val)
  ;; this is just a test wrapper which tests that the result of main is consistent
  (loop for i from 0 below num-iterations do
    (format t "Run ~a:" i)
    (let ((result (main num-threads)))
      (format t "result=~a~%" result)
      (assert (=  expected-val result)))))
I expect (test 10 1000 (main 1)) to complete without triggering the assertion, because the result of (main num-threads) should always be the same as the result of (main 1). Unfortunately:
CL-USER> (test 10 1000 (main 1)) ;; test 10 threads
Run 0:result=300600
Run 1:result=300600
Run 2:result=300600
Run 3:result=300600
.
.  
.
Run 494:result=300600
Run 495:result=300600
Run 496:result=300600
Run 497:result=300600
Run 498:result=300601   ;;<---- Oh no!!!
; Evaluation aborted.
Arghh... ;-( As a sanity check: (test 1 1000 (main 1)) completes without problems -- in other words, 1 thread always seems to compute the same answer, so it seems to be a multi-thread problem.

Saturday, December 19, 2009

A SBCL multi-threaded computation mystery

Update: 2009/12/20 -- this is obsolete -- I now have a much cleaner, shorter piece of code which replicates the bizarre behavior below, in today's post. I ran into this thread-related mystery a while ago, forgot all about it, and then ran into the problem again today on a similar chunk of code and wasted a bunch of time, so I'll put it up here. Maybe somebody with experience in SBCL+threads is reading and could give me a hint ;-) Here is a multithreaded version of the simple Mandelbrot benchmark which I earlier parallelized using CL-MPI:
(defconstant +resolution+ 500)
(defconstant +iterations+ 100)
(defparameter *start-time* nil )  
(defparameter *out* *standard-output*)

(defun iters (max xc yc)
  (let ((x xc)
        (y yc))
    (declare (double-float x y)
             (fixnum max))
    (loop for count from 0 below max
          do (when (>= (+ (* x x) (* y y)) 4.0d0)
               (return-from iters count))
             (let ((tmp (+ (- (* x x) (* y y)) xc)))
               (setf y (+ (* 2.0d0 x y) yc)
                     x tmp)))
    max))

(defun compute-thread (thread-num rows-per-proc min max mul)
  (declare (fixnum thread-num rows-per-proc min max))
  (declare (type double-float mul))
  (let* ((local-min (+ min (the fixnum (* thread-num rows-per-proc))))
         (local-max (1- (+ local-min rows-per-proc)))
         (local-count 0))
    (declare (fixnum local-min local-max local-count))
    (loop for i fixnum from local-min upto local-max
          do (loop for j fixnum from min upto max
                   do (incf local-count
                            #+call-func (iters +iterations+ (* mul i) (* mul j))
                            #-call-func
                            (let* ((xc (* mul i))
                                   (yc (* mul j))
                                   (x xc)
                                   (y yc))
                              (declare (double-float x y))
                              (loop for count from 0 below +iterations+
                                    do (cond ((>= (+ (* x x) (* y y)) 4.0d0)
                                              (return count))
                                             (t
                                              (let ((tmp (+ (- (* x x) (* y y)) xc)))
                                                (setf y (+ (* 2.0d0 x y) yc)
                                                      x tmp))))
                                    finally (return +iterations+))))))
    (format *out* "Thread ~a local-min=~a, local-max=~a local-count=~d time ~,4f~%"
            thread-num local-min local-max local-count
            (float (/ (- (get-internal-run-time) *start-time*) internal-time-units-per-second)))
    local-count))

(defun mandelbrot-main (num-threads)
  "straightforward parallelization which partitions the work into equal regions,
   and each region is assigned to a process."
  (declare (fixnum num-threads))
  (format t "---~%")
  (setf *start-time* (get-internal-run-time))
  (let* ((max (truncate +resolution+ 2))
         (min (- max))
         (mul (/ 2.0d0 max)))
    ;; spawn computation threads
    (loop with rows-per-proc = (/ +resolution+ num-threads)
          for thread in
          (loop for thread-num from 0 below num-threads
                collect
                (let ((thread-num thread-num)) ; needed to establish private binding for closure
                  (sb-thread:make-thread
                   (lambda ()
                     (compute-thread thread-num rows-per-proc min max mul)))))
          summing (sb-thread:join-thread thread))))

Note that in compute-thread, there is a conditionally compiled chunk of code: if *features* contains :call-func, the function iters is called. However, if :call-func is not in *features*, then instead, I compile a chunk of code which performs the exact same computation as the iters function. This code should behave identically regardless of #+call-func or #-call-func, right? Wrong -- apparently... ;-( The mandelbrot-main function returns the number of points which are determined to be in the Mandelbrot set, so here is a test function to verify that the computation is consistent and always computes the same, expected value:
(defun test (num-threads num-iterations expected-val)
  (loop for i from 0 below num-iterations do
    (format t "Run ~a:" i)
    (let ((result (mandelbrot-main num-threads)))
      (format t "result=~a~%" result)
      (assert (=  expected-val result)))))
For +resolution+ = 500 and +iterations+ = 100, the result should be 2905274. When compiled with #-call-func, (test 10 1000 2905274) completes without asserting an inconsistency. However, when compiled with #+call-func, (test 10 1000 2905274) asserts an inconsistency after some non-deterministic number of repetitions, e.g.,
CL-USER> (test 10 1000 2905274)
Run 0:---
Thread 1 local-min=-200, local-max=-151 local-count=103224 time 0.0000
Thread 0 local-min=-250, local-max=-201 local-count=29924 time 0.0000
Thread 3 local-min=-100, local-max=-51 local-count=596048 time 0.0120
Thread 2 local-min=-150, local-max=-101 local-count=364384 time 0.0200
Thread 4 local-min=-50, local-max=-1 local-count=978228 time 0.0400
Thread 6 local-min=50, local-max=99 local-count=52917 time 0.0400
Thread 7 local-min=100, local-max=149 local-count=24513 time 0.0400Thread
8 local-min=150, local-max=199 local-count=17788 time 0.0400
Thread 9 local-min=200, local-max=249 local-count=10354 time 0.0440
Thread 5 local-min=0, local-max=49 local-count=727894 time 0.0480
result=2905274
.
. 
.
Run 146:---
Thread 0 local-min=-250, local-max=-201 local-count=29924 time 0.0000
Thread 1 local-min=-200, local-max=-151 local-count=103224 time 0.0000
Thread 2 local-min=-150, local-max=-101 local-count=364384 time 0.0120
Thread 3 local-min=-100, local-max=-51 local-count=596048 time 0.0200
Thread 4 local-min=-50, local-max=-1 local-count=978228 time 0.0240
Thread 7 local-min=100, local-max=149 local-count=24513 time 0.0240
Thread 6 local-min=50, local-max=99 local-count=52917 time 0.0280
Thread 8 local-min=150, local-max=199 local-count=17788 time 0.0360
Thread 9 local-min=200, local-max=249 local-count=10354 time 0.0360
Thread 5 local-min=0, local-max=49 local-count=727894 time 0.0440
result=2905274
Run 147:---
Thread 0 local-min=-250, local-max=-201 local-count=29924 time 0.0080
Thread 1 local-min=-200, local-max=-151 local-count=103224 time 0.0080
Thread 2 local-min=-150, local-max=-101 local-count=364384 time 0.0120
Thread 8 local-min=150, local-max=199 local-count=17788 time 0.0280
Thread 6 local-min=50, local-max=99 local-count=52917 time 0.0360
Thread 3 local-min=-100, local-max=-51 local-count=596147 time 0.0360
Thread 9 local-min=200, local-max=249 local-count=10354 time 0.0360
Thread 7 local-min=100, local-max=149 local-count=24513 time 0.0360
Thread 5 local-min=0, local-max=49 local-count=727894 time 0.0360
Thread 4 local-min=-50, local-max=-1 local-count=978228 time 0.0400
result=2905373    <--- Uh, oh!!!!!!!!!!!
.
.
and SLIME complains:
The assertion (= EXPECTED-VAL RESULT) failed.
Yikes! Looking closer, it looks like Thread 3 has a discrepancy: In the "correct" Runs (i.e., for the first 146 Runs), the output from Thread 3 was:
Thread 3 local-min=-100, local-max=-51 local-count=596048 time 0.0200
but on the 147th Run, the output is:
Thread 3 local-min=-100, local-max=-51 local-count=596147 time 0.0360
I have absolutely no idea what could be causing this. The most baffling thing is that when compiled with #-call-func, there is no problem -- that is, calling the iters function instead of executing an inlined version of the same computation seems to be responsible for the problem! Very, very bizarre, because the iters function and its caller don't touch any special variables or do anything else which looks unsafe to me. In addition, there's no problem if I compile with #+call-func but use only 1 thread:
(test 1 1000 2905274)
If I use more than 1 thread, the inconsistency eventually shows up (the more threads, the earlier the problem appears). So this must be something related to multithreading. Is it possible that this is a bug in SBCL, or am I making some silly thread-related mistake here? (I'm running this with SBCL 1.0.33 and SBCL 1.0.32, x86-64.)
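(In case the mechanics of the conditional compilation are not obvious: the variant is chosen at compile time via *features*. A minimal sketch of how the feature can be toggled before compiling the file -- this is plain Common Lisp, nothing SBCL- or CL-MPI-specific:)

(eval-when (:compile-toplevel :load-toplevel :execute)
  (pushnew :call-func *features*))   ; compile the variant that calls ITERS
;; To compile the inlined variant instead, remove the feature before compiling:
;; (setf *features* (remove :call-func *features*))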

Thursday, June 4, 2009

Parallel Mandelbrot computation in Lisp on a cluster

[quick announcement - CL-MPI 0.2.2 is available - this contains today's example code, and adds the with-mpi macro mentioned in the previous post, which was missing in 0.2.1]

This post describes a very simple parallelization of Mandelbrot set computation in Common Lisp using CL-MPI, with results on a multicore machine as well as a cluster using 50 cores.

Recently, there was a Reddit discussion about a Mandelbrot set benchmark. SBCL developer Nikodemus Siivola posted a Common Lisp version on his blog. By coincidence, the Mandelbrot set is a nice and simple computation which can be easily parallelized. I am not jumping into the benchmark game / language speed debate with this post [I'll save that for later ;-)]. I'm just using this as a simple example of parallelization using MPI.

The original sequential code by Nikodemus is here. Here's the parallelized version:

(declaim (optimize (speed 3) (debug 0)(safety 0)))

(defconstant +resolution+ 5000)
(defconstant +iterations+ 100)

(declaim (inline iters))

(defun iters (max xc yc)
  (declare (double-float xc yc))
  (let ((x xc)
        (y yc))
    (declare (double-float x y)
             (fixnum max))
    (loop for count from 0 below max
          do (when (>= (+ (* x x) (* y y)) 4.0d0)
               (return-from iters count))
             (let ((tmp (+ (- (* x x) (* y y)) xc)))
               (setf y (+ (* 2.0d0 x y) yc)
                     x tmp)))
    max))


(defun main ()
  "straightforward parallelization which partitions the work into equal regions,
   and each region is assigned to a process."
  (let* ((max (truncate +resolution+ 2))
         (min (- max))
         (mul (/ 2.0d0 max))
         (local-count 0))
    (declare (fixnum local-count))
    (mpi:with-mpi
      (let* ((start-time (mpi:mpi-wtime))
             (num-procs (mpi:mpi-comm-size))
             (my-rank (mpi:mpi-comm-rank))
             (rows-per-proc (/ +resolution+ num-procs))
             (local-min (+ min (* my-rank rows-per-proc)))
             (local-max (1- (+ local-min rows-per-proc))))
        (declare (fixnum local-min local-max my-rank num-procs))
        (mpi:formatp t "local-min=~a, local-max=~a~%" local-min local-max)
        (loop for i from local-min upto local-max
              do (loop for j from min upto max
                       do (incf local-count (iters +iterations+ (* mul i) (* mul j)))))
        (mpi:formatp t "local-count = ~d computed after time ~,4f seconds~%" 
                     local-count (- (mpi:mpi-wtime) start-time))
        ;; aggregate local counts at rank 0
        (cond ((> my-rank 0) ; send local result to rank 0
               (mpi:mpi-send-auto local-count 0))
              (t ;my-rank is  0. Receive and aggregate local results.
               (loop with count fixnum = local-count
                     for rank from 1 below num-procs
                     do (incf count (mpi:mpi-receive-auto rank))
                     finally (format t "runtime=~,4f seconds, result ~d~%" 
                                      (- (mpi:mpi-wtime) start-time) count))))))))

The iters function is identical to the sequential code, but the main function has been modified for parallelization.

I used the simplest possible parallelization strategy, which is to partition the +resolution+ x +resolution+ grid where the Mandelbrot set is computed into num-procs regions, assigning each region to a processor. As described in the previous post, MPI will execute this program on all processors, but each processor has a unique rank (ID). For each process, the variables local-min and local-max determine the block of work to be performed (the rows assigned to the process), and iters is called for every point in the assigned region. After the computations are completed, the local results (local-count) are sent to the rank 0 process and aggregated into the final result (count). See the previous post for an explanation of point-to-point communication using mpi-send-auto and mpi-receive-auto.

Here's the output of running this (a 5000 x 5000 region, 100 iterations per point) on 1 core of a quad-core desktop machine. Note that I'm using (mpi-wtime), a wrapper for MPI_Wtime, which is a clock in the MPI runtime environment, so this only measures the code within the with-mpi macro (the time to start up SBCL+MPI is not included).

alexfs04@tesla1:~/cl-mpi/examples$ mpirun -np 1 /usr/local/bin/sbcl --load "mandel-bench-1" --eval "(progn (main) (sb-ext:quit))"
[0 :1244124821.1584]: local-min=-2500, local-max=2499
[0 :1244124824.3105]: local-count = 290068052 computed after time 3.1526 seconds
runtime=3.1527 seconds, result 290068052
Now, here's the output using all 4 cores on the same machine (Core2, 2GHz)
alexfs04@tesla1:~/cl-mpi/examples$ mpirun -np 4 /usr/local/bin/sbcl --load "mandel-bench-1" --eval "(progn (main) (sb-ext:quit))"
[1 :1244125086.7273]: local-min=-1250, local-max=-1
[2 :1244125086.7327]: local-min=0, local-max=1249
[3 :1244125086.7390]: local-min=1250, local-max=2499
[0 :1244125086.7478]: local-min=-2500, local-max=-1251
[3 :1244125086.8080]: local-count = 3840768 computed after time 0.0699 seconds
[0 :1244125087.1473]: local-count = 31518526 computed after time 0.4001 seconds
[2 :1244125087.5538]: local-count = 78775348 computed after time 0.8220 seconds
[1 :1244125088.5280]: local-count = 175933410 computed after time 1.7702 seconds
runtime=1.7811 seconds, result 290068052
The speedup compared to 1 core is 3.1527/1.7811 = 1.77. The reason for the inefficiency is clear: there is a huge variance in how long it takes to complete the 4 local-count computations. While proc 3 finished in 0.0699 seconds, proc 1 took 1.7702 seconds! This is because the number of times iters goes through its loop depends on each particular point. We could certainly do better by devising a better parallelization strategy, and that's a topic for a future post.

OK, so what's the big deal? We could have gotten the same 4-core results with thread-based parallelization using the same algorithm, so why bother with MPI?

The reason, of course, is that we can run this exact same code on any cluster of machines supporting MPI, so unlike threads, we're not limited to a single machine. I am fortunate enough to have access to a huge cluster, so here are the results using 50 cores (Opteron, 2.4GHz).

[snip...]
[24 :0.1349]: local-min=-100, local-max=-1
[23 :0.1350]: local-min=-200, local-max=-101
[26 :0.1351]: local-min=100, local-max=199
[47 :0.1352]: local-min=2200, local-max=2299
[49 :0.1353]: local-min=2400, local-max=2499
[48 :0.1354]: local-min=2300, local-max=2399
[34 :0.1349]: local-min=900, local-max=999
[49 :0.1410]: local-count = 94380 computed after time 0.0071 seconds
[48 :0.1418]: local-count = 169998 computed after time 0.0079 seconds
[47 :0.1419]: local-count = 217754 computed after time 0.0080 seconds
[snip a bunch of output...]
[18 :0.2935]: local-count = 15838486 computed after time 0.1594 seconds
[20 :0.3024]: local-count = 16782526 computed after time 0.1683 seconds
[21 :0.3091]: local-count = 17468466 computed after time 0.1750 seconds
[24 :0.3328]: local-count = 20181214 computed after time 0.1992 seconds
[23 :0.3574]: local-count = 22819862 computed after time 0.2238 seconds
[22 :0.3400]: local-count = 20718664 computed after time 0.2060 seconds
runtime=0.2286 seconds, result 290068052

Cleaning up all processes ...
done.
[Ending Job 4764466]
EXIT_CODE:0
There are other alternatives for running parallel lisp programs on a cluster, such as Fare's library, and you can always build your own parallelization infrastructure using sockets. However, on most large-scale, shared clusters used by many users, such as the one I use, these are not feasible because users can't directly address/access individual compute nodes (among other things, that would make task/job scheduling a nightmare), but MPI is the standard API available in such environments.

In future posts, I'll describe how to parallelize the Mandelbrot computation more efficiently to get better speedups. [edit: removed incorrect statement that MPI_Wtime is synchronized - MPI_Wtime is not synch'd unless MPI_WTIME_IS_GLOBAL is set] [edit: fixed mangled code indentation.]

Tuesday, June 2, 2009

A message-passing "Hello World" in CL-MPI

[Edit: added paragraph breaks]

Let's start the tour of CL-MPI with a parallel "Hello World" that demonstrates point-to-point communications among processes, and an introduction to the MPI runtime. This post assumes you know little or nothing about MPI, so the pace may be a bit slow for people who already know MPI. I promise to pick up the pace in future posts, so stay tuned!

1  #+sbcl(require 'asdf)
2  (eval-when (:compile-toplevel :load-toplevel :execute)
3   (asdf:operate 'asdf:load-op 'cl-mpi))
4
5  (defun mpi-hello-world ()
6    "Each process other than the root (rank 0) sends a message to process 0"
7    (mpi::with-mpi  ;macro which initializes and finalizes  MPI
8       (let ((tag 0))
9         (cond ((/= 0 (mpi:mpi-comm-rank))
10               (let ((msg (format nil "Greetings from process ~a!" (mpi:mpi-comm-rank))))
11                 ;;send msg to proc 0
12                 (mpi:mpi-send-auto msg 0 :tag tag)))
13              (t ; rank is 0
14               (loop for source from 1 below (mpi:mpi-comm-size) do
15                     ;; receive and print message from each processor
16                     (let ((message (mpi:mpi-receive-auto  source :tag tag)))
17                       (format t "~a~%" message))))))))
Before describing this code, here is a quick overview of the Single-Program, Multiple-Data (SPMD) execution model used by MPI: In the SPMD model, all processes execute the same program independently. If we want to run a program on 4 processors using MPI, then MPI spawns 4 separate instances of the program, each as a separate process. Each process has a corresponding identifier, which in MPI terminology is called a rank.

Let's try to run our hello-world from the shell (keep reading to see why it has to be from the shell; if it doesn't become obvious, see the footnotes).

If this was an ordinary, sequential SBCL program, we'd execute it as:

alexf@tesla1:~/cl-mpi/examples$ /usr/local/bin/sbcl --noinform --load "mpi-hello-world" --eval "(progn (mpi-hello-world)(sb-ext:quit))"
Using the MPICH1.2 implementation of MPI, we do the following (with resulting output):
alexf@tesla1:~/cl-mpi/examples$ mpirun -np 4 /usr/local/bin/sbcl --noinform --load "mpi-hello-world" --eval "(progn (mpi-hello-world)(sb-ext:quit))"

Greetings from process 1!
Greetings from process 2!
Greetings from process 3!
alexf@tesla1:~/cl-mpi/examples$ 
As you can see, all we did was add "mpirun -np 4" to the previous command line. The MPI runtime environment is started with an executable, usually called mpirun or mpiexec. We used mpirun. The "-np 4" option specifies that 4 processes are to be started.

We can see what goes on by adding a (break) between lines 8 and 9 in the code. Now, if we enter the same command line, we'll get a break and a REPL. We can then run ps (from another window) and see that there are in fact 4 SBCL processes running (spawned by mpirun), all started with the same command line parameters.1

alexf@tesla1:~/cl-mpi$ ps -aef | grep sbcl
alexf 25743 25740  0 17:58 ?        00:00:22 /usr/local/bin/sbcl
alexf 26978 26975  0 22:46 pts/3    00:00:00 /bin/sh -c mpirun -np 4 /usr/local/bin/sbcl --load "mpi-hello-world" --eval "(progn (mpi-hello-world)(sb-ext:quit))"
alexf 26979 26978  0 22:46 pts/3    00:00:00 /bin/sh /usr/bin/mpirun -np 4 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27005 26979 14 22:46 pts/3    00:00:00 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27006 27005  0 22:46 pts/3    00:00:00 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27007 27005  0 22:46 pts/3    00:00:00 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27008 27005  0 22:46 pts/3    00:00:00 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27011 28544  0 22:46 pts/6    00:00:00 grep sbcl
The output shown above from the hello-world program shows that processes 1-3 sent a greeting message (to process 0), and these were displayed to standard output. Now, let's look at the code: Lines 1-3 are boilerplate for loading the CL-MPI library. The macro with-mpi (line 7) executes the body in an MPI environment. The MPI_Init and MPI_Finalize functions are called before and after the body.
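As a rough mental model only (this is not the actual CL-MPI source, and the wrapper names mpi-init and mpi-finalize are assumptions on my part), a with-mpi-style macro can be pictured as:

(defmacro with-mpi-sketch (&body body)
  ;; sketch: initialize MPI, run the body, and guarantee finalization
  ;; even if the body exits non-locally
  `(progn
     (mpi-init)          ; assumed wrapper for MPI_Init
     (unwind-protect
          (progn ,@body)
       (mpi-finalize)))) ; assumed wrapper for MPI_Finalize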

The interesting behavior starts on Line 9. Up to this point, all processors have been executing the same code (all processors have loaded the cl-mpi library, started executing the mpi-hello-world function, entered the body of the with-mpi macro, and initialized tag to 0). Recall that each process has an associated rank. Rank 0 is usually referred to as the "root". Line 9 uses mpi-comm-rank, a wrapper for MPI_Comm_rank, to get the process rank, and branches on the rank of the processor, so lines 10-12 are executed on all processors except processor 0, and lines 14-17 are executed only on processor 0.

Let's consider what happens on processors with rank > 0. Line 10 creates a message which contains the process rank, and line 12 calls mpi-send-auto to perform a blocking send operation which sends the message to process 0. A blocking send operation is a point-to-point, synchronous operation where the sender process waits for the receiver to receive the message before continuing execution (MPI and CL-MPI also support non-blocking, asynchronous communications).

Meanwhile, the process with rank 0 executes the loop in line 14, where source iterates from 1 up to (but not including) (mpi-comm-size), a wrapper for MPI_Comm_size, which returns the total number of processes2. At each iteration, line 16 performs a blocking receive operation using mpi-receive-auto, which waits to receive a message from the given source, and line 17 prints the received message.

Note that lines 10-12 are executed in parallel in the processes with rank > 0. The order in which these processes initiate the message send is not specified. However, on process 0, the receiver, lines 14-17 are executed sequentially, and line 16 first waits for a message from process 1 to be received before continuing, regardless of the order in which messages were actually sent. Thus, the structure of lines 14-17 imposes a serialization on the order in which messages are processed and printed by process 0, so the output above shows the message from process 1 first, followed by process 2, and finally process 3. (We do not have to process the messages in this particular order. For example, MPI_Probe or MPI_Waitany can be used to process messages in the order in which they arrive.)

The send/receive functions mpi-send-auto and mpi-receive-auto are wrappers around the MPI_Send and MPI_Recv functions, but they do a bit more work than their C counterparts. Both MPI_Send and MPI_Recv require that the type and size of the objects be fully specified. The "-auto" part of mpi-send-auto and mpi-receive-auto implements a simple protocol by which any object can be sent and received without specifying its type or size. This is convenient in cases where communications efficiency is not critical. CL-MPI also provides lower-level functions which map more directly to MPI_Send and MPI_Recv so that you can squeeze out performance, but even for applications where every bit of performance matters, it's nice to be able to use the -auto versions during the development phase (you know, the whole "dynamic language" thing...)
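To make the convenience concrete, here is a minimal sketch (using the same mpi-send-auto / mpi-receive-auto calls as above, and assuming it is launched under mpirun with at least two processes) that ships an arbitrary Lisp object from rank 1 to rank 0 without any type or size declarations:

(defun send-anything-demo ()
  "Rank 1 sends a structured Lisp object to rank 0, which prints it."
  (mpi:with-mpi
    (cond ((= 1 (mpi:mpi-comm-rank))
           ;; the -auto protocol figures out how to marshal the object
           (mpi:mpi-send-auto (list :from-rank 1 :payload '(a b c) :answer 42) 0))
          ((= 0 (mpi:mpi-comm-rank))
           (format t "rank 0 received: ~s~%" (mpi:mpi-receive-auto 1))))))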

So that was a simple example of point-to-point communications. Coming up: more complicated examples that show what parallel programming in CL-MPI is actually good for.


1 Warning -- if you actually try this, you'll have to kill all of the SBCL processes. As can be expected when multiple Lisp processes are all contending for standard input/output, I/O in MPI when multiple REPLs get invoked is a messy issue. If you were puzzled about all this talk of running things from the command line in Lisp, it should be clear now -- the multi-process MPI runtime and the usual way of using Lisp with interactive REPLs don't mix too well. There can be some issues with debugging and thread-based parallelism, but with process-based parallelism such as MPI, the problem is worse. On the other hand, the REPLs do work in a more or less reasonable way; you just have to be careful how you use them (maybe I'll write a post on parallel debugging at some point). Also, the debugging situation with CL-MPI is no worse than with any other MPI language binding -- this is the price to be paid for a framework that allows parallelism across different machines.

2 MPI Communicators (what "Comm" stands for) are more complex than that, but let's ignore that for now.

Sunday, May 31, 2009

Parallel Programming in Common Lisp

To kick off a series of posts about parallel programming in Common Lisp, let's first review what's available.

Two models for parallel programming

First, there are basically two models for parallel programming:

  • Parallelism based on threads (a.k.a. shared-memory model). Multiple threads are spawned by a single process. This is the most common approach on a multi-core machine. For example, we can spawn one computational thread per core. Memory can be shared among threads, so communication between computations is usually implemented by threads writing to shared memory.

  • Parallelism based on message passing (a.k.a. distributed memory model). This is the most common and natural approach in a cluster of machines, because in a cluster of commodity machines, there's no hardware support for sharing RAM among multiple machines. Communication among computations is done by explicitly sending message structures according to some messaging protocol.

It is possible to mix these two models. That is, we can have message passing among multi-core machines, and within each machine, use thread-based parallelism.

Threads in Common Lisp

When discussing threads, we first have to distinguish between native threads and non-native threads. Native threads are handled by the operating system. Native threads allow us to take advantage of multiple cores. For example, in a 4-core machine, we can spawn 4 computational threads that are executed in parallel. Non-native threads are handled by the Lisp implementation, and do not allow true parallelism. If we spawn multiple non-native threads, they are executed sequentially (the threads yield control, usually cooperatively, to each other). So, when talking about parallel processing, we are really interested in native threads.

Some languages such as Java provide a standardized threading library. The Common Lisp standard does not include threads, so it is up to the implementations to provide threading facilities. Although there is no standard thread API for Common Lisp, Bordeaux-Threads is an attempt to provide a portable layer on top of the implementation-specific APIs.
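For example, here is a small sketch of spawning native threads portably through Bordeaux-Threads (assuming its bt:make-thread / bt:join-thread entry points); each worker writes its partial result into its own slot of a shared vector, so no locking is needed:

(defun parallel-sum (num-threads n)
  "Sum the integers 0 .. n-1 using NUM-THREADS native threads."
  (let* ((chunk   (ceiling n num-threads))
         (results (make-array num-threads :initial-element 0))
         (threads
           (loop for i from 0 below num-threads
                 collect (let ((i i)                           ; private binding for the closure
                               (lo (* i chunk))
                               (hi (min n (* (1+ i) chunk))))
                           (bt:make-thread
                            (lambda ()
                              (setf (aref results i)
                                    (loop for k from lo below hi sum k))))))))
    (mapc #'bt:join-thread threads)   ; wait for all workers to finish
    (reduce #'+ results)))

;; e.g. (parallel-sum 4 1000000) => 499999500000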

Here's a table of threading support for most of the current Common Lisp implementations.

Implementation   Native            Non-native
ABCL             all platforms     -
Allegro          Windows           other platforms
CLISP            -                 -
CMUCL            -                 x86 (Linux,BSD)
Clozure CL       all platforms     -
Corman CL        Windows           -
GCL              -                 -
Scieneer         all platforms     -
SBCL             x86 (Linux,BSD)   -
So, native thread support is not universal, which is unfortunate if we're interested in parallelizing computations in Common Lisp and a particular implementation of interest happens not to support native threading on the platform we need. You might think that since SBCL offers native threading on x86, and Clozure CL offers native threading everywhere, we're in pretty good shape. However, suppose that we want to use CLISP because the computational bottleneck in our application is bignum arithmetic, and CLISP has much faster bignum arithmetic than the other implementations -- no dice. Also, a particular application may require non-portable code or libraries which tie us to an implementation that doesn't support native threads on the required platform.

Message-Passing in Common Lisp

Message passing is less commonly used than threading in most languages (with the notable exception of Erlang). As far as I know, GCL is the only Common Lisp implementation which supports parallelism based on message passing; it includes an interface to MPI originally written by Gene Cooperman.

Coming up: A New, Portable Message-Passing Library for Common Lisp

I've been working on CL-MPI, a portable, CFFI-based binding for the MPI message passing library. CL-MPI enables portable parallel programming in Common Lisp using a message-passing model on either a cluster of machines or a single multicore machine. I have just made the most recent version (0.2.1) available for download. So far, I've tested CL-MPI with SBCL and CMUCL on Linux. I initially developed CL-MPI to do massively parallel computations on a large cluster at my university. But since we can run MPI on a single multi-core machine, one interesting use for CL-MPI is to provide a "poor man's multiprocessing solution" for some Common Lisp implementations which don't have native threading capabilities, such as CMUCL and CLISP. Unfortunately, CL-MPI is not really well documented yet, and as a prerequisite, you'll need to have a basic understanding of the MPI programming model (here is an excellent tutorial). I'll write a series of posts describing CL-MPI and some example applications, and hopefully, this will be the basis of a tutorial. Stay tuned...

Friday, May 29, 2009

CFFI vs UFFI performance

The current, de facto portable foreign function interface (FFI) for Common Lisp is CFFI. An alternative to CFFI is UFFI, a slightly older FFI created by Kevin Rosenberg.

The CFFI package includes a compatibility layer for UFFI called CFFI-UFFI-COMPAT, which can be used as a drop-in replacement for UFFI -- in other words, if CFFI is installed on our machine, we can use systems that depend on UFFI simply by changing the dependency in the ASDF system definition from "UFFI" to "CFFI-UFFI-COMPAT". The obvious benefit is that installing UFFI becomes unnecessary.
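For example, switching a (hypothetical) UFFI-based system over is a one-line change in its ASDF system definition; everything else stays the same. The system and file names below are made up for illustration:

;; before: the wrapper system loads UFFI itself
(asdf:defsystem :my-gd-wrapper
  :depends-on (:uffi)
  :components ((:file "bindings")))

;; after: the same code runs on top of CFFI's UFFI compatibility layer
(asdf:defsystem :my-gd-wrapper
  :depends-on (:cffi-uffi-compat)
  :components ((:file "bindings")))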

A surprising side benefit can be improved performance.

I recently tried using CL-GD, an FFI wrapper for the GD graphics library. CL-GD uses UFFI. I implemented a simple function which compares the RGB components of two images.

(defun square (x)
  (the fixnum (* (the fixnum x) (the fixnum x))))

(defun compare-images (image1 image2)
 "Compare two images -- uses API exported by CL-GD"
 (declare (optimize (debug 0) (speed 3)(safety 0)))
 (let ((err 0)
       (height (cl-gd:image-height image1))
       (width (cl-gd:image-width image1)))
   (declare (fixnum err height width))
   (loop for y fixnum from 0 below height do
         (loop for x fixnum from 0 below width do
               (let ((img-color (cl-gd:get-pixel y x :image image2))
                     (target-color (cl-gd:get-pixel y x :image image1)))
                 (incf err (square (- (the fixnum (cl-gd:color-component 
                                                        :red img-color :image image2))
                                      (the fixnum (cl-gd:color-component 
                                                        :red target-color :image image1)))))
                 (incf err (square (- (the fixnum (cl-gd:color-component 
                                                        :blue img-color :image image2))
                                      (the fixnum (cl-gd:color-component 
                                                        :blue target-color :image image1)))))
                 (incf err (square (- (the fixnum (cl-gd:color-component 
                                                        :green img-color :image image2))
                                      (the fixnum (cl-gd:color-component 
                                                        :green target-color :image image1))))))))
   err))

First, the timing with the standard CL-GD (SBCL 1.0.28, CL-GD 0.5.6, UFFI-1.6.1, 3GHz Intel Core2 Duo, Linux):

EVO-LISA> (progn
 (defparameter *image1* (cl-gd:create-image-from-file "023.JPG"))
 (defparameter *image2* (cl-gd:create-image-from-file "023.JPG"))
 (time (compare-images *image1* *image2*)))

height1=1024, height2=768
Evaluation took:
 1.684 seconds of real time
 1.688105 seconds of total run time (1.688105 user, 0.000000 system)
 100.24% CPU
 5,061,067,326 processor cycles
 141,296 bytes consed
0

1.68 seconds to compare a pair of 1024x768 images is extremely slow. Since iterating over a couple of 1024x768 arrays and computing their difference in Common Lisp doesn't take anything close to 1.7 seconds (see below), it seemed that the bottleneck must be in the FFI.

For many applications, the FFI overheads don't really matter that much, because the goal of using the FFI is to use a library written in another language without having to reimplement the library in Lisp. On the other hand, another reason for using foreign libraries is to use fast libraries implemented in C. Since CFFI is being used for low-level libraries such as IOLib, where performance matters, I thought I'd try using the UFFI compatibility layer for CFFI.

All I did was edit the cl-gd.asd file so that the dependency on UFFI was replaced by a dependency on CFFI-UFFI-COMPAT (CL-GD already depends on CFFI-UFFI-COMPAT for CLISP and OpenMCL, but by default, depends on UFFI for SBCL).

The improvement was impressive: 0.336 seconds using CFFI-UFFI-COMPAT, compared to 1.68 seconds with UFFI -- a 5x speedup.

EVO-LISA> (progn
 (defparameter *image1* (cl-gd:create-image-from-file "023.JPG"))
 (defparameter *image2* (cl-gd:create-image-from-file "023.JPG"))
 (time (compare-images *image1* *image2*)))

height1=1024, height2=768
Evaluation took:
 0.336 seconds of real time
 0.336021 seconds of total run time (0.336021 user, 0.000000 system)
 100.00% CPU
 1,010,111,976 processor cycles
 21,712 bytes consed

Of course, there is still a massive overhead in accessing pixel information through the CL-GD API, which is partially due to the FFI layer, partially due to GD, and partially due to CL-GD. How large is this overhead? Here's a simple function, written in pure Common Lisp, which compares two arrays of colors. It is not even particularly optimized -- it doesn't even declare the types of image1 and image2.

(defstruct color
  (red 0 :type fixnum)
  (blue 0 :type fixnum)
  (green 0 :type fixnum))
(defun compare-2darrays (image1 image2 height width)
  "Compare two arrays of colors -- pure Common Lisp, no CL-GD calls"
  (declare (fixnum height width))
  (declare (optimize (debug 0) (speed 3) (safety 0)))
  (let ((err 0))
    (declare (fixnum err))
    (loop for y fixnum from 0 below height do
      (loop for x fixnum from 0 below width do
        (let ((color1 (aref image1 y x))
              (color2 (aref image2 y x)))
          (incf err (square (- (color-red color1) (color-red color2))))
          (incf err (square (- (color-blue color1) (color-blue color2))))
          (incf err (square (- (color-green color1) (color-green color2)))))))
    err))
The results?
CL-USER> (let ((img1 (make-array '(1024 768) :initial-element (make-color)))
               (img2 (make-array '(1024 768) :initial-element (make-color))))
           (time (compare-2darrays img1 img2 1024 768)))
Evaluation took:
  0.048 seconds of real time
  0.048003 seconds of total run time (0.048003 user, 0.000000 system)
  100.00% CPU
  143,562,681 processor cycles
  0 bytes consed  

0.048 seconds to perform basically the same operations (iterating over both arrays and computing the square of their difference), compared to 0.336 seconds for the code which accesses CL-GD via CFFI.

I considered the overhead within CFFI/UFFI, but there's another source of inefficiency, which is the CL-GD binding code. As clearly stated in the docs, CL-GD wasn't designed for speed, and in order to provide a nice API, there is some overhead in the binding code. I'm abusing CL-GD by using it for repetitive pixel-level access, because GD is an image creation library and not an image processing library, but I needed to load and create some GIF files, and the first library I ran into was CL-GD.

Here is another version of the CL-GD based code where I bypass the officially exported CL-GD API and use some internal functions (I know some people don't like it, but being able to access internal symbols with "::" is a great feature of Common Lisp, imho).

(defun compare-images-raw (image1 image2)
 "Compare two images -- use internal functions to eliminate overhead
added by CL-GD API"
 (declare (optimize (debug 0) (speed 3)(safety 0)))
 (let ((err 0)
       (height (cl-gd:image-height image1))
       (width (cl-gd:image-width image1))
       (img1 (cl-gd::img image1))
       (img2 (cl-gd::img image2)))
   (declare (fixnum err height width))
   (loop for y fixnum from 0 below height do
         (loop for x fixnum from 0 below width do
               (let ((img2-color (cl-gd:get-pixel y x :image image2))
                     (img1-color (cl-gd:get-pixel y x :image image1)))
                 (incf err (square (- (the fixnum(cl-gd::gd-image-get-red img2 img2-color))
                                      (the fixnum(cl-gd::gd-image-get-red img1 img1-color)))))
                 (incf err (square (- (the fixnum(cl-gd::gd-image-get-green img2 img2-color))
                                      (the fixnum(cl-gd::gd-image-get-green img1 img1-color)))))
                 (incf err (square (- (the fixnum(cl-gd::gd-image-get-blue img2 img2-color))
                                      (the fixnum(cl-gd::gd-image-get-blue img1 img1-color))))))))
   err))
The results:
EVO-LISA> (progn
           (defparameter *image1* (cl-gd:create-image-from-file "023.JPG"))
           (defparameter *image2* (cl-gd:create-image-from-file "023.JPG"))
           (time (compare-images-raw *image1* *image2*)))

height1=1024, height2=768
Evaluation took:
 0.282 seconds of real time
 0.284018 seconds of total run time (0.284018 user, 0.000000 system)
 100.71% CPU
 845,516,097 processor cycles
 21,936 bytes consed
So we're down to 0.282 seconds from 0.336 seconds. Noticeable, but still a long way from the raw CL code. If this were an important enough example (it's not), I could try, for example, eliminating the FFI overhead by digging around the GD API and possibly modifying it to provide a quicker way to grab an entire image at once, instead of pixel by pixel.

I haven't looked deeper into why CFFI turned out to be 5x faster than UFFI for this example, and there may be examples where UFFI is faster. But the lesson remains the same: (1) the FFI layer in the wrapper libraries we use can be a performance bottleneck, and (2) CFFI-UFFI-COMPAT offers a quick way to swap in CFFI for UFFI.

I sent a summary of the above to Edi Weitz, author of CL-GD (and many other great libraries!), and he suggested that I send a patch, after more testing -- unfortunately, CL-GD failed a couple of tests in the test suite and I don't have time to dig around right now. Maybe a little later...

CFFI-Grovel Rocks!

This weekend, I was working on CL-MPI, my Common Lisp bindings for MPI, a standard message passing interface which allows distributed-memory parallel programming based on message passing. I had previously developed and tested CL-MPI with MPICH1.2, and I wanted to see how it worked with the latest version of MPICH2. Since the bindings are implemented with CFFI, I didn't expect any problems. However, the test suite started behaving in a very bizarre manner, crashing on some tests and freezing on others. I was stumped for a while, but the investigation was an eye-opener. I needed to use a struct defined as follows in MPICH 1.2:
typedef struct {
 int count;
 int MPI_SOURCE;
 int MPI_TAG;
 int MPI_ERROR;
#if (MPI_STATUS_SIZE > 4)
 int extra[MPI_STATUS_SIZE - 4];
#endif
} MPI_Status;
so I had declared the foreign struct in my binding source (line 65):
(cffi:defcstruct MPI_Status
 (count :int)
 (MPI_SOURCE :int)
 (MPI_TAG :int)
 (MPI_ERROR :int))
This specifies that MPI_Status is a C struct with int slots: the first 4 bytes (an int) are the count field, the next int is MPI_SOURCE, and so on -- a straightforward translation of the C struct definition. This worked great -- until I tried to use the bindings with MPICH2. Running the CL-MPI test suite resulted in mysterious crashes and freezes -- not a happy transition from MPICH to MPICH2! After thrashing around trying to isolate the problem, it turned out that in MPICH2, the MPI_Status struct definition had been changed to:
typedef struct MPI_Status {
   int count;
   int cancelled;
   int MPI_SOURCE;
   int MPI_TAG;
   int MPI_ERROR;
  
} MPI_Status;
Oops! No wonder everything got messed up -- note the new 'cancelled' field inserted after the count field. The fields of the struct no longer align with the defcstruct definition above!
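To spell out the mismatch (assuming 4-byte ints), here are the slot offsets the old defcstruct computes versus where MPICH2 actually stores the fields:

;; offset   old defcstruct expects   MPICH2 actually has
;;    0     count                    count
;;    4     MPI_SOURCE               cancelled
;;    8     MPI_TAG                  MPI_SOURCE
;;   12     MPI_ERROR                MPI_TAG
;;   16     --                       MPI_ERROR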

OK, so what to do? I could create two different versions of the defcstruct, one which works with MPICH1.2 and the other for MPICH2, and then use conditional compilation (e.g., #+mpich1 / #+mpich2, as sketched below). The problem with that solution is that if the C definition of MPI_Status changes yet again in a future MPICH release, then we are back to square one.
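A sketch of that rejected approach (the :mpich1 / :mpich2 feature names are hypothetical -- the build would have to push the right one onto *features*):

#+mpich1
(cffi:defcstruct MPI_Status
  (count :int)
  (MPI_SOURCE :int)
  (MPI_TAG :int)
  (MPI_ERROR :int))

#+mpich2
(cffi:defcstruct MPI_Status
  (count :int)
  (cancelled :int)   ; the field MPICH2 inserted after count
  (MPI_SOURCE :int)
  (MPI_TAG :int)
  (MPI_ERROR :int))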

In this case, the solution is to not directly use the defcstruct to create a foreign struct definition. Instead, CFFI-Grovel can be used to automagically generate the correct defcstruct. This is the file I created to use CFFI-Grovel for CL-MPI. Here (Line 47), I add the CFFI-Grovel declaration:

(cstruct MPI_Status "MPI_Status"
 (count "count" :type :int)
 (MPI_SOURCE "MPI_SOURCE" :type :int)
 (MPI_TAG "MPI_TAG" :type :int)
 (MPI_ERROR "MPI_ERROR" :type :int))
This states that I want to use a foreign (C) struct named "MPI_Status", giving it the Lisp name MPI_Status, and that I am interested in 4 integer fields: count, MPI_SOURCE, MPI_TAG, and MPI_ERROR. This declaration does not specify anything about the ordering of the slots in the C struct, and it does not say anything about the completeness of the mapping -- the C MPI_Status struct can contain other fields which are not mentioned in this CFFI-Grovel cstruct definition. In contrast, the original CFFI defcstruct used above is a more concrete declaration, which completely specifies the memory layout of the C struct we are mapping. The CFFI-Grovel based code does the right thing for both MPICH1.x and the latest MPICH2, and if future versions of MPICH2 change the MPI_Status struct definition by changing the field order or adding fields, no problem. Lesson learned: before directly declaring a foreign object with CFFI, consider whether CFFI-Grovel might be more appropriate. Using CFFI-Grovel is more robust to changes in the library we're binding to, and if I had done this to begin with, it would have saved me several hours of painful debugging.
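The payoff shows up in the code that reads the status: slots are accessed by name, so the Lisp side never hard-codes offsets. A minimal sketch (the foreign receive call that actually fills the status is elided here):

(cffi:with-foreign-object (status 'MPI_Status)
  ;; ... call the wrapper around MPI_Recv, passing STATUS ...
  (cffi:foreign-slot-value status 'MPI_Status 'MPI_SOURCE))  ; read the sender's rank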