Thursday, June 4, 2009

Parallel Mandelbrot computation in Lisp on a cluster

[quick announcement - CL-MPI 0.2.2 is available - this contains today's example code, and adds the with-mpi macro mentioned in the previous post, which was missing in 0.2.1]

This post describes a very simple parallelization of Mandelbrot set computation in Common Lisp using CL-MPI, with results on a multicore machine as well as a cluster using 50 cores.

Recently, there was a Reddit discussion about a Mandelbrot set benchmark. SBCL developer Nikodemus Siivola posted a Common Lisp version on his blog. As it happens, the Mandelbrot set is a nice, simple computation that can be easily parallelized. I'm not jumping into the benchmark game / language speed debate with this post [I'll save that for later ;-)]; I'm just using it as a simple example of parallelization with MPI.

The original sequential code by Nikodemus is here. Here's the parallelized version:

(declaim (optimize (speed 3) (debug 0)(safety 0)))

(defconstant +resolution+ 5000)
(defconstant +iterations+ 100)

(declaim (inline iters))

(defun iters (max xc yc)
  (declare (double-float xc yc))
  (let ((x xc)
        (y yc))
    (declare (double-float x y)
             (fixnum max))
    (loop for count from 0 below max
          do (when (>= (+ (* x x) (* y y)) 4.0d0)
               (return-from iters count))
             (let ((tmp (+ (- (* x x) (* y y)) xc)))
               (setf y (+ (* 2.0d0 x y) yc)
                     x tmp)))
    max))


(defun main ()
  "straightforward parallelization which partitions the work into equal regions,
   and each region is assigned to a process."
  (let* ((max (truncate +resolution+ 2))
         (min (- max))
         (mul (/ 2.0d0 max))
         (local-count 0))
    (declare (fixnum local-count))
    (mpi:with-mpi
      (let* ((start-time (mpi:mpi-wtime))
             (num-procs (mpi:mpi-comm-size))
             (my-rank (mpi:mpi-comm-rank))
             (rows-per-proc (/ +resolution+ num-procs))
             (local-min (+ min (* my-rank rows-per-proc)))
             (local-max (1- (+ local-min rows-per-proc))))
        (declare (fixnum local-min local-max my-rank num-procs))
        (mpi:formatp t "local-min=~a, local-max=~a~%" local-min local-max)
        (loop for i from local-min upto local-max
              do (loop for j from min upto max
                       do (incf local-count (iters +iterations+ (* mul i) (* mul j)))))
        (mpi:formatp t "local-count = ~d computed after time ~,4f seconds~%" 
                     local-count (- (mpi:mpi-wtime) start-time))
        ;; aggregate local counts at rank 0
        (cond ((> my-rank 0) ; send local result to rank 0
               (mpi:mpi-send-auto local-count 0))
              (t ; my-rank is 0. Receive and aggregate local results.
               (loop with count fixnum = local-count
                     for rank from 1 below num-procs
                     do (incf count (mpi:mpi-receive-auto rank))
                     finally (format t "runtime=~,4f seconds, result ~d~%" 
                                      (- (mpi:mpi-wtime) start-time) count))))))))

The iters function is identical to the sequential code, but the main
function has been modified for parallelization.

I used the simplest possible parallelization strategy: partition the +resolution+ x +resolution+ grid on which the Mandelbrot set is computed into num-procs regions, and assign each region to a processor. As described in the previous post, MPI executes this program on all processors, but each processor has a unique rank (ID). For each process, the variables local-min and local-max determine the block of work to be performed (the rows assigned to that process), and iters is called for every point in the assigned region. After the computations are completed, the local results (local-count) are sent to the rank 0 process and aggregated into the final result (count). See the previous post for an explanation of point-to-point communication using mpi-send-auto and mpi-receive-auto.
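
To make the partitioning concrete, here is a small worked sketch (not part of the example code) of the same row-range arithmetic, evaluated by hand for +resolution+ = 5000 and 4 processes; the resulting ranges match the per-rank output shown below.

(let* ((resolution 5000)
       (num-procs 4)
       (max (truncate resolution 2))             ; 2500
       (min (- max))                             ; -2500
       (rows-per-proc (/ resolution num-procs))) ; 1250
  ;; rank r is assigned rows [min + r*rows-per-proc, min + (r+1)*rows-per-proc - 1]
  (loop for rank from 0 below num-procs
        collect (let ((local-min (+ min (* rank rows-per-proc))))
                  (list rank local-min (1- (+ local-min rows-per-proc))))))
;; => ((0 -2500 -1251) (1 -1250 -1) (2 0 1249) (3 1250 2499))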

Here's the output of running this (a 5000 x 5000 region, 100 iterations per point) on 1 core of a quad-core desktop machine. Note that I'm using (mpi-wtime), a wrapper for MPI_Wtime, which is a clock in the MPI runtime environment, so this only measures the code within the with-mpi macro (the time to start up SBCL+MPI is not included).

alexfs04@tesla1:~/cl-mpi/examples$ mpirun -np 1 /usr/local/bin/sbcl --load "mandel-bench-1" --eval "(progn (main) (sb-ext:quit))"
[0 :1244124821.1584]: local-min=-2500, local-max=2499
[0 :1244124824.3105]: local-count = 290068052 computed after time 3.1526 seconds
runtime=3.1527 seconds, result 290068052
Now, here's the output using all 4 cores on the same machine (Core2, 2GHz):
alexfs04@tesla1:~/cl-mpi/examples$ mpirun -np 4 /usr/local/bin/sbcl --load "mandel-bench-1" --eval "(progn (main) (sb-ext:quit))"
[1 :1244125086.7273]: local-min=-1250, local-max=-1
[2 :1244125086.7327]: local-min=0, local-max=1249
[3 :1244125086.7390]: local-min=1250, local-max=2499
[0 :1244125086.7478]: local-min=-2500, local-max=-1251
[3 :1244125086.8080]: local-count = 3840768 computed after time 0.0699 seconds
[0 :1244125087.1473]: local-count = 31518526 computed after time 0.4001 seconds
[2 :1244125087.5538]: local-count = 78775348 computed after time 0.8220 seconds
[1 :1244125088.5280]: local-count = 175933410 computed after time 1.7702 seconds
runtime=1.7811 seconds, result 290068052
The speedup compared to 1 core is 3.1527/1.7811 = 1.77. The reason for the inefficiency is clear: there is a huge variance in how long the 4 local-count computations take. While proc 3 finished in 0.0699 seconds, proc 1 took 1.7702 seconds! This is because the number of times iters goes through its loop depends on the particular point. We could certainly do better by devising a better parallelization strategy, and that's a topic for a future post.

OK, so what's the big deal? We could have gotten the same 4-core results with thread-based parallelization using the same algorithm, so why bother with MPI?

The reason, of course, is that we can run this exact same code on any cluster of machines supporting MPI, so unlike threads, we're not limited to a single machine. I am fortunate enough to have access to a huge cluster, so here are the results using 50 cores (Opteron, 2.4GHz).
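
The invocation is essentially the same as the 4-core run above, just with a larger process count. On this cluster the job goes through a batch scheduler (which presumably produced the job bookkeeping lines at the end of the output), but the underlying command looks something like:

mpirun -np 50 /usr/local/bin/sbcl --load "mandel-bench-1" --eval "(progn (main) (sb-ext:quit))"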

[snip...]
[24 :0.1349]: local-min=-100, local-max=-1
[23 :0.1350]: local-min=-200, local-max=-101
[26 :0.1351]: local-min=100, local-max=199
[47 :0.1352]: local-min=2200, local-max=2299
[49 :0.1353]: local-min=2400, local-max=2499
[48 :0.1354]: local-min=2300, local-max=2399
[34 :0.1349]: local-min=900, local-max=999
[49 :0.1410]: local-count = 94380 computed after time 0.0071 seconds
[48 :0.1418]: local-count = 169998 computed after time 0.0079 seconds
[47 :0.1419]: local-count = 217754 computed after time 0.0080 seconds
[snip a bunch of output...]
[18 :0.2935]: local-count = 15838486 computed after time 0.1594 seconds
[20 :0.3024]: local-count = 16782526 computed after time 0.1683 seconds
[21 :0.3091]: local-count = 17468466 computed after time 0.1750 seconds
[24 :0.3328]: local-count = 20181214 computed after time 0.1992 seconds
[23 :0.3574]: local-count = 22819862 computed after time 0.2238 seconds
[22 :0.3400]: local-count = 20718664 computed after time 0.2060 seconds
runtime=0.2286 seconds, result 290068052

Cleaning up all processes ...
done.
[Ending Job 4764466]
EXIT_CODE:0
There are other alternatives for running parallel Lisp programs on a cluster, such as Fare's library, and you can always build your own parallelization infrastructure using sockets. However, on most large-scale shared clusters with many users, such as the one I use, these are not feasible because users can't directly address or access individual compute nodes (among other things, that would make task/job scheduling a nightmare). MPI, on the other hand, is the standard API available in such environments.

In future posts, I'll describe how to parallelize the Mandelbrot computation more efficiently to get better speedups. [edit: removed incorrect statement that MPI_Wtime is synchronized - MPI_Wtime is not synch'd unless MPI_WTIME_IS_GLOBAL is set] [edit: fixed mangled code indentation.]

3 comments:

  1. Interesting work! I've programmed in MPI before so I'm familiar with the pain of working with it in low-level languages like C -- it's nice to see the Lispyness start to overtake the C-ness ;-)

    I'm curious -- if you had a lot of time to hack, how you would go about wrapping the MPI layer with something even more "lispy"?

    ReplyDelete
  2. I just wanted to say that I am pretty excited to see this get implemented. There are definitely some inconveniences; for example, it would be nice to be able to just "MPI" a closure.
    Maybe one could start the mpirun instances of the Lisp from trivial-shell or some such, then have the master node report back through a socket connection (or just through the text returned from the shell).

    ReplyDelete
  3. HilbertAstronaut and Anonymous - Hi, thanks for the comments. You're both pointing out the same basic flaw, which is that CL-MPI currently isn't as "Lispy" as it should be (both at the API level and with regard to usage). I agree, and this is an area of ongoing work.

    ReplyDelete