Thursday, June 4, 2009

Parallel Mandelbrot computation in Lisp on a cluster

[quick announcement - CL-MPI 0.2.2 is available - this contains today's example code, and adds the with-mpi macro mentioned in the previous post, which was missing in 0.2.1]

This post describes a very simple parallelization of Mandelbrot set computation in Common Lisp using CL-MPI, with results on a multicore machine as well as a cluster using 50 cores.

Recently, there was a Reddit discussion about a Mandelbrot set benchmark. SBCL developer Nikodemus Siivola posted a Common Lisp version on his blog. By coincidence, the Mandelbrot set is a nice and simple computation which can be easily parallelized. I'm not jumping into the benchmark game / language speed debate with this post [I'll save that for later ;-)]; I'm just using this as a simple example of parallelization using MPI.

The original sequential code by Nikodemus is here. Here's the parallelized version:

(declaim (optimize (speed 3) (debug 0) (safety 0)))

(defconstant +resolution+ 5000)
(defconstant +iterations+ 100)

(declaim (inline iters))

(defun iters (max xc yc)
  (declare (double-float xc yc))
  (let ((x xc)
        (y yc))
    (declare (double-float x y)
             (fixnum max))
    (loop for count from 0 below max
          do (when (>= (+ (* x x) (* y y)) 4.0d0)
               (return-from iters count))
             (let ((tmp (+ (- (* x x) (* y y)) xc)))
               (setf y (+ (* 2.0d0 x y) yc)
                     x tmp)))
    max))


(defun main ()
  "Straightforward parallelization: partition the work into equal regions
   and assign each region to a process."
  (let* ((max (truncate +resolution+ 2))
         (min (- max))
         (mul (/ 2.0d0 max))
         (local-count 0))
    (declare (fixnum local-count))
    (mpi:with-mpi
      (let* ((start-time (mpi:mpi-wtime))
             (num-procs (mpi:mpi-comm-size))
             (my-rank (mpi:mpi-comm-rank))
             (rows-per-proc (/ +resolution+ num-procs))
             (local-min (+ min (* my-rank rows-per-proc)))
             (local-max (1- (+ local-min rows-per-proc))))
        (declare (fixnum local-min local-max my-rank num-procs))
        (mpi:formatp t "local-min=~a, local-max=~a~%" local-min local-max)
        (loop for i from local-min upto local-max
              do (loop for j from min upto max
                       do (incf local-count (iters +iterations+ (* mul i) (* mul j)))))
        (mpi:formatp t "local-count = ~d computed after time ~,4f seconds~%" 
                     local-count (- (mpi:mpi-wtime) start-time))
        ;; aggregate local counts at rank 0
        (cond ((> my-rank 0) ; send local result to rank 0
               (mpi:mpi-send-auto local-count 0))
              (t ;my-rank is  0. Receive and aggregate local results.
               (loop with count fixnum = local-count
                     for rank from 1 below num-procs
                     do (incf count (mpi:mpi-receive-auto rank))
                     finally (format t "runtime=~,4f seconds, result ~d~%" 
                                      (- (mpi:mpi-wtime) start-time) count))))))))

The iters function is identical to the sequential code, but the main
function has been modified for parallelization.

I used the simplest possible parallelization strategy, which is to partition the +resolution+ x +resolution+ grid where the Mandelbrot set is computed into num-procs regions, assigning each region to a processor. As described in the previous post, MPI will execute this program on all processors, but each processor has a unique rank (ID). For each process, the variables local-min and local-max determine the block of work to be performed (the rows assigned to the process), and iters is called for every point in the assigned region. After the computations are completed, the local results (local-count) are sent to the rank 0 process and aggregated into the final result (count). See the previous post for an explanation of point-to-point communication using mpi-send-auto and mpi-receive-auto.
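For concreteness, the block-partitioning arithmetic can be factored out into a stand-alone function. This is a sketch for illustration only; row-range is a hypothetical helper, not part of CL-MPI or the code above:

```lisp
;; Sketch of the row-partitioning arithmetic from main, factored out for
;; illustration. ROW-RANGE is a hypothetical helper, not part of CL-MPI.
(defun row-range (rank num-procs resolution)
  "Return the first and last (inclusive) rows assigned to RANK."
  (let* ((max (truncate resolution 2))
         (min (- max))
         (rows-per-proc (/ resolution num-procs))
         (local-min (+ min (* rank rows-per-proc))))
    (values local-min (1- (+ local-min rows-per-proc)))))

;; With 4 processes and resolution 5000, rank 0 gets rows -2500..-1251
;; and rank 3 gets rows 1250..2499, matching the output below.
```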

Here's the output of running this (a 5000 x 5000 region, 100 iterations per point) on 1 core of a quad-core desktop machine. Note that I'm using (mpi-wtime), a wrapper for MPI_Wtime, a clock provided by the MPI runtime environment, so this measures only the code within the with-mpi macro (the time to start up SBCL+MPI is not included).

alexfs04@tesla1:~/cl-mpi/examples$ mpirun -np 1 /usr/local/bin/sbcl --load "mandel-bench-1" --eval "(progn (main) (sb-ext:quit))"
[0 :1244124821.1584]: local-min=-2500, local-max=2499
[0 :1244124824.3105]: local-count = 290068052 computed after time 3.1526 seconds
runtime=3.1527 seconds, result 290068052
Now, here's the output using all 4 cores on the same machine (Core2, 2GHz):
alexfs04@tesla1:~/cl-mpi/examples$ mpirun -np 4 /usr/local/bin/sbcl --load "mandel-bench-1" --eval "(progn (main) (sb-ext:quit))"
[1 :1244125086.7273]: local-min=-1250, local-max=-1
[2 :1244125086.7327]: local-min=0, local-max=1249
[3 :1244125086.7390]: local-min=1250, local-max=2499
[0 :1244125086.7478]: local-min=-2500, local-max=-1251
[3 :1244125086.8080]: local-count = 3840768 computed after time 0.0699 seconds
[0 :1244125087.1473]: local-count = 31518526 computed after time 0.4001 seconds
[2 :1244125087.5538]: local-count = 78775348 computed after time 0.8220 seconds
[1 :1244125088.5280]: local-count = 175933410 computed after time 1.7702 seconds
runtime=1.7811 seconds, result 290068052
The speedup compared to 1 core is 3.1527/1.7811 = 1.77. The reason for the inefficiency is clear: there is a huge variance in how long the 4 local-count computations take. While proc 3 finished in 0.0699 seconds, proc 1 took 1.7702 seconds! This is because the number of times iters goes through its loop depends on the particular point. We could certainly do better by devising a better parallelization strategy, and that's a topic for a future post.
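The imbalance is easy to quantify from the per-process counts printed above (plain arithmetic, no MPI required):

```lisp
;; Rank 1's share of the total iteration count, using the numbers from
;; the 4-core run above:
(/ 175933410 290068052.0)  ; => approx. 0.61
;; With one process holding about 61% of the work, this fixed block
;; partitioning cannot approach the ideal 4x speedup on 4 cores.
```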

OK, so what's the big deal? We could have gotten the same 4-core results with thread-based parallelization using the same algorithm, so why bother with MPI?

The reason, of course, is that we can run this exact same code on any cluster of machines supporting MPI, so unlike threads, we're not limited to a single machine. I am fortunate enough to have access to a huge cluster, so here are the results using 50 cores (Opteron, 2.4GHz).

[snip...]
[24 :0.1349]: local-min=-100, local-max=-1
[23 :0.1350]: local-min=-200, local-max=-101
[26 :0.1351]: local-min=100, local-max=199
[47 :0.1352]: local-min=2200, local-max=2299
[49 :0.1353]: local-min=2400, local-max=2499
[48 :0.1354]: local-min=2300, local-max=2399
[34 :0.1349]: local-min=900, local-max=999
[49 :0.1410]: local-count = 94380 computed after time 0.0071 seconds
[48 :0.1418]: local-count = 169998 computed after time 0.0079 seconds
[47 :0.1419]: local-count = 217754 computed after time 0.0080 seconds
[snip a bunch of output...]
[18 :0.2935]: local-count = 15838486 computed after time 0.1594 seconds
[20 :0.3024]: local-count = 16782526 computed after time 0.1683 seconds
[21 :0.3091]: local-count = 17468466 computed after time 0.1750 seconds
[24 :0.3328]: local-count = 20181214 computed after time 0.1992 seconds
[23 :0.3574]: local-count = 22819862 computed after time 0.2238 seconds
[22 :0.3400]: local-count = 20718664 computed after time 0.2060 seconds
runtime=0.2286 seconds, result 290068052

Cleaning up all processes ...
done.
[Ending Job 4764466]
EXIT_CODE:0
There are other alternatives for running parallel Lisp programs on a cluster, such as Fare's library, and you can always build your own parallelization infrastructure using sockets. However, on most large-scale, shared clusters used by many users, such as the one I use, these are not feasible because users can't directly address/access individual compute nodes (among other things, that would make task/job scheduling a nightmare); MPI is the standard API available in such environments.

In future posts, I'll describe how to parallelize the Mandelbrot computation more efficiently to get better speedups. [edit: removed incorrect statement that MPI_Wtime is synchronized - MPI_Wtime is not synch'd unless MPI_WTIME_IS_GLOBAL is set] [edit: fixed mangled code indentation.]

Tuesday, June 2, 2009

A message-passing "Hello World" in CL-MPI

[Edit: added paragraph breaks]

Let's start the tour of CL-MPI with a parallel "Hello World" that demonstrates point-to-point communication among processes, and an introduction to the MPI runtime. This post assumes you know little or nothing about MPI, so the pace may be a bit slow for people who already know MPI. I promise to pick up the pace in future posts, so stay tuned!

1  #+sbcl(require 'asdf)
2  (eval-when (:compile-toplevel :load-toplevel :execute)
3   (asdf:operate 'asdf:load-op 'cl-mpi))
4
5  (defun mpi-hello-world ()
6    "Each process other than the root (rank 0) sends a message to process 0"
7    (mpi::with-mpi  ;macro which initializes and finalizes  MPI
8       (let ((tag 0))
9         (cond ((/= 0 (mpi:mpi-comm-rank))
10               (let ((msg (format nil "Greetings from process ~a!" (mpi:mpi-comm-rank))))
11                 ;;send msg to proc 0
12                 (mpi:mpi-send-auto msg 0 :tag tag)))
13              (t ; rank is 0
14               (loop for source from 1 below (mpi:mpi-comm-size) do
15                     ;; receive and print message from each processor
16                     (let ((message (mpi:mpi-receive-auto  source :tag tag)))
17                       (format t "~a~%" message))))))))
Before describing this code, a quick overview of the Single-Program, Multiple-Data (SPMD) execution model used by MPI: in the SPMD model, all processes execute the same program independently. If we want to run a program on 4 processors using MPI, then MPI spawns 4 separate instances of the program, each as a separate process. Each process has a corresponding identifier, which in MPI terminology is called its rank.

Let's try to run our hello-world from the shell (keep reading to see why it has to be from the shell; if it doesn't become obvious, see the footnotes).

If this was an ordinary, sequential SBCL program, we'd execute it as:

alexf@tesla1:~/cl-mpi/examples$ /usr/local/bin/sbcl --noinform --load "mpi-hello-world" --eval "(progn (mpi-hello-world)(sb-ext:quit))"
Using the MPICH 1.2 implementation of MPI, we do the following (with resulting output):
alexf@tesla1:~/cl-mpi/examples$ mpirun -np 4 /usr/local/bin/sbcl --noinform --load "mpi-hello-world" --eval "(progn (mpi-hello-world)(sb-ext:quit))"

Greetings from process 1!
Greetings from process 2!
Greetings from process 3!
alexf@tesla1:~/cl-mpi/examples$ 
As you can see, all we did was add "mpirun -np 4" to the previous command line. The MPI runtime environment is started with an executable, usually called mpirun or mpiexec (here, mpirun). The "-np 4" option specifies that 4 processes are to be started.

We can see what goes on by adding a (break) between lines 8 and 9 in the code. Now, if we enter the same command line, we'll get a break and a REPL. We can then run ps (from another window) and see that there are in fact 4 SBCL processes running (spawned by mpirun), all started with the same command-line parameters.1

alexf@tesla1:~/cl-mpi$ ps -aef | grep sbcl
alexf 25743 25740  0 17:58 ?        00:00:22 /usr/local/bin/sbcl
alexf 26978 26975  0 22:46 pts/3    00:00:00 /bin/sh -c mpirun -np 4 /usr/local/bin/sbcl --load "mpi-hello-world" --eval "(progn (mpi-hello-world)(sb-ext:quit))"
alexf 26979 26978  0 22:46 pts/3    00:00:00 /bin/sh /usr/bin/mpirun -np 4 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27005 26979 14 22:46 pts/3    00:00:00 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27006 27005  0 22:46 pts/3    00:00:00 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27007 27005  0 22:46 pts/3    00:00:00 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27008 27005  0 22:46 pts/3    00:00:00 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27011 28544  0 22:46 pts/6    00:00:00 grep sbcl
The output shown above from the hello-world program shows that processes 1-3 sent a greeting message (to process 0), and these were displayed to standard output. Now, let's look at the code: lines 1-3 are boilerplate for loading the CL-MPI library. The macro with-mpi (line 7) executes its body in an MPI environment; the MPI_Init and MPI_Finalize functions are called before and after the body.

The interesting behavior starts on line 9. Up to this point, all processes have been executing the same code (all have loaded the cl-mpi library, started executing the mpi-hello-world function, entered the body of the with-mpi macro, and initialized tag to 0). Recall that each process has an associated rank; rank 0 is usually referred to as the "root". Line 9 uses mpi-comm-rank, a wrapper for MPI_Comm_rank, to get the process rank, and branches on it, so lines 10-12 are executed on all processes except process 0, and lines 14-17 are executed only on process 0.

Let's consider what happens on processes with rank > 0. Line 10 creates a message which contains the process rank, and line 12 calls mpi-send-auto to perform a blocking send which transmits the message to process 0. A blocking send is a point-to-point operation that does not return until the message buffer can safely be reused; depending on the implementation and message size, this may mean waiting until the receiver has received the message (MPI and CL-MPI also support non-blocking, asynchronous communications).

Meanwhile, the process with rank 0 executes the loop in line 14, where source iterates from 1 below (mpi-comm-size), a wrapper for MPI_Comm_size, which returns the total number of processes2. At each iteration, line 16 performs a blocking receive using mpi-receive-auto, which waits for a message from the given source, and line 17 prints the received message.

Note that lines 10-12 are executed in parallel by the processes with rank > 0, and the order in which these processes initiate their sends is not specified. However, on process 0, the receiver, lines 14-17 execute sequentially: line 16 first waits for a message from process 1 before continuing, regardless of the order in which messages were actually sent. Thus, the structure of lines 14-17 imposes a serialization on the order in which messages are processed and printed by process 0, so the output above shows the message from process 1 first, followed by process 2, and finally process 3. (We do not have to process the messages in this particular order; for example, MPI_Probe or MPI_Waitany can be used to process messages in the order in which they arrive.)

The send/receive functions mpi-send-auto and mpi-receive-auto are wrappers around the MPI_Send and MPI_Recv functions, but they do a bit more work than their C counterparts. Both MPI_Send and MPI_Recv require that the type and size of the objects be fully specified. The "-auto" part of mpi-send-auto and mpi-receive-auto implements a simple protocol by which any object can be sent and received without specifying its type or size. This is convenient in cases where communication efficiency is not critical. CL-MPI also provides lower-level functions which map more directly to MPI_Send and MPI_Recv so that you can squeeze out performance, but even for applications where every bit of performance matters, it's nice to be able to use the -auto versions during the development phase (you know, the whole "dynamic language" thing...).
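To give a flavor of what such a protocol can look like, here is one plausible way to serialize arbitrary Lisp objects into strings before a send. This is a sketch only -- it is not CL-MPI's actual implementation, and encode-object/decode-object are hypothetical names:

```lisp
;; One plausible serialization scheme for an "-auto"-style protocol.
;; NOT CL-MPI's actual implementation; encode-object and decode-object
;; are hypothetical helpers shown for illustration.
(defun encode-object (obj)
  "Serialize OBJ into a string that READ can reconstruct."
  (with-standard-io-syntax
    (prin1-to-string obj)))

(defun decode-object (string)
  "Reconstruct the object serialized by ENCODE-OBJECT."
  (with-standard-io-syntax
    (values (read-from-string string))))

;; (decode-object (encode-object '(1 "two" 3.0))) => (1 "two" 3.0)
```

The sender would then transmit the string (whose length and element type are known), and the receiver would decode it back into an object, which is roughly why no type or size specification is needed at the call site.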

So that was a simple example of point-to-point communications. Coming up: more complicated examples that show what parallel programming in CL-MPI is actually good for.


1 Warning -- if you actually try this, you'll have to kill all of the SBCL processes. As you'd expect when multiple Lisp processes contend for standard in/out, I/O in MPI gets messy when multiple REPLs are invoked. If you were puzzled by all this talk of running things from the command line in Lisp, it should be clear now: the multi-process MPI runtime and the usual interactive-REPL style of Lisp development don't mix too well. There can be some issues with debugging thread-based parallelism too, but with process-based parallelism such as MPI, the problem is worse. On the other hand, the REPLs do work in a more or less reasonable way; you just have to be careful how you use them (maybe I'll write a post on parallel debugging at some point). Also, the debugging situation with CL-MPI is no worse than with any other MPI language binding -- this is the price to be paid for a framework that allows parallelism across different machines.

2 MPI Communicators (what "Comm" stands for) are more complex than that, but let's ignore that for now.