[edit: this post uses the `with-mpi` macro mentioned in the previous post, which was missing in CL-MPI 0.2.1]

This post describes a very simple parallelization of Mandelbrot set computation in Common Lisp using CL-MPI, with results on a multicore machine as well as a cluster using 50 cores.

Recently, there was a Reddit discussion about a Mandelbrot set benchmark. SBCL developer Nikodemus Siivola posted a Common Lisp version on his blog. By coincidence, the Mandelbrot set is a nice, simple computation that is easy to parallelize. I am **not** jumping into the benchmark game / language speed debate with this post [I'll save that for later ;-) ]. I'm just using this as a simple example of parallelization using MPI.

The original sequential code by Nikodemus is here. Here's the parallelized version:

```lisp
(declaim (optimize (speed 3) (debug 0) (safety 0)))

(defconstant +resolution+ 5000)
(defconstant +iterations+ 100)

(declaim (inline iters))
(defun iters (max xc yc)
  (declare (double-float xc yc))
  (let ((x xc) (y yc))
    (declare (double-float x y) (fixnum max))
    (loop for count from 0 below max
          do (when (>= (+ (* x x) (* y y)) 4.0d0)
               (return-from iters count))
             (let ((tmp (+ (- (* x x) (* y y)) xc)))
               (setf y (+ (* 2.0d0 x y) yc)
                     x tmp)))
    max))

(defun main ()
  "straightforward parallelization which partitions the work into equal
regions, and each region is assigned to a process."
  (let* ((max (truncate +resolution+ 2))
         (min (- max))
         (mul (/ 2.0d0 max))
         (local-count 0))
    (declare (fixnum local-count))
    (mpi:with-mpi
      (let* ((start-time (mpi:mpi-wtime))
             (num-procs (mpi:mpi-comm-size))
             (my-rank (mpi:mpi-comm-rank))
             (rows-per-proc (/ +resolution+ num-procs))
             (local-min (+ min (* my-rank rows-per-proc)))
             (local-max (1- (+ local-min rows-per-proc))))
        (declare (fixnum local-min local-max my-rank num-procs))
        (mpi:formatp t "local-min=~a, local-max=~a~%" local-min local-max)
        (loop for i from local-min upto local-max
              do (loop for j from min upto max
                       do (incf local-count
                                (iters +iterations+ (* mul i) (* mul j)))))
        (mpi:formatp t "local-count = ~d computed after time ~,4f seconds~%"
                     local-count (- (mpi:mpi-wtime) start-time))
        ;; aggregate local counts at rank 0
        (cond ((> my-rank 0)   ; send local result to rank 0
               (mpi:mpi-send-auto local-count 0))
              (t   ; my-rank is 0. Receive and aggregate local results.
               (loop with count fixnum = local-count
                     for rank from 1 below num-procs
                     do (incf count (mpi:mpi-receive-auto rank))
                     finally (format t "runtime=~,4f seconds, result ~d~%"
                                     (- (mpi:mpi-wtime) start-time)
                                     count))))))))
```

The `iters` function is identical to the sequential code, but the `main` function now contains the MPI parallelization logic.

I used the simplest possible parallelization strategy, which is to partition the `+resolution+ x +resolution+` grid where the Mandelbrot set is computed into `num-procs` regions, assigning each region to a processor. As described in the previous post, MPI will execute this program on all processors, but each processor has a unique *rank* (ID). For each process, the variables `local-min` and `local-max` determine the block of work to be performed (the rows assigned to the process), and `iters` is called for every point in the assigned region. After the computations are completed, the local results (`local-count`) are sent to the rank 0 process and aggregated into the final result (`count`). See the previous post for an explanation of point-to-point communication using `mpi-send-auto` and `mpi-receive-auto`.
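To make the partition arithmetic concrete, here is a small sketch that computes the row range for a given rank, using the same arithmetic as `main` above. The `row-range` helper is my own illustration, not part of the post or of CL-MPI, and it assumes `+resolution+` divides evenly by the number of processes:

```lisp
;; Sketch: the row range assigned to a given rank, using the same arithmetic
;; as MAIN above. ROW-RANGE is a hypothetical helper for illustration only.
;; Assumes +resolution+ is evenly divisible by num-procs.
(defun row-range (my-rank num-procs)
  (let* ((max (truncate +resolution+ 2))
         (min (- max))
         (rows-per-proc (/ +resolution+ num-procs))
         (local-min (+ min (* my-rank rows-per-proc))))
    (values local-min (1- (+ local-min rows-per-proc)))))

;; With +resolution+ = 5000 and 4 processes:
;;   (row-range 0 4) => -2500, -1251
;;   (row-range 1 4) => -1250, -1
;;   (row-range 3 4) =>  1250,  2499
```

These ranges match the `local-min`/`local-max` values printed in the 4-core run below.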

Here's the output of running this (a 5000 x 5000 region, 100 iterations per point) on 1 core of a quad-core desktop machine. Note that I'm using `(mpi-wtime)`, a wrapper for `MPI_Wtime`, which is a clock in the MPI runtime environment, so this only measures the code within the `with-mpi` macro (time to start up SBCL+MPI is not included).

```
alexfs04@tesla1:~/cl-mpi/examples$ mpirun -np 1 /usr/local/bin/sbcl --load "mandel-bench-1" --eval "(progn (main) (sb-ext:quit))"
[0 :1244124821.1584]: local-min=-2500, local-max=2499
[0 :1244124824.3105]: local-count = 290068052 computed after time 3.1526 seconds
runtime=3.1527 seconds, result 290068052
```

Now, here's the output using all 4 cores on the same machine (Core2, 2GHz):

```
alexfs04@tesla1:~/cl-mpi/examples$ mpirun -np 4 /usr/local/bin/sbcl --load "mandel-bench-1" --eval "(progn (main) (sb-ext:quit))"
[1 :1244125086.7273]: local-min=-1250, local-max=-1
[2 :1244125086.7327]: local-min=0, local-max=1249
[3 :1244125086.7390]: local-min=1250, local-max=2499
[0 :1244125086.7478]: local-min=-2500, local-max=-1251
[3 :1244125086.8080]: local-count = 3840768 computed after time 0.0699 seconds
[0 :1244125087.1473]: local-count = 31518526 computed after time 0.4001 seconds
[2 :1244125087.5538]: local-count = 78775348 computed after time 0.8220 seconds
[1 :1244125088.5280]: local-count = 175933410 computed after time 1.7702 seconds
runtime=1.7811 seconds, result 290068052
```

The speedup compared to 1 core is 3.1527/1.7811 = 1.77. The reason for the inefficiency is clear: there is a huge variance in how long it takes to complete the four local-count computations. While proc 3 finished in 0.0699 seconds, proc 1 took 1.7702 seconds! This is because the number of times `iters` goes through its loop depends on each particular point. We could certainly do better by devising a better parallelization strategy, and that's a topic for a future post.
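As a rough illustration of one possible improvement (my sketch, not necessarily the strategy that future post will describe), the rows could be assigned cyclically rather than in contiguous blocks, so cheap and expensive rows are spread across all processes. This would replace the double loop inside `main`, and `local-min`/`local-max` would no longer be needed:

```lisp
;; Sketch (not from the post): cyclic row assignment. Each process takes
;; every num-procs-th row starting at its rank, so the expensive rows near
;; the middle of the set are spread across all processes.
(loop for i from (+ min my-rank) below max by num-procs
      do (loop for j from min upto max
               do (incf local-count
                        (iters +iterations+ (* mul i) (* mul j)))))
```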

OK, so what's the big deal? We could have gotten the same 4-core results with thread-based parallelization using the same algorithm, so why bother with MPI?

The reason, of course, is that we can run this exact same code on any **cluster** of machines supporting MPI, so unlike threads, we're not limited to a single machine. I am fortunate enough to have access to a huge cluster, so here are the results using 50 cores (Opteron, 2.4GHz):

```
[snip...]
[24 :0.1349]: local-min=-100, local-max=-1
[23 :0.1350]: local-min=-200, local-max=-101
[26 :0.1351]: local-min=100, local-max=199
[47 :0.1352]: local-min=2200, local-max=2299
[49 :0.1353]: local-min=2400, local-max=2499
[48 :0.1354]: local-min=2300, local-max=2399
[34 :0.1349]: local-min=900, local-max=999
[49 :0.1410]: local-count = 94380 computed after time 0.0071 seconds
[48 :0.1418]: local-count = 169998 computed after time 0.0079 seconds
[47 :0.1419]: local-count = 217754 computed after time 0.0080 seconds
[snip a bunch of output...]
[18 :0.2935]: local-count = 15838486 computed after time 0.1594 seconds
[20 :0.3024]: local-count = 16782526 computed after time 0.1683 seconds
[21 :0.3091]: local-count = 17468466 computed after time 0.1750 seconds
[24 :0.3328]: local-count = 20181214 computed after time 0.1992 seconds
[23 :0.3574]: local-count = 22819862 computed after time 0.2238 seconds
[22 :0.3400]: local-count = 20718664 computed after time 0.2060 seconds
runtime=0.2286 seconds, result 290068052
Cleaning up all processes ... done.
[Ending Job 4764466]
EXIT_CODE:0
```

There are other alternatives for running parallel Lisp programs on a cluster, such as Fare's library, and you can always build your own parallelization infrastructure using sockets. However, on most large-scale, shared clusters used by many users, such as the one I use, these are not feasible because users can't directly address/access individual compute nodes (among other things, that would make task/job scheduling a nightmare), but MPI is the standard API available in such environments.

In future posts, I'll describe how to parallelize the Mandelbrot computation more efficiently to get better speedups. [edit: removed incorrect statement that MPI_Wtime is synchronized - MPI_Wtime is not synch'd unless MPI_WTIME_IS_GLOBAL is set] [edit: fixed mangled code indentation.]

Interesting work! I've programmed in MPI before so I'm familiar with the pain of working with it in low-level languages like C -- it's nice to see the Lispyness start to overtake the C-ness ;-)

I'm curious -- if you had a lot of time to hack, how would you go about wrapping the MPI layer with something even more "lispy"?

I just wanted to say that I am pretty excited to see this get implemented. There are definitely some inconveniences, like it would be nice to just "MPI" a closure.

Maybe one could start the mpirun instances of the Lisp from trivial-shell or some such. Then have the master node report back through a socket connection (or just through the text returned from the shell).
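A minimal sketch of that idea (my own illustration; the file name, paths, and `run-mandelbrot` helper are hypothetical, and it assumes the trivial-shell library is loaded):

```lisp
;; Sketch of the commenter's suggestion: launch the mpirun job from Lisp
;; and collect the text it prints, instead of invoking mpirun by hand.
(defun run-mandelbrot (num-procs)
  (trivial-shell:shell-command
   (format nil "mpirun -np ~d /usr/local/bin/sbcl --load mandel-bench-1 --eval '(progn (main) (sb-ext:quit))'"
           num-procs)))
```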

HilbertAstronaut and Anonymous - Hi, thanks for the comments. You're both pointing out the same basic flaw, which is that CL-MPI currently isn't as "Lispy" as it should be (both at the API level and with regard to usage). I agree, and this is an area of ongoing work.
