Tuesday, June 2, 2009

A message-passing "Hello World" in CL-MPI

[Edit: added paragraph breaks]

Let's start the tour of CL-MPI with a parallel "Hello World" that demonstrates point-to-point communications among processes, and an introduction to the MPI runtime. This post assumes you know little or nothing about MPI, so the pace may be a bit slow for people who already know MPI. I promise to pick up the pace in future posts, so stay tuned!

1  #+sbcl(require 'asdf)
2  (eval-when (:compile-toplevel :load-toplevel :execute)
3   (asdf:operate 'asdf:load-op 'cl-mpi))
4
5  (defun mpi-hello-world ()
6    "Each process other than the root (rank 0) sends a message to process 0"
7    (mpi::with-mpi  ;macro which initializes and finalizes  MPI
8       (let ((tag 0))
9         (cond ((/= 0 (mpi:mpi-comm-rank))
10               (let ((msg (format nil "Greetings from process ~a!" (mpi:mpi-comm-rank))))
11                 ;;send msg to proc 0
12                 (mpi:mpi-send-auto msg 0 :tag tag)))
13              (t ; rank is 0
14               (loop for source from 1 below (mpi:mpi-comm-size) do
15                     ;; receive and print message from each processor
16                     (let ((message (mpi:mpi-receive-auto  source :tag tag)))
17                       (format t "~a~%" message))))))))
Before describing this code, here's a quick overview of the Single-Program, Multiple-Data (SPMD) execution model used by MPI: in the SPMD model, every process executes the same program independently. If we want to run a program on 4 processors using MPI, then MPI spawns 4 separate instances of the program, each as a separate process. Each process has a corresponding identifier, which in MPI terminology is called its rank.
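To make the SPMD model concrete, here is a minimal sketch that uses only the CL-MPI calls appearing in the listing above. Every process runs the same function, but mpi-comm-rank returns a different value in each one (the order of the output lines is not deterministic):

(defun print-my-rank ()
  "Every process executes this same code, but sees a different rank."
  (mpi::with-mpi
    (format t "I am process ~a of ~a~%"
            (mpi:mpi-comm-rank) (mpi:mpi-comm-size))))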

Let's try to run our hello-world from the shell (keep reading to see why it has to be from the shell; if it doesn't become obvious, see the footnotes).

If this was an ordinary, sequential SBCL program, we'd execute it as:

alexf@tesla1:~/cl-mpi/examples$ /usr/local/bin/sbcl --noinform --load "mpi-hello-world" --eval "(progn (mpi-hello-world)(sb-ext:quit))"
Using the MPICH 1.2 implementation of MPI, we do the following (with the resulting output):
alexf@tesla1:~/cl-mpi/examples$ mpirun -np 4 /usr/local/bin/sbcl --noinform --load "mpi-hello-world" --eval "(progn (mpi-hello-world)(sb-ext:quit))"

Greetings from process 1!
Greetings from process 2!
Greetings from process 3!
alexf@tesla1:~/cl-mpi/examples$ 
As you can see, all we did was add "mpirun -np 4" to the previous command line. The MPI runtime environment is started with an executable, usually called mpirun or mpiexec; we used mpirun. The "-np 4" option specifies that 4 processes should be started.

We can see what goes on by adding a (break) between lines 8 and 9 of the code. Now, if we enter the same command line, we'll get a break and a REPL. We can then run ps (from another window) and see that there are in fact 4 SBCL processes running (spawned by mpirun), all started with the same command line parameters.1
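A self-contained variant of the same experiment looks like this (it uses only calls that already appear in the listing; save it to a file and load it the same way as mpi-hello-world):

(defun mpi-break-test ()
  "Drop every MPI process into the debugger so the runtime can be inspected."
  (mpi::with-mpi
    ;; Each of the 4 spawned SBCL processes stops here with its own REPL.
    (break "Process ~a of ~a stopped"
           (mpi:mpi-comm-rank) (mpi:mpi-comm-size))))

Either way, ps shows something like this: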

alexf@tesla1:~/cl-mpi$ ps -aef | grep sbcl
alexf 25743 25740  0 17:58 ?        00:00:22 /usr/local/bin/sbcl
alexf 26978 26975  0 22:46 pts/3    00:00:00 /bin/sh -c mpirun -np 4 /usr/local/bin/sbcl --load "mpi-hello-world" --eval "(progn (mpi-hello-world)(sb-ext:quit))"
alexf 26979 26978  0 22:46 pts/3    00:00:00 /bin/sh /usr/bin/mpirun -np 4 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27005 26979 14 22:46 pts/3    00:00:00 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27006 27005  0 22:46 pts/3    00:00:00 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27007 27005  0 22:46 pts/3    00:00:00 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27008 27005  0 22:46 pts/3    00:00:00 /usr/local/bin/sbcl --load mpi-hello-world --eval (progn (mpi-hello-world)(sb-ext:quit))
alexf 27011 28544  0 22:46 pts/6    00:00:00 grep sbcl
The output shown above from the hello-world program shows that processes 1-3 sent a greeting message (to process 0), and these were displayed on standard output. Now, let's look at the code: lines 1-3 are boilerplate for loading the CL-MPI library. The macro with-mpi (line 7) executes its body in an MPI environment; the MPI_Init and MPI_Finalize functions are called before and after the body, respectively.
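As a rough mental model (this is only an illustrative sketch, not CL-MPI's actual definition, and the wrapper names mpi-init and mpi-finalize are my assumptions for MPI_Init and MPI_Finalize), with-mpi behaves something like:

(defmacro with-mpi-sketch (&body body)
  ;; Illustrative expansion only: initialize MPI, run the body, and make
  ;; sure finalization happens even if the body signals an error.
  `(progn
     (mpi:mpi-init)                  ; assumed wrapper for MPI_Init
     (unwind-protect
          (progn ,@body)
       (mpi:mpi-finalize))))         ; assumed wrapper for MPI_Finalize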

The interesting behavior starts on line 9. Up to this point, all processes have been executing the same code (all of them have loaded the cl-mpi library, started executing the mpi-hello-world function, entered the body of the with-mpi macro, and initialized tag to 0). Recall that each process has an associated rank; rank 0 is usually referred to as the "root". Line 9 uses mpi-comm-rank, a wrapper for MPI_Comm_rank, to get the process rank and branches on it, so lines 10-12 are executed on every process except process 0, and lines 14-17 are executed only on process 0.

Let's consider what happens on processes with rank > 0. Line 10 creates a message containing the process rank, and line 12 calls mpi-send-auto to perform a blocking send to process 0. A blocking send is a point-to-point operation that does not return until the message buffer can safely be reused -- which, depending on the MPI implementation and message size, can mean waiting until the receiver has actually received the message (MPI and CL-MPI also support non-blocking, asynchronous communications).

Meanwhile, the process with rank 0 executes the loop on line 14: source iterates from 1 below (mpi-comm-size), a wrapper for MPI_Comm_size, which returns the total number of processes2. At each iteration, line 16 performs a blocking receive using mpi-receive-auto, which waits until a message from the given source has been received, and line 17 prints the message.

Note that lines 10-12 are executed in parallel by the processes with rank > 0, and the order in which these processes initiate their sends is not specified. However, on process 0, the receiver, lines 14-17 are executed sequentially, and line 16 first waits for the message from process 1 before continuing, regardless of the order in which messages were actually sent. Thus, the structure of lines 14-17 imposes a serialization on the order in which messages are processed and printed by process 0, so the output above shows the message from process 1 first, followed by process 2, and finally process 3. (We do not have to process the messages in this particular order; for example, MPI_Probe or MPI_Waitany can be used to process messages in the order in which they arrive.)

The send/receive functions mpi-send-auto and mpi-receive-auto are wrappers around MPI_Send and MPI_Recv, but they do a bit more work than their C counterparts. Both MPI_Send and MPI_Recv require that the type and size of the objects be fully specified. The "-auto" part of mpi-send-auto and mpi-receive-auto implements a simple protocol by which any object can be sent and received without specifying its type or size. This is convenient in cases where communication efficiency is not critical. CL-MPI also provides lower-level functions that map more directly to MPI_Send and MPI_Recv so that you can squeeze out performance, but even for applications where every bit of performance matters, it's nice to be able to use the -auto versions during the development phase (you know, the whole "dynamic language" thing...).
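As a usage sketch (the data here is an arbitrary example; it relies only on the property described above, namely that the -auto functions accept any serializable Lisp object without type or size declarations):

(defun send-anything-demo ()
  "Each non-root process sends a list containing a keyword, its rank, and a vector."
  (mpi::with-mpi
    (let ((tag 1))
      (if (/= 0 (mpi:mpi-comm-rank))
          ;; No type or size declarations needed: the -auto protocol handles them.
          (mpi:mpi-send-auto (list :rank (mpi:mpi-comm-rank) :data #(1.0 2.0 3.0))
                             0 :tag tag)
          (loop for source from 1 below (mpi:mpi-comm-size)
                do (print (mpi:mpi-receive-auto source :tag tag)))))))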

So that was a simple example of point-to-point communications. Coming up: more complicated examples that show what parallel programming in CL-MPI is actually good for.


1 Warning -- if you actually try this, you'll have to kill all of the SBCL processes afterwards. As you might expect with multiple Lisp processes all contending for standard input/output, I/O in MPI gets messy when multiple REPLs are invoked. If you were puzzled by all this talk of running things from the command line in Lisp, it should be clear now -- the multi-process MPI runtime and the usual way of using Lisp with interactive REPLs don't mix too well. Thread-based parallelism already raises some debugging issues, but with process-based parallelism such as MPI the problem is worse. On the other hand, the REPLs do work in a more or less reasonable way; you just have to be careful how you use them (maybe I'll write a post on parallel debugging at some point). Also, the debugging situation with CL-MPI is no worse than with any other MPI language binding -- this is the price to be paid for a framework that allows parallelism across different machines.

2 MPI Communicators (what "Comm" stands for) are more complex than that, but let's ignore that for now.

2 comments:

  1. Cool! I played with PVM on SBCL several years ago, but you are far ahead!

  2. Thanks. I recalled downloading a PVM wrapper a while ago. I just googled it (LPVM) and it looks like your project -- too bad it didn't go further. I used PVM a long time ago, and PVM had some nice features, but for better or worse, MPI became the standard...
