Sunday, December 20, 2009

closing in on the SBCL thread mystery?

I have distilled the messy code from yesterday into the minimal chunk of code that will cause bizarre multi-thread behavior on SBCL 1.0.33
(defun compute-thread (thread-num rows-per-proc min max mul)
   ;; a pretty straightforward computation
  (let* ((local-min (+ min (* thread-num rows-per-proc)))
     (local-max (1- (+ local-min rows-per-proc)))
     (local-count 0))
    (loop for i from local-min upto local-max
      do (loop for j from min upto max
           do (incf local-count
               (let ((xc (* mul i)))
             (loop for count from 0 below 100
                   do (when (>=  xc 4)
                    (return count))
                   (incf xc)
                   finally (return 100)))
               )))
    #+nil(format *out* "Thread ~a local-min=~a, local-max=~a local-count=~d~%"
        thread-num local-min local-max local-count)
    local-count))


(defun main (num-threads)
  ;; spawn some threads and sum the results
  (loop with rows-per-proc = (/ 100 num-threads)
    for thread in
    (loop for thread-num from 0 below num-threads
          collect
          (let ((thread-num thread-num));establish private binding of thread-num for closure
        (sb-thread:make-thread (lambda ()
                     (compute-thread thread-num rows-per-proc -250 250 0.008d0)))))
    summing (sb-thread:join-thread thread)))

(defun test (num-threads num-iterations expected-val)
  ;; this is just a test wrapper which tests that the result of main is consistent
  (loop for i from 0 below num-iterations do
    (format t "Run ~a:" i)
    (let ((result (main num-threads)))
      (format t "result=~a~%" result)
      (assert (=  expected-val result)))))
I expect: CL-USER> (test 10 1000 (main 1)) to complete without assertions, because the result of (main num-threads) should always be the same as the result of (main 1) Unfortunately:
CL-USER> (test 10 1000 (main 1)) ;; test 10 threads
Run 0:result=300600
Run 1:result=300600
Run 2:result=300600
Run 3:result=300600
.
.  
.
Run 494:result=300600
Run 495:result=300600
Run 496:result=300600
Run 497:result=300600
Run 498:result=300601   ;;<---- Oh no!!!
; Evaluation aborted.
Arghh... ;-( As a sanity check: (test 1 1000 (main 1)) completes without problems -- in other words, 1 thread always seems to computes the same answer, so it seems to be a multi-thread problem.

Saturday, December 19, 2009

A SBCL multi-threaded computation mystery

Update: 2009/12/20 -- this is obsolete -- I now have a a much cleaner, shorter, piece of code which replicates the bizarre behavior below, in today's post. I ran into this thread-related mystery a while ago, forgot all about it, and then ran into the problem again today on a similar chunk of code and wasted a bunch of time, so I'll put it up here. Maybe somebody with experience in SBCL+threads is reading and could give me a hint ;-) Here is a multithreaded version of the simple Mandelbrot benchmark which I earlier parallelized using CL-MPI:
(defconstant +resolution+ 500)
(defconstant +iterations+ 100)
(defparameter *start-time* nil )  
(defparameter *out* *standard-output*)

(defun iters (max xc yc)
  (let ((x xc)
        (y yc))
    (declare (double-float x y)
             (fixnum max))
    (loop for count from 0 below max
          do (when (>= (+ (* x x) (* y y)) 4.0d0)
               (return-from iters count))
             (let ((tmp (+ (- (* x x) (* y y)) xc)))
               (setf y (+ (* 2.0d0 x y) yc)
                     x tmp)))
    max))

(defun compute-thread (thread-num rows-per-proc min max mul)
  (declare (fixnum thread-num rows-per-proc min max))
  (declare (type double-float  mul))
  (let* ((local-min (+ min (the fixnum (* thread-num rows-per-proc))))
     (local-max (1- (+ local-min rows-per-proc)))
     (local-count 0))
    (declare (fixnum local-min local-max local-count))
    (loop for i fixnum from local-min upto local-max
      do (loop for j fixnum from min upto max
           do (incf local-count
               #+call-func (iters +iterations+ (* mul i) (* mul j))
               #-call-func
               (let* ((xc (* mul i))
                  (yc (* mul j))
                  (x xc)
                  (y yc))
                  (declare (double-float x y))
                  (loop for count from 0 below +iterations+
                          do (cond((>= (+ (* x x) (* y y)) 4.0d0)
                                          (return count))
                                        (t
                                         (let ((tmp (+ (- (* x x) (* y y)) xc)))
                                            (setf y (+ (* 2.0d0 x y) yc)
                                                  x tmp))))
                        finally (return +iterations+)))
               )))
    (format *out* "Thread ~a local-min=~a, local-max=~a local-count=~d time ~,4f~%"
        thread-num local-min local-max local-count
        (float (/ (- (get-internal-run-time) *start-time*) internal-time-units-per-second)))
    local-count))

(defun mandelbrot-main (num-threads)
  "straightforward parallelization which partitions the work into equal regions,
   and each region is assigned to a process."
  (declare (fixnum num-threads))
  (format t "---~%")
  (setf *start-time* (get-internal-run-time))
  (let* ((max (truncate +resolution+ 2))
         (min (- max))
         (mul (/ 2.0d0 max)))
    ;; spawn computation threads
    (loop with rows-per-proc = (/ +resolution+ num-threads)
      for thread in
      (loop for thread-num from 0 below num-threads
        collect
        (let ((thread-num thread-num)); needed to establish private binding for closure
          (sb-thread:make-thread (lambda ()
                       (compute-thread thread-num rows-per-proc min max mul)))))
      summing (sb-thread:join-thread thread))))

Note that in compute-thread, there is a conditionally compiled chunk of code: if *features* contains :call-func, the function iters is called. However, if :call-func is not in *features*, then instead, I compile a chunk of code which performs the exact same computation as the iters function. This code should behave identically regardless of #+call-func or #-call-func, right? Wrong -- apparently... ;-( The mandelbrot-main function returns the number of points which are determined to be in the Mandelbrot set, so here is a test function to verify that the computation is consistent and always computes the same, expected value:
(defun test (num-threads num-iterations expected-val)
  (loop for i from 0 below num-iterations do
    (format t "Run ~a:" i)
    (let ((result (mandelbrot-main num-threads)))
      (format t "result=~a~%" result)
      (assert (=  expected-val result)))))
For +resolutions+ = 500 and +iterations+ = 100, the result should be 2905274. When I compiled with #-call-func, (test 10 1000 2905274) completes without asserting an inconsistenty. However, when compiled with #+call-func, (test 10 1000 2905274) asserts an inconsistency after some non-deterministic number of repetitions. e.g.,
CL-USER> (test 10 1000 2905274)
Run 0:---
Thread 1 local-min=-200, local-max=-151 local-count=103224 time 0.0000
Thread 0 local-min=-250, local-max=-201 local-count=29924 time 0.0000
Thread 3 local-min=-100, local-max=-51 local-count=596048 time 0.0120
Thread 2 local-min=-150, local-max=-101 local-count=364384 time 0.0200
Thread 4 local-min=-50, local-max=-1 local-count=978228 time 0.0400
Thread 6 local-min=50, local-max=99 local-count=52917 time 0.0400
Thread 7 local-min=100, local-max=149 local-count=24513 time 0.0400Thread
8 local-min=150, local-max=199 local-count=17788 time 0.0400
Thread 9 local-min=200, local-max=249 local-count=10354 time 0.0440
Thread 5 local-min=0, local-max=49 local-count=727894 time 0.0480
result=2905274
.
. 
.
Run 146:---
Thread 0 local-min=-250, local-max=-201 local-count=29924 time 0.0000
Thread 1 local-min=-200, local-max=-151 local-count=103224 time 0.0000
Thread 2 local-min=-150, local-max=-101 local-count=364384 time 0.0120
Thread 3 local-min=-100, local-max=-51 local-count=596048 time 0.0200
Thread 4 local-min=-50, local-max=-1 local-count=978228 time 0.0240
Thread 7 local-min=100, local-max=149 local-count=24513 time 0.0240
Thread 6 local-min=50, local-max=99 local-count=52917 time 0.0280
Thread 8 local-min=150, local-max=199 local-count=17788 time 0.0360
Thread 9 local-min=200, local-max=249 local-count=10354 time 0.0360
Thread 5 local-min=0, local-max=49 local-count=727894 time 0.0440
result=2905274
Run 147:---
Thread 0 local-min=-250, local-max=-201 local-count=29924 time 0.0080
Thread 1 local-min=-200, local-max=-151 local-count=103224 time 0.0080
Thread 2 local-min=-150, local-max=-101 local-count=364384 time 0.0120
Thread 8 local-min=150, local-max=199 local-count=17788 time 0.0280
Thread 6 local-min=50, local-max=99 local-count=52917 time 0.0360
Thread 3 local-min=-100, local-max=-51 local-count=596147 time 0.0360
Thread 9 local-min=200, local-max=249 local-count=10354 time 0.0360
Thread 7 local-min=100, local-max=149 local-count=24513 time 0.0360
Thread 5 local-min=0, local-max=49 local-count=727894 time 0.0360
Thread 4 local-min=-50, local-max=-1 local-count=978228 time 0.0400
result=2905373    <--- Uh, oh!!!!!!!!!!!
.
. and SLIME complains:
The assertion (= EXPECTED-VAL RESULT) failed.
Yikes! Looking closer, it looks like Thread 3 has a discrepancy: In the "correct" Runs (i.e., for the first 146 Runs), the output from Thread 3 was:
Thread 3 local-min=-100, local-max=-51 local-count=596048 time 0.0200
but on the 147th Run, the output is:
Thread 3 local-min=-100, local-max=-51 local-count=596147 time 0.0360
I have absolutely no idea what could be causing this. The most baffling thing is that when compiled with #-call-func , there is no problem -- that is, calling the iter function instead of executing an inlined version of the same computation seems to be responsible for causing this problem! Very, very bizarre because the iter function and its caller don't touch any special variables or do anything else which looks unsafe to me. In addition, there's no problem if I compile with #+call-func but running but only use 1 thread
(test 1 1000 2905274)
. If I use more than 1 thread, the inconsistency eventually shows up (the more threads, the earlier the problem shows up..). So, this must be something related to multithreading. Is it possible that this is a bug in SBCL, or am I making some silly thread-related mistake here? (I'm running this with SBCL1.0.33 and SBCL1.0.32, x86-64)