Sunday, December 20, 2009

closing in on the SBCL thread mystery?

I have distilled the messy code from yesterday into the minimal chunk of code that still causes bizarre multi-threaded behavior on SBCL 1.0.33:
(defun compute-thread (thread-num rows-per-proc min max mul)
  ;; a pretty straightforward computation: each thread handles its own
  ;; block of rows-per-proc rows and returns its local count
  (let* ((local-min (+ min (* thread-num rows-per-proc)))
         (local-max (1- (+ local-min rows-per-proc)))
         (local-count 0))
    (loop for i from local-min upto local-max
          do (loop for j from min upto max
                   do (incf local-count
                            (let ((xc (* mul i)))
                              (loop for count from 0 below 100
                                    do (when (>= xc 4)
                                         (return count))
                                       (incf xc)
                                    finally (return 100))))))
    #+nil (format *out* "Thread ~a local-min=~a, local-max=~a local-count=~d~%"
                  thread-num local-min local-max local-count)
    ;; return value, summed by main via join-thread
    local-count))

(defun main (num-threads)
  ;; spawn some threads and sum the results
  (loop with rows-per-proc = (/ 100 num-threads)
        for thread in
        (loop for thread-num from 0 below num-threads
              collect
              (let ((thread-num thread-num)) ; establish private binding of thread-num for closure
                (sb-thread:make-thread
                 (lambda ()
                   (compute-thread thread-num rows-per-proc -250 250 0.008d0)))))
        summing (sb-thread:join-thread thread)))

(defun test (num-threads num-iterations expected-val)
  ;; this is just a test wrapper which tests that the result of main is consistent
  (loop for i from 0 below num-iterations do
    (format t "Run ~a:" i)
    (let ((result (main num-threads)))
      (format t "result=~a~%" result)
      (assert (=  expected-val result)))))
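
For reference, the value every run should produce can be worked out without any threads. The sketch below is not part of the original code; the names escape-count and expected-total are made up for illustration. It relies on the observation that the inner value depends only on the row i, not on j, so each row contributes the same count once per column (501 columns for j from -250 to 250).

```lisp
;; Hypothetical helpers for illustration, not from the original post.
(defun escape-count (xc)
  ;; Same inner loop as in compute-thread: count how many times 1 must be
  ;; added to xc before it reaches 4 (capped at 100).
  (loop for count from 0 below 100
        do (when (>= xc 4)
             (return count))
           (incf xc)
        finally (return 100)))

(defun expected-total (&key (rows 100) (min -250) (max 250) (mul 0.008d0))
  ;; Row i contributes (max - min + 1) copies of the same escape count,
  ;; because the inner value does not depend on j.
  (loop for i from min below (+ min rows)
        sum (* (1+ (- max min)) (escape-count (* mul i)))))

(expected-total) ; => 300600
```

With the parameters used in main, every row from -250 to -151 starts at an xc between -2.0 and about -1.2, so each one needs 6 increments to reach 4. That gives 100 rows x 501 columns x 6 = 300600, independent of how the rows are divided among threads.
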
I expect (test 10 1000 (main 1)) to complete without any assertion failures, because the result of (main num-threads) should be the same as the result of (main 1) for any number of threads. Unfortunately:
CL-USER> (test 10 1000 (main 1)) ;; test 10 threads
Run 0:result=300600
Run 1:result=300600
Run 2:result=300600
Run 3:result=300600
...
Run 494:result=300600
Run 495:result=300600
Run 496:result=300600
Run 497:result=300600
Run 498:result=300601   ;;<---- Oh no!!!
; Evaluation aborted.
Arghh... ;-( As a sanity check, (test 1 1000 (main 1)) completes without problems. In other words, a single thread always seems to compute the same answer, so this looks like a multi-threading problem.
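
One possible next step, sketched here as an assumption rather than something from the original code, is to keep the per-thread counts separate instead of summing them, so that a bad run shows which thread went wrong. The names per-thread-results and report-deviations are hypothetical. The worker is passed in as an argument so the sketch stands on its own, and the expected value per thread hard-codes the 501 columns and the count of 6 that apply to the parameters used in main. It also assumes num-threads divides 100 evenly.

```lisp
;; Hypothetical helpers for illustration, not from the original post.
;; Requires an SBCL build with thread support.
(defun per-thread-results (num-threads worker)
  ;; Like main, but returns the list of per-thread counts instead of their sum.
  (let* ((rows-per-proc (/ 100 num-threads))
         (threads
           (loop for thread-num from 0 below num-threads
                 collect (let ((thread-num thread-num)) ; private binding per closure
                           (sb-thread:make-thread
                            (lambda ()
                              (funcall worker thread-num rows-per-proc
                                       -250 250 0.008d0)))))))
    ;; join-thread returns the value of the thread's function
    (mapcar #'sb-thread:join-thread threads)))

(defun report-deviations (num-threads worker)
  ;; Prints only the threads whose count differs from rows * 501 * 6.
  (let ((expected (* (/ 100 num-threads) 501 6)))
    (loop for result in (per-thread-results num-threads worker)
          for thread-num from 0
          unless (= result expected)
            do (format t "Thread ~a returned ~a, expected ~a~%"
                       thread-num result expected))))
```

Calling (report-deviations 10 #'compute-thread) in a loop prints nothing while all threads agree with the expected count, and names the offending thread when one does not.
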
