Friday, May 29, 2009

CFFI vs UFFI performance

The current, de facto portable foreign function interface (FFI) for Common Lisp is CFFI. An alternative to CFFI is UFFI, a slightly older FFI created by Kevin Rosenberg.

The CFFI package includes a compatibility layer for UFFI called CFFI-UFFI-COMPAT, which can be used as a drop-in replacement for UFFI -- in other words, if CFFI is installed on our machine, we can use systems that depend on UFFI simply by changing the dependency in the ASDF system definition from "UFFI" to "CFFI-UFFI-COMPAT". The obvious benefit is that installing UFFI becomes unnecessary.

A surprising side benefit can be improved performance.

I recently tried using CL-GD, a FFI wrapper for the GD graphics library. CL-GD uses UFFI. I implemented a simple function which compares the RGB components of both images.

(defun square (x)
  (the fixnum (* (the fixnum x) (the fixnum x))))

(defun compare-images (image1 image2)
 "Compare two images -- uses API exported by CL-GD"
 (declare (optimize (debug 0) (speed 3)(safety 0)))
 (let ((err 0)
       (height (cl-gd:image-height image1))
       (width (cl-gd:image-width image1)))
   (declare (fixnum err height width))
   (loop for y fixnum from 0 below height do
         (loop for x fixnum from 0 below width do
               (let ((img-color (cl-gd:get-pixel y x :image image2))
                     (target-color (cl-gd:get-pixel y x :image image1)))
                 (incf err (square (- (the fixnum (cl-gd:color-component 
                                                        :red img-color :image image2))
                                      (the fixnum (cl-gd:color-component 
                                                        :red target-color :image image1)))))
                 (incf err (square (- (the fixnum (cl-gd:color-component 
                                                        :blue img-color :image image2))
                                      (the fixnum (cl-gd:color-component 
                                                        :blue target-color :image image1)))))
                 (incf err (square (- (the fixnum (cl-gd:color-component 
                                                        :green img-color :image image2))
                                      (the fixnum (cl-gd:color-component 
                                                        :green target-color :image image1))))))))
   err))

First, the timing with the standard CL-GD: (SBCL 1.0.28, CL-GD 0.5.6, UFFI-1.6.1, 3GHz Intel Core2 Duo, Linux))

EVO-LISA> (progn
 (defparameter *image1* (cl-gd:create-image-from-file "023.JPG"))
 (defparameter *image2* (cl-gd:create-image-from-file "023.JPG"))
 (time (compare-images *image1* *image2*)))

height1=1024, height2=768
Evaluation took:
 1.684 seconds of real time
 1.688105 seconds of total run time (1.688105 user, 0.000000 system)
 100.24% CPU
 5,061,067,326 processor cycles
 141,296 bytes consed
0

1.68 seconds to compare a pair of 1024x768 images is extremely slow. Since iterating over a couple of 1024x768 arrays and computing their difference in Common Lisp doesn't take anything close to 1.7 seconds (see below), it seemed that the bottleneck must be in the FFI.

For many applications, the FFI overheads don't really matter that much, because the goal of using the FFI is to use a library written in another language without having to reimplement the library in Lisp. On the other hand, another reason for using foreign libraries is to use fast libraries implemented in C. Since CFFI is being used for low-level libraries such as IOLib, where performance matters, I thought I'd try using the UFFI compatibility layer for CFFI.

All I did was edit the cl-gd.asd file so that the dependency on UFFI was replaced by a dependency on CFFI-UFFI-COMPAT (CL-GD already depends on CFFI-UFFI-COMPAT for CLISP and OpenMCL, but by default, depends on UFFI for SBCL).

The improvement wasimpressive - 0.336 seconds using CFFI-UFFI-COMPAT, comapred to 1.68 seconds with UFFI, a 5x speedup.

EVO-LISA> (progn
 (defparameter *image1* (cl-gd:create-image-from-file "023.JPG"))
 (defparameter *image2* (cl-gd:create-image-from-file "023.JPG"))
 (time (compare-images *image1* *image2*)))

height1=1024, height2=768
Evaluation took:
 0.336 seconds of real time
 0.336021 seconds of total run time (0.336021 user, 0.000000 system)
 100.00% CPU
 1,010,111,976 processor cycles
 21,712 bytes consed

Of course, there is still a massive overhead due to accessing pixel information through the CL-GD API, which is partially due to the FFI layer, partially due to GD, and partially due to CL-GD. How large is this overhead? Here's a simple function which compares two color arrays written in pure Common Lisp: This is not even that optimized -- it doesn't even declare the types of image1 and image2.

(defstruct color
  (red 0 :type fixnum)
  (blue 0 :type fixnum)
  (green 0 :type fixnum))
(defun compare-2darrays (image1 image2 height width)
 "Compare two images -- uses API exported by CL-GD"
 (declare (fixnum  height width))
 (declare (optimize (debug 0) (speed 3)(safety 0)))
 (let ((err 0))
   (declare (fixnum err height width))
   (loop for y fixnum from 0 below height do
          (loop for x fixnum from 0 below width do
               (let ((color1 (aref image1 y x))
                     (color2 (aref image2 y x)))
                 (incf err (square (- (color-red color1) (color-red color2))))
                 (incf err (square (- (color-blue color1) (color-blue color2))))
                 (incf err (square (- (color-green color1) (color-green color2)))))))))
The results?
CL-USER> (let ((img1 (make-array '(1024 768) :initial-element (make-color)))
               (img2 (make-array '(1024 768) :initial-element (make-color))))
           (time (compare-2darrays img1 img2 1024 768)))
Evaluation took:
  0.048 seconds of real time
  0.048003 seconds of total run time (0.048003 user, 0.000000 system)
  100.00% CPU
  143,562,681 processor cycles
  0 bytes consed  

0.048 seconds to perform basically the same operations (iterate over both arrays, computing the square of their difference, compared to 0.336 seconds for the code which accesses CL-GD via CFFI.

I considered the overhead within CFFI/UFFI, but there's another source of inefficiency, which is the CL-GD binding code. As clearly stated in the docs, CL-GD wasn't designed for speed, and in order to provide a nice API, there is some overhead in the binding code. I'm abusing CL-GD by using it for repetitive pixel-level access because GD is a image creation library and not an image processing library, but I needed to load and create some GIF files, and the first library I ran into was CL-GD.

Here is another version of the CL-GD based code where I bypass the officially exported CL-GD API and use some internal functions (I know some people don't like it, but being able to access internal symbols with "::" is a great feature of Common Lisp, imho).

(defun compare-images-raw (image1 image2)
 "Compare two images -- use internal functions to eliminate overhead
added by CL-GD API"
 (declare (optimize (debug 0) (speed 3)(safety 0)))
 (let ((err 0)
       (height (cl-gd:image-height image1))
       (width (cl-gd:image-width image1))
       (img1 (cl-gd::img image1))
       (img2 (cl-gd::img image2)))
   (declare (fixnum err height width))
   (loop for y fixnum from 0 below height do
         (loop for x fixnum from 0 below width do
               (let ((img2-color (cl-gd:get-pixel y x :image image2))
                     (img1-color (cl-gd:get-pixel y x :image image1)))
                 (incf err (square (- (the fixnum(cl-gd::gd-image-get-red img2 img2-color))
                                      (the fixnum(cl-gd::gd-image-get-red img1 img1-color)))))
                 (incf err (square (- (the fixnum(cl-gd::gd-image-get-green img2 img2-color))
                                      (the fixnum(cl-gd::gd-image-get-green img1 img1-color)))))
                 (incf err (square (- (the fixnum(cl-gd::gd-image-get-blue img2 img2-color))
                                      (the fixnum(cl-gd::gd-image-get-blue img1 img1-color))))))))
   err))
The results:
EVO-LISA> (progn
           (defparameter *image1* (cl-gd:create-image-from-file "023.JPG"))
           (defparameter *image2* (cl-gd:create-image-from-file "023.JPG"))
           (time (compare-images-raw *image1* *image2*)))

height1=1024, height2=768
Evaluation took:
 0.282 seconds of real time
 0.284018 seconds of total run time (0.284018 user, 0.000000 system)
 100.71% CPU
 845,516,097 processor cycles
 21,936 bytes consed
So we're down to 0.282 seconds from 0.336 seconds. Noticeable, but still a long ways to go from the raw CL code. If this was an important enough example (it's not), I could try, for exmaple, eliminating the FFI overhead by digging around the GD API and possibly modifying it to implement a quicker way to grab an entire image at once, instead of pixel by pixel.

I haven't looked deeper into why CFFI turned out to be 5x faster than UFFI for this example, and there may be examples where UFFI is faster. But the lesson remains the same: (1) FFI can in the wrapper libraries we use can be a peformance bottleneck, (2) CFFI-UFFI-COMPAT offers a quick way to swap in CFFI for UFFI.

I sent a summary of the above to Edi Weitz, author of CL-GD (and many other great libraries!), and he suggested that I send a patch, after more testing -- unfortunately, CL-GD failed a couple of tests in the test suite and I don't have time to dig around right now. Maybe a little later...

No comments:

Post a Comment