Life of Brian

optimization — cgrand, 17 November 2009 @ 13 h 24 min

Some weeks ago, Lau Jensen implemented Brian’s Brain but gave up making it go faster. He reached for my advice but I was swamped in work.

First cut

Here is my take:

(defn neighbours-count [rows i w]
  (count (for [j [(mod (inc i) w) i (mod (dec i) w)]
               r rows :when (= :on (r j))] r)))

(defn step1 [rows]
  (let [current (second rows)
        w (count current)]
    (loop [i (dec w), row (transient current)]
      (if (neg? i)
        (persistent! row)
        (let [c (current i)]
          (cond
            (= c :on) 
              (recur (dec i) (assoc! row i :dying))
            (= c :dying) 
              (recur (dec i) (assoc! row i :off))
            (= 2 (neighbours-count rows i w))
              (recur (dec i) (assoc! row i :on))
            :else
              (recur (dec i) row)))))))

(defn step [board]
  (vec (map step1 
         (partition 3 1 (concat [(peek board)] board [(first board)])))))

A board is a vector of vectors (rows) containing either :on, :dying or :off.

If you look closely at step you should recognize torus-window from Lau’s Brians functional brain. I must confess I am the one who gave Lau the idea of the seq-heavy implementation — just to prove it was doable without indices.
This is why I chose to keep a clear, functional, indexless processing at the row level. It is only the cell level that is worth optimizing.

step1 steps a single row. It takes as only argument a seq of 3 rows (above, current and below) and use a transient vector to compute the next state of the current row. I tried to keep neighbours-count simple while reducing the number of intermediate seqs (less alloc, less GC, faster code) by using a seq comprehension (for) rather than several higher-order functions.

How does it run?

First a little utility to create random boards (with one third of :on cells):

(defn rand-board [w h]
  (vec (map vec (partition w (take (* w h) (repeatedly #(if (zero? (rand-int 3)) :on :off)))))))

Then the actual benchmark:

user=> (do (let [b (rand-board 80 80)] (time ((apply comp (repeat 1000 step)) b))) nil)
"Elapsed time: 16734.609776 msecs"
nil
user=> (do (let [b (rand-board 200 200)] (time ((apply comp (repeat 100 step)) b))) nil)
"Elapsed time: 10200.273708 msecs"
nil

So I get 16.7ms/step on a 80*80 board (where Lau was stuck at 200ms on his box) and 102ms/step on a 200*200 board.

Optimizations

Unrolling neighbours-count

neighbours-count is called once for each :off cell and is rather complex so making it go faster should yield sensible gains this is why I unrolled it:

(defn neighbours-count [[above current below] i w]
  (let [j (mod (inc i) w)
        k (mod (dec i) w)
        s #(if (= (%1 %2) :on) 1 0)]
    (+ (+ (+ (s above j) (s above i))
         (+ (s above k) (s current j)))
      (+ (+ (s current k) (s below j))
        (+ (s below i) (s below k))))))

I paired the sum terms because currently only math ops with two args are inlined.

user=> (do (let [b (rand-board 80 80)] (time ((apply comp (repeat 1000 step)) b))) nil)
"Elapsed time: 7880.492493 msecs"
nil
user=> (do (let [b (rand-board 200 200)] (time ((apply comp (repeat 100 step)) b))) nil)
"Elapsed time: 4679.679401 msecs"
nil

More than twice as fast! 7.9ms/step for a 80*80 board and 46.8ms/step for a 200*200 one.

Using java arrays

Since there is little sharing between two iterations, vectors are not that useful so why not to try using arrays to represent rows? The board is then a vector of arrays:

(defn neighbours-count [[above current below] i w]
  (let [j (mod (inc i) w)
        k (mod (dec i) w)
        s #(if (= (aget #^objects %1 (int %2)) :on) 1 0)]
    (+ (+ (+ (s above j) (s above i))
         (+ (s above k) (s current j)))
      (+ (+ (s current k) (s below j))
        (+ (s below i) (s below k))))))

(defn step1 [rows]
  (let [#^objects current (second rows)
        w (count current)]
    (loop [i (int (dec w)), #^objects row (aclone current)]
      (if (neg? i)
        row
        (let [c (aget current i)]
          (cond
            (= c :on) 
              (recur (dec i) (doto row (aset i :dying)))
            (= c :dying) 
              (recur (dec i) (doto row (aset i :off)))
            (= 2 (neighbours-count rows i w))
              (recur (dec i) (doto row (aset i :on)))
            :else
              (recur (dec i) row)))))))

(defn step [board]
  (vec (map step1 
         (partition 3 1 (concat [(peek board)] board [(first board)])))))

(defn rand-board [w h]
  (vec (map #(into-array Object %) (partition w (take (* w h) (repeatedly #(if (zero? (rand-int 3)) :on :off)))))))

Benchmark numbers:

user=> (do (let [b (rand-board 80 80)] (time ((apply comp (repeat 1000 step)) b))) nil)
"Elapsed time: 5155.422268 msecs"
nil
user=> (do (let [b (rand-board 200 200)] (time ((apply comp (repeat 100 step)) b))) nil)
"Elapsed time: 2964.524262 msecs"
nil

Arrays cut one third of the runtime: we are down to 5.2ms/step for a 80*80 board and 29.6ms/step for a 200*200 one. This means the overall speedup is greater than 3x!

Adding a swing frontend

I decoupled the rendering from the computation: 25 fps no matter how many steps are really computed. If the computation is fast not all steps are rendered (eg for a 80*80 board at 200 sps only one step out of eight is rendered to screen).

(import '(javax.swing JFrame JPanel) 
        '(java.awt Color Graphics2D))

(defn dims [[row :as board]] [(count row) (count board)])

(defn render-board [board #^Graphics2D g cell-size]
  (let [[w h] (dims board)]
    (doto g
      (.setColor Color/BLACK)
      (.scale cell-size cell-size)
      (.fillRect 0 0 w h))
    (doseq [i (range h) j (range w)]
      (when-let [color ({:on Color/WHITE :dying Color/GRAY} ((board i) j))]
        (doto g
          (.setColor color)
          (.fillRect j i 1 1))))))

(defn start-brian [board cell-size]
  (let [[w h] (dims board)
        switch (atom true)
        board (atom board)
        panel (doto (proxy [JPanel] []
                      (paint [g] (render-board @board g cell-size)))
                (.setDoubleBuffered true))]
    (doto (JFrame.) 
      (.addWindowListener (proxy [java.awt.event.WindowAdapter] []
                            (windowClosing [_] (reset! switch false))))
      (.setSize (* w cell-size) (* h cell-size))
      (.add panel)
      .show)
    (future (while @switch (swap! board step)))
    (future (while @switch (.repaint panel) (Thread/sleep 40)))))

(start-brian (rand-board 200 200) 5)

The rendering code is slightly different when using arrays:

(defn render-board [board #^Graphics2D g cell-size]
  (let [[w h] (dims board)]
    (doto g
      (.setColor Color/BLACK)
      (.scale cell-size cell-size)
      (.fillRect 0 0 w h))
    (doseq [i (range h) j (range w)]
      (when-let [color ({:on Color/WHITE :dying Color/GRAY} (aget #^objects (board i) (int j)))]
        (doto g
          (.setColor color)
          (.fillRect j i 1 1))))))

Putting all cores to work

The first attempt to parallelizing is to replace map by pmap in step.

(defn step [board]
  (vec (pmap step1 
         (partition 3 1 (concat [(peek board)] board [(first board)])))))

This does not yield such a gain:

user=> (do  (let [b (rand-board 80 80)] (time ((apply comp (repeat 1000 step)) b))) nil)
"Elapsed time: 5624.407083 msecs"
nil
user=> (do  (let [b (rand-board 200 200)] (time ((apply comp (repeat 100 step)) b))) nil)
"Elapsed time: 2778.74241 msecs"
nil

Performances are slighlty worse on the 80*80 board and slightly better on the 200*200. The reason is that the processing of a single row is not computationally intense enough compared to the parallelization overhead. So I have to send each core more stuff at once.

Rich is working on some cool features in the par branch that will make such parallelization a breeze meanwhile I am going to show how to manually parallelize step.

The idea is simply to split the vector in N chunks where N is the number of cores.

(defn step [board]
  (let [n (.availableProcessors (Runtime/getRuntime))
        l (count board)
        bounds (map int (range 0 (inc l) (/ l n)))]
    (apply into
      (pmap #(vec (map step1 (partition 3 1 
               (concat [(board (mod (dec %1) l))] 
                 (subvec board %1 %2) 
                 [(board (mod %2 l))]))))
        bounds (rest bounds)))))

How does it fare?

user=> (do  (let [b (rand-board 80 80)] (time ((apply comp (repeat 1000 step)) b))) nil)
"Elapsed time: 3672.935876 msecs"
nil
user=> (do  (let [b (rand-board 200 200)] (time ((apply comp (repeat 100 step)) b))) nil)
"Elapsed time: 2154.728325 msecs"
nil

3.7ms/step on a 80*80 board, 21.5ms/step on the 200*200 board. It is indeed better than plain pmap and cuts off 30% of the run time. Now this is nearly 5 times faster than my initial attempt!

You can view the whole source code (including variants) here and follow me on Twitter there.

See this subsequent post for more optimizations (another 2x improvement).

8 Comments »

  1. Great writeup, thanks for posting that. It might be fun for someone to implement the Hashlife algorithm (http://en.wikipedia.org/wiki/Hashlife) on top of this.

    Thanks for the inspiration!
    Patrick

    Comment by Patrick — 17 November 2009 @ 14 h 05 min
  2. Excellent work Christophe! Probably the best single post on optimizing Clojure.

    Comment by dnolen — 17 November 2009 @ 20 h 29 min
  3. Nice one CG :)

    Comment by Timothy Pratley — 18 November 2009 @ 3 h 22 min
  4. [...] in Life of Brian I showed several tricks to make an implementation [...]

  5. Very cool. Good work testing everything one piece at a time too.

    Just curious – what happened when you put all the optimizations together?

    Comment by devlinsf — 18 November 2009 @ 16 h 57 min
  6. @devlinsf each timing include the current optimization *and* the previous ones so the net speed-up is 5x for this post, 10x if you include its sequel.

    Comment by cgrand — 18 November 2009 @ 17 h 16 min
  7. [...] Life of Brian [...]

  8. [...] with only 2 outliers in 60 samples. This is a very good result and similar in speed to Christophes Brians Brain sim, which also did 4 neighbor look-ups per cell in about 2 ms. But while Criterium tells us how [...]

RSS feed for comments on this post. TrackBack URI

Leave a comment

(c) 2014 Clojure and me | powered by WordPress with Barecity