Are pipe dreams made of promises?

unsorted — cgrand, 18 November 2009 @ 19 h 32 min

follow-up: pipe dreams aren’t necessarily made of promises

(defn pipe
 "Returns a pair: a seq (the read end) and a function (the write end).
  The function takes either no argument (to close the pipe)
  or one argument (which is appended to the seq). Read is blocking."
 []
  (let [promises (atom (repeatedly promise))
        p (second @promises)]
    [(lazy-seq @p) 
     (fn 
       ([] ;close the pipe
         (let [[a] (swap! promises #(vector (second %)))]
           (if a
             (deliver a nil) 
             (throw (Exception. "Pipe already closed")))))
       ([x] ;enqueue x
         (let [[a b] (swap! promises next)]
           (if (and a b)
             (do 
               (deliver a (cons x (lazy-seq @b)))
               x)
             (throw (Exception. "Pipe already closed"))))))]))

Beware: don't print the seq while the pipe is still open, since reads block until data is delivered or the pipe is closed!

(let [[q f] (pipe)]
  (future (doseq [x q] (println x)) (println "that's all folks!"))
  (doseq [x (range 10)]
    (f x))
  (f)) ;close the pipe
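
Running this should print the numbers 0 to 9 from the reader thread, followed by "that's all folks!" once the pipe is closed.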

An optimization job is never done

optimization — cgrand, @ 12 h 26 min

… when you don’t set measurable goals.

Yesterday, in Life of Brian, I showed several tricks to make an implementation of Brian’s Brain faster. I thought it was enough but, since some disagreed, here comes another perf boost.

Repetitive lookups

In neighbours-count the first arg (which happens to be a seq of 3 items) is destructured:

(defn neighbours-count [[above current below] i w]

which means there are 3 lookups per :off cell, even though the results of these lookups are constant for a given row!

So let’s move these lookups into step1:

(defn neighbours-count [above current below i w]
  (let [j (mod (inc i) w)
        k (mod (dec i) w)
        s #(if (= (aget #^objects %1 (int %2)) :on) 1 0)]
    (+ (+ (+ (s above j) (s above i))
         (+ (s above k) (s current j)))
      (+ (+ (s current k) (s below j))
        (+ (s below i) (s below k))))))
 
(defn step1 [[above #^objects current below]]
  (let [w (count current)]
    (loop [i (int (dec w)), #^objects row (aclone current)]
      (if (neg? i)
        row
        (let [c (aget current i)]
          (cond
            (= c :on)
              (recur (dec i) (doto row (aset i :dying)))
            (= c :dying)
              (recur (dec i) (doto row (aset i :off)))
            (= 2 (neighbours-count above current below i w))
              (recur (dec i) (doto row (aset i :on)))
            :else
              (recur (dec i) row)))))))

Benchmark:

user=> (do  (let [b (rand-board 80 80)] (time ((apply comp (repeat 1000 step)) b))) nil)
"Elapsed time: 2058.372014 msecs"
nil
user=> (do  (let [b (rand-board 200 200)] (time ((apply comp (repeat 100 step)) b))) nil)
"Elapsed time: 1105.499303 msecs"
nil

Wow, twice as fast, again! 2.1ms/step on an 80*80 board, 11.1ms/step on a 200*200 board.

On a 200*200 board, this version is thus nearly 10x faster than my first one!

Life of Brian

optimization — cgrand, 17 November 2009 @ 13 h 24 min

Some weeks ago, Lau Jensen implemented Brian’s Brain but gave up on making it go faster. He reached out for my advice but I was swamped with work.

First cut

Here is my take:

(defn neighbours-count [rows i w]
  (count (for [j [(mod (inc i) w) i (mod (dec i) w)]
               r rows :when (= :on (r j))] r)))

(defn step1 [rows]
  (let [current (second rows)
        w (count current)]
    (loop [i (dec w), row (transient current)]
      (if (neg? i)
        (persistent! row)
        (let [c (current i)]
          (cond
            (= c :on) 
              (recur (dec i) (assoc! row i :dying))
            (= c :dying) 
              (recur (dec i) (assoc! row i :off))
            (= 2 (neighbours-count rows i w))
              (recur (dec i) (assoc! row i :on))
            :else
              (recur (dec i) row)))))))

(defn step [board]
  (vec (map step1 
         (partition 3 1 (concat [(peek board)] board [(first board)])))))

A board is a vector of vectors (rows) containing either :on, :dying or :off.
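
For illustration, a tiny 3*3 board could look like this:

[[:off :on    :off]
 [:on  :dying :off]
 [:off :off   :on]]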

If you look closely at step you should recognize torus-window from Lau’s Brian’s functional brain. I must confess I am the one who gave Lau the idea of the seq-heavy implementation — just to prove it was doable without indices.
This is why I chose to keep clear, functional, index-free processing at the row level. Only the cell level is worth optimizing.

step1 steps a single row. It takes as its only argument a seq of 3 rows (above, current and below) and uses a transient vector to compute the next state of the current row. I tried to keep neighbours-count simple while reducing the number of intermediate seqs (fewer allocations, less GC, faster code) by using a seq comprehension (for) rather than several higher-order functions.

How does it run?

First a little utility to create random boards (with one third of :on cells):

(defn rand-board [w h]
  (vec (map vec (partition w (take (* w h) (repeatedly #(if (zero? (rand-int 3)) :on :off)))))))

Then the actual benchmark:

user=> (do (let [b (rand-board 80 80)] (time ((apply comp (repeat 1000 step)) b))) nil)
"Elapsed time: 16734.609776 msecs"
nil
user=> (do (let [b (rand-board 200 200)] (time ((apply comp (repeat 100 step)) b))) nil)
"Elapsed time: 10200.273708 msecs"
nil

So I get 16.7ms/step on a 80*80 board (where Lau was stuck at 200ms on his box) and 102ms/step on a 200*200 board.

Optimizations

Unrolling neighbours-count

neighbours-count is called once for each :off cell and is rather complex, so making it go faster should yield noticeable gains. This is why I unrolled it:

(defn neighbours-count [[above current below] i w]
  (let [j (mod (inc i) w)
        k (mod (dec i) w)
        s #(if (= (%1 %2) :on) 1 0)]
    (+ (+ (+ (s above j) (s above i))
         (+ (s above k) (s current j)))
      (+ (+ (s current k) (s below j))
        (+ (s below i) (s below k))))))

I paired the sum terms because currently only math ops with two args are inlined.
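
To make the pairing concrete, here is a sketch of the two shapes, with a, b and c standing for individual (s …) terms:

(+ a b c)     ; one variadic call, goes through the general n-ary code path
(+ (+ a b) c) ; nested two-arg calls, each eligible for inlining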

user=> (do (let [b (rand-board 80 80)] (time ((apply comp (repeat 1000 step)) b))) nil)
"Elapsed time: 7880.492493 msecs"
nil
user=> (do (let [b (rand-board 200 200)] (time ((apply comp (repeat 100 step)) b))) nil)
"Elapsed time: 4679.679401 msecs"
nil

More than twice as fast! 7.9ms/step for an 80*80 board and 46.8ms/step for a 200*200 one.

Using java arrays

Since there is little sharing between two iterations, vectors are not that useful, so why not try using arrays to represent rows? The board is then a vector of arrays:

(defn neighbours-count [[above current below] i w]
  (let [j (mod (inc i) w)
        k (mod (dec i) w)
        s #(if (= (aget #^objects %1 (int %2)) :on) 1 0)]
    (+ (+ (+ (s above j) (s above i))
         (+ (s above k) (s current j)))
      (+ (+ (s current k) (s below j))
        (+ (s below i) (s below k))))))

(defn step1 [rows]
  (let [#^objects current (second rows)
        w (count current)]
    (loop [i (int (dec w)), #^objects row (aclone current)]
      (if (neg? i)
        row
        (let [c (aget current i)]
          (cond
            (= c :on) 
              (recur (dec i) (doto row (aset i :dying)))
            (= c :dying) 
              (recur (dec i) (doto row (aset i :off)))
            (= 2 (neighbours-count rows i w))
              (recur (dec i) (doto row (aset i :on)))
            :else
              (recur (dec i) row)))))))

(defn step [board]
  (vec (map step1 
         (partition 3 1 (concat [(peek board)] board [(first board)])))))

(defn rand-board [w h]
  (vec (map #(into-array Object %) (partition w (take (* w h) (repeatedly #(if (zero? (rand-int 3)) :on :off)))))))

Benchmark numbers:

user=> (do (let [b (rand-board 80 80)] (time ((apply comp (repeat 1000 step)) b))) nil)
"Elapsed time: 5155.422268 msecs"
nil
user=> (do (let [b (rand-board 200 200)] (time ((apply comp (repeat 100 step)) b))) nil)
"Elapsed time: 2964.524262 msecs"
nil

Arrays cut one third off the runtime: we are down to 5.2ms/step for an 80*80 board and 29.6ms/step for a 200*200 one. This means the overall speedup is greater than 3x!

Adding a swing frontend

I decoupled the rendering from the computation: the display refreshes at 25 fps no matter how many steps are actually computed. If the computation is fast, not all steps are rendered (e.g. for an 80*80 board at 200 steps per second, only one step out of eight is rendered to screen).

(import '(javax.swing JFrame JPanel) 
        '(java.awt Color Graphics2D))

(defn dims [[row :as board]] [(count row) (count board)])

(defn render-board [board #^Graphics2D g cell-size]
  (let [[w h] (dims board)]
    (doto g
      (.setColor Color/BLACK)
      (.scale cell-size cell-size)
      (.fillRect 0 0 w h))
    (doseq [i (range h) j (range w)]
      (when-let [color ({:on Color/WHITE :dying Color/GRAY} ((board i) j))]
        (doto g
          (.setColor color)
          (.fillRect j i 1 1))))))

(defn start-brian [board cell-size]
  (let [[w h] (dims board)
        switch (atom true)
        board (atom board)
        panel (doto (proxy [JPanel] []
                      (paint [g] (render-board @board g cell-size)))
                (.setDoubleBuffered true))]
    (doto (JFrame.) 
      (.addWindowListener (proxy [java.awt.event.WindowAdapter] []
                            (windowClosing [_] (reset! switch false))))
      (.setSize (* w cell-size) (* h cell-size))
      (.add panel)
      .show)
    (future (while @switch (swap! board step)))
    (future (while @switch (.repaint panel) (Thread/sleep 40)))))

(start-brian (rand-board 200 200) 5)

The rendering code is slightly different when using arrays:

(defn render-board [board #^Graphics2D g cell-size]
  (let [[w h] (dims board)]
    (doto g
      (.setColor Color/BLACK)
      (.scale cell-size cell-size)
      (.fillRect 0 0 w h))
    (doseq [i (range h) j (range w)]
      (when-let [color ({:on Color/WHITE :dying Color/GRAY} (aget #^objects (board i) (int j)))]
        (doto g
          (.setColor color)
          (.fillRect j i 1 1))))))

Putting all cores to work

The first attempt at parallelizing is to replace map with pmap in step.

(defn step [board]
  (vec (pmap step1 
         (partition 3 1 (concat [(peek board)] board [(first board)])))))

This does not yield much of a gain:

user=> (do  (let [b (rand-board 80 80)] (time ((apply comp (repeat 1000 step)) b))) nil)
"Elapsed time: 5624.407083 msecs"
nil
user=> (do  (let [b (rand-board 200 200)] (time ((apply comp (repeat 100 step)) b))) nil)
"Elapsed time: 2778.74241 msecs"
nil

Performance is slightly worse on the 80*80 board and slightly better on the 200*200 one. The reason is that processing a single row is not computationally intense enough to amortize the parallelization overhead, so I have to send each core more work at once.

Rich is working on some cool features in the par branch that will make such parallelization a breeze; meanwhile, I am going to show how to manually parallelize step.

The idea is simply to split the vector into N chunks, where N is the number of cores.

(defn step [board]
  (let [n (.availableProcessors (Runtime/getRuntime))
        l (count board)
        bounds (map int (range 0 (inc l) (/ l n)))]
    (apply into
      (pmap #(vec (map step1 (partition 3 1 
               (concat [(board (mod (dec %1) l))] 
                 (subvec board %1 %2) 
                 [(board (mod %2 l))]))))
        bounds (rest bounds)))))
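
To make the chunking concrete, on a (hypothetical) 4-core machine with a 200-row board the bounds work out as follows:

(let [l 200 n 4]
  (map int (range 0 (inc l) (/ l n))))
; (0 50 100 150 200)
; i.e. pmap hands each core a 50-row slice (plus its two wrap-around neighbour rows)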

How does it fare?

user=> (do  (let [b (rand-board 80 80)] (time ((apply comp (repeat 1000 step)) b))) nil)
"Elapsed time: 3672.935876 msecs"
nil
user=> (do  (let [b (rand-board 200 200)] (time ((apply comp (repeat 100 step)) b))) nil)
"Elapsed time: 2154.728325 msecs"
nil

3.7ms/step on an 80*80 board, 21.5ms/step on the 200*200 one. It is indeed better than plain pmap and cuts off 30% of the runtime. This is now nearly 5 times faster than my initial attempt!

You can view the whole source code (including variants) here and follow me on Twitter there.

See this subsequent post for more optimizations (another 2x improvement).

Enlive Google group

unsorted — cgrand, 15 October 2009 @ 21 h 36 min

I created a Google group dedicated to Enlive — my CSS-based HTML/XML transformation and extraction library.

Taming multidim Arrays

interop — cgrand, @ 20 h 04 min

Utilities

(defn array? [x] (-> x class .isArray))
(defn see [x] (if (array? x) (map see x) x))

Creation

Blank and rectangular:

(see (make-array Double/TYPE 3 2))
; ((0.0 0.0) (0.0 0.0) (0.0 0.0))

From a nested collection:

(see (into-array (map (partial into-array Double/TYPE) [[1 2 3 4] [5 6]])))
; ((1.0 2.0 3.0 4.0) (5.0 6.0))

Reading

A multidim array:

(def dd (make-array Double/TYPE 1000 1000))

Naive method:

(time (dotimes [i 1000] (dotimes [j 1000] (aget dd i j))))
; "Elapsed time: 455.514388 msecs"

Step-by-step (allow inlining):

(time (dotimes [i 1000] (dotimes [j 1000] (-> dd (aget i) (aget j)))))
; "Elapsed time: 257.625829 msecs"

(Not so) hinted step-by-step (the #^doubles hint is ignored because of inline expansion):

(time (dotimes [i 1000] (dotimes [j 1000] (-> #^objects dd (#^doubles aget i) (aget j)))))
; "Elapsed time: 157.095175 msecs"

Fully hinted step-by-step:

(time (dotimes [i 1000] (dotimes [j 1000] (let [#^doubles a (aget #^objects dd i)] (aget a j)))))
; "Elapsed time: 17.412548 msecs"

Let’s embody this knowledge in a nice macro:

(defmacro deep-aget
  ([hint array idx]
    `(aget ~(vary-meta array assoc :tag hint) ~idx))
  ([hint array idx & idxs]
    `(let [a# (aget ~(vary-meta array assoc :tag 'objects) ~idx)]
       (deep-aget ~hint a# ~@idxs))))

Does it work?

(time (dotimes [i 1000] (dotimes [j 1000] (deep-aget doubles dd i j))))
; "Elapsed time: 16.937279 msecs"

Updating

Naive method:

(time (dotimes [i 1000] (dotimes [j 1000] (aset dd i j 42.0))))
; "Elapsed time: 13859.404896 msecs"

Partially hinted step-by-step:

(time (dotimes [i 1000] (dotimes [j 1000] (-> #^objects dd (aget i) (aset j 42.0)))))
; "Elapsed time: 94.699536 msecs"

Fully hinted step-by-step (take 1):

(time (dotimes [i 1000] (dotimes [j 1000] (let [#^doubles a (aget #^objects dd i)] (aset a j 42.0)))))
; java.lang.IllegalArgumentException: More than one matching method found: aset

Given the hints (double[], int, Object), the compiler doesn’t know which aset overload to pick between (Object[], int, Object) and (double[], int, double). That’s why one needs to hint 42.0:

(time (dotimes [i 1000] (dotimes [j 1000] (let [#^doubles a (aget #^objects dd i)] (aset a j (double 42.0))))))
; "Elapsed time: 19.561983 msecs"

A nice macro to abstract this away:

(defmacro deep-aset [hint array & idxsv]
  (let [hints '{doubles double ints int} ; writing a comprehensive map is left as an exercise to the reader
        [v idx & sxdi] (reverse idxsv)
        idxs (reverse sxdi)
        v (if-let [h (hints hint)] (list h v) v)
        nested-array (if (seq idxs)
                       `(deep-aget ~'objects ~array ~@idxs)
                        array)
        a-sym (with-meta (gensym "a") {:tag hint})]
      `(let [~a-sym ~nested-array]
         (aset ~a-sym ~idx ~v))))

Does it work?

(time (dotimes [i 1000] (dotimes [j 1000] (deep-aset doubles dd i j 42.0))))
; "Elapsed time: 19.439133 msecs"

(All tests run on jdk 1.6.0_14 32bit, options -server -XX:+DoEscapeAnalysis on a processor with a 6MB L2 cache.)

Clojure and Me has moved

unsorted — cgrand, 26 September 2009 @ 16 h 04 min

I’m sorry if my previous posts reappeared in your RSS reader. I just moved Clojure and me from blogger to cgrand.net.

About the subtitle

unsorted — cgrand, 13 August 2009 @ 10 h 54 min

The subtitle for this blog may be misleading. Actually it reads “When the pupil is ready to learn, a teacher will appear.”
I picked this Zen proverb because in early 2008 I was looking for a language with the following requirements: strong metaprogramming support, strong concurrency support, a strong functional bias, and bonus points for running on the JVM. I had prepared myself not to find such a rare bird when I met Clojure (and it was nearly love at first sight).
So in that perspective I am the pupil and Clojure (the language, Rich Hickey and the whole community) the teacher.

Clojure as fast as Java!

unsorted — cgrand, 6 August 2009 @ 17 h 08 min

I wrote a prototype implementation of IPersistentSet in Clojure using new new and, surprisingly, on its first benchmark it’s already on par with Java-based implementations.

(dotimes [i 10]
  (let [n (int (Math/pow 2 (+ 10 i)))
        _ (println "n=" n)
        a (do
            (print "clojure\t")
            (time (doall (let [s (reduce conj empty-set (range 0 n 2))] (map #(get s %) (range n))))))
        b (do
            (print "java\t")
            (time (doall (let [s (reduce conj #{} (range 0 n 2))] (map #(get s %) (range n))))))]
    (println (= a b))))

;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;;

n= 1024
clojure "Elapsed time: 1.006064 msecs"
java "Elapsed time: 0.736197 msecs"
true
n= 2048
clojure "Elapsed time: 1.813009 msecs"
java "Elapsed time: 1.328102 msecs"
true
n= 4096
clojure "Elapsed time: 3.590191 msecs"
java "Elapsed time: 2.608153 msecs"
true
n= 8192
clojure "Elapsed time: 7.046566 msecs"
java "Elapsed time: 5.807302 msecs"
true
n= 16384
clojure "Elapsed time: 16.015862 msecs"
java "Elapsed time: 10.284897 msecs"
true
n= 32768
clojure "Elapsed time: 29.803928 msecs"
java "Elapsed time: 23.850378 msecs"
true
n= 65536
clojure "Elapsed time: 68.79778 msecs"
java "Elapsed time: 63.947582 msecs"
true
n= 131072
clojure "Elapsed time: 132.082499 msecs"
java "Elapsed time: 113.433411 msecs"
true
n= 262144
clojure "Elapsed time: 292.149631 msecs"
java "Elapsed time: 265.39197 msecs"
true
n= 524288
clojure "Elapsed time: 595.265321 msecs"
java "Elapsed time: 698.711009 msecs"
true

To tell the truth, this benchmark is a bit unfair to Java: no dummy map and slightly different algorithms (half-full bitmap nodes become array-nodes, no leaf-nodes, etc.).
Follow me on Twitter, here

What *warn-on-reflection* doesn’t tell you about arrays

optimization — cgrand, @ 9 h 12 min
user=> (time (let [a (make-array Object 100)] (dotimes [i 10000000] (aset a (rem i 100) nil))))
Reflection warning, NO_SOURCE_PATH:840 - call to aset can't be resolved.
"Elapsed time: 136063.015272 msecs"

With aget/aset, the index must be hinted to be an int:

user=> (time (let [a (make-array Object 100)] (dotimes [i 10000000] (aset a (int (rem i 100)) nil))))
"Elapsed time: 1064.546402 msecs"

Wow, more than 100x faster (reflection is bad), but even though the compiler no longer complains, one can still help it choose a faster path:

user=> (time (let [a #^"[Ljava.lang.Object;" (make-array Object 100)] (dotimes [i 10000000] (aset a (int (rem i 100)) nil))))
"Elapsed time: 247.446882 msecs"

On the whole we get a 500x speed-up with only two type hints.

Everybody loves the Sieve of Eratosthenes

unsorted — cgrand, 30 July 2009 @ 11 h 27 min

Judging by the traffic logs for this blog, many newcomers want to compute prime numbers in Clojure.

A recent thread on the mailing list prompted me to test an idea I’d had for a while: using a map to implement the Sieve of Eratosthenes.

The first implementation was this one:

(defn primes [max]
  (let [enqueue (fn [sieve n factor]
                  (let [m (+ n factor)]
                    (assoc sieve m
                      (conj (sieve m) factor))))
        next-sieve (fn [sieve candidate]
                     (if-let [factors (sieve candidate)]
                       (reduce #(enqueue %1 candidate %2)
                         (dissoc sieve candidate)
                         factors)
                       (enqueue sieve candidate candidate)))]
    (apply concat (vals (reduce next-sieve {} (range 2 max))))))

where the sieve is a map from upcoming non-prime numbers to their known prime factors. It’s so naive that even numbers are tested for primality, but it doesn’t perform that badly: on my box, it takes 3s to compute all the primes below 1,000,000 (roughly as fast as clojure.contrib.lazy-seqs/primes when the seq isn’t yet memoized).
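
For instance, once the candidates 2 to 6 have been processed, the sieve holds (ignoring entry order):

{8 (2), 9 (3), 10 (5)} ; the next multiples of 2, 3 and 5 still to be crossed off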

I wasn’t happy with the way I handled the case where a non-prime was already checked off (i.e. was already a key of the map): I was conjing onto the list of prime factors for that number. If instead I tried to check off n+p (where n is the already-known non-prime and p the current prime), or n+2p, or n+3p… until I found a yet-unknown non-prime, I wouldn’t need to maintain a list of factors, nor to conj onto it or reduce over it. And I was hoping that fewer allocations would yield better performance.

Here is the second iteration:

(defn primes2 [max]
  (let [enqueue (fn [sieve n factor]
                  (let [m (+ n factor)]
                    (if (sieve m)
                      (recur sieve m factor)
                      (assoc sieve m factor))))
        next-sieve (fn [sieve candidate]
                     (if-let [factor (sieve candidate)]
                       (-> sieve
                         (dissoc candidate)
                         (enqueue candidate factor))
                       (enqueue sieve candidate candidate)))]
    (vals (reduce next-sieve {} (range 2 max)))))

and it computes all the primes below 1,000,000 in 1.8s instead of 3s but it still tests even numbers for primality.

primes3 is primes2 modified to only test odd numbers:

(defn primes3 [max]
  (let [enqueue (fn [sieve n factor]
                  (let [m (+ n (+ factor factor))]
                    (if (sieve m)
                      (recur sieve m factor)
                      (assoc sieve m factor))))
        next-sieve (fn [sieve candidate]
                     (if-let [factor (sieve candidate)]
                       (-> sieve
                         (dissoc candidate)
                         (enqueue candidate factor))
                       (enqueue sieve candidate candidate)))]
    (cons 2 (vals (reduce next-sieve {} (range 3 max 2))))))

and it computes the same list of primes in 1.5s.

Out of curiosity, I wrote a lazy version of primes3:

(defn lazy-primes3 []
  (letfn [(enqueue [sieve n step]
            (let [m (+ n step)]
              (if (sieve m)
                (recur sieve m step)
                (assoc sieve m step))))
          (next-sieve [sieve candidate]
            (if-let [step (sieve candidate)]
              (-> sieve
                (dissoc candidate)
                (enqueue candidate step))
              (enqueue sieve candidate (+ candidate candidate))))
          (next-primes [sieve candidate]
            (if (sieve candidate)
              (recur (next-sieve sieve candidate) (+ candidate 2))
              (cons candidate 
                (lazy-seq (next-primes (next-sieve sieve candidate) 
                            (+ candidate 2))))))]
    (cons 2 (lazy-seq (next-primes {} 3)))))
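
A quick sanity check:

(take 10 (lazy-primes3))
; (2 3 5 7 11 13 17 19 23 29)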

and, surprisingly (better locality?), it computes the primes below 1,000,000 in 1s.
