### Decaying lists: log scale for lists

pondering — cgrand, 12 February 2013 @ 13 h 44 min

The concept of exponentially decaying lists by David Barbour piqued my interest, so I implemented it. It can be summed up as: a log scale for lists!

As with a log scale, decaying lists make it possible to manage a large range of values and thus to get a better understanding of the data.

So a decaying list grows logarithmically with the number of `conj`ed items. It follows that some items are dropped when others are inserted. Let’s see:

```
=> (seq (into (decaying-list) (range 100)))
(99 98 97 96 95 94 92 91 90 89 88 87 86 84 83 80 70 65 61 59 57 49 44 30 29 27 25 23 22 18 12 4 0)
```

So most of the recent items are there, but the older elements have a greater probability of having been dropped.

A decaying list is parametrized by its half-life. The default half-life is 20: it means that an item in the list has one chance out of two to “survive” the insertion of 20 other items.
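This half-life definition translates directly into a survival probability; here is a quick back-of-envelope helper (illustrative only, not part of the library):

```clojure
;; probability that an item survives n subsequent insertions,
;; given a half-life of h insertions (illustrative helper)
(defn survival-prob [h n]
  (Math/pow 0.5 (/ n (double h))))

(survival-prob 20 20) ; => 0.5   (one chance out of two after 20 conjs)
(survival-prob 20 60) ; => 0.125 (halved again every further 20 conjs)
```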

```
=> (seq (into (decaying-list 5) (range 100))) ; a half-life of 5 conjs
(99 98 97 96 95 94 93 91 83 81 79 78 68 61 57 54 52 31 15)
```

You can ask where the next decay, if any, is going to happen (`nil` is returned when no decay is pending):

```
=> (def dl (into (decaying-list 5) (range 100)))
#'net.cgrand.decay/dl
=> (decay-loc dl)
[(93 95 97 98 99) (92 91 89 87 85 81 77 72 71 63 54 52 46 31 20 16)]
```

So here the decay is going to happen between 93 and 92 (the first items of the “left” and “right” sequences). By default the most recent one is kept.

```
=> (seq (conj dl 100))
(100 99 98 97 95 93 91 89 87 85 81 77 72 71 63 54 52 46 31 20 16)
```

Indeed 92 has been dropped. (Repeated calls to `decay-loc` always yield the same result.)

However, we may elect to synthesize a new value rather than just keep one of the two candidates for decay. For example we can compute their average:

```
=> (seq (into (decaying-list 5
                :collapse (fn [dl]
                            (let [[[l] [r]] (decay-loc dl)]
                              (/ (+ l r) 2.0))))
          (range 100)))
(99 98 97 95.5 93.5 91.5 90 89 87.125 82.75 79.5 72.875 64.8125 60.875 58 50.732421875 27.0 17.75 11.25 2.8125)
```

Or something a bit more elaborate (and precise):

```
=> (defn stats [x]
     (if (map? x)
       x
       {:min x :max x :avg x :n 1}))
#'net.cgrand.decay/stats
=> (defn merge-stats [{mina :min maxa :max avga :avg na :n}
                      {minb :min maxb :max avgb :avg nb :n}]
     {:min (min mina minb)
      :max (max maxa maxb)
      :avg (/ (+ (* na avga) (* nb avgb)) (double (+ na nb)))
      :n (+ na nb)})
#'net.cgrand.decay/merge-stats
=> (seq (into (decaying-list 5
                :collapse (fn [dl]
                            (let [[[l] [r]] (decay-loc dl)]
                              (merge-stats (stats l) (stats r)))))
          (range 100)))
(99 98 97 {:min 92, :max 96, :avg 94.0, :n 5} 91 {:min 87, :max 90, :avg 88.5, :n 4} 86 85 84 83 {:min 80, :max 82, :avg 81.0, :n 3} {:min 78, :max 79, :avg 78.5, :n 2} 77 {:min 70, :max 76, :avg 73.0, :n 7} 69 68 {:min 49, :max 67, :avg 58.0, :n 19} {:min 41, :max 48, :avg 44.5, :n 8} {:min 36, :max 40, :avg 38.0, :n 5} 35 {:min 32, :max 34, :avg 33.0, :n 3} {:min 10, :max 31, :avg 20.5, :n 22} {:min 3, :max 9, :avg 6.0, :n 7} {:min 0, :max 2, :avg 1.0, :n 3})
```

Decaying lists can also maintain a state, which is updated after each `conj` or decay. Below the implementation, there is an example of state management.

You can also cap the length of the decaying list by using the `:capacity` option. A quick note on capacity: increasing the capacity by the half-life value (e.g. going from 1000 to 1050 when the half-life is 50) doubles the range the decaying list remembers.
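If each extra half-life of capacity halves the density of the oldest items, then the age of the oldest remembered item grows exponentially in the capacity. A back-of-envelope sketch of that claim (my own approximation, not something the library exposes):

```clojure
;; rough model: with half-life h, a capacity of c remembers conjs up to
;; an age of about h * 2^(c/h) insertions (back-of-envelope, not exact)
(defn approx-oldest-age [h c]
  (* h (Math/pow 2 (/ c (double h)))))

;; adding one half-life of capacity (1000 -> 1050 with h = 50)
;; doubles the remembered range:
(/ (approx-oldest-age 50 1050)
   (approx-oldest-age 50 1000)) ; => 2.0
```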

### Destructuring, records, protocols and named arguments

optimization,pondering — cgrand, 8 May 2010 @ 12 h 19 min

Warning: this post is full of microbenchmarks so take it with a pinch of salt.

### Destructuring a record

Field access through keyword-lookups on records is fast:

```
user=> (defrecord Foo [a b])
user.Foo
user=> (let [x (Foo. 1 2)] (dotimes [_ 5] (time (dotimes [_ 1e7] (let [a (:a x) b (:b x)] [a b])))))
"Elapsed time: 114.787424 msecs"
"Elapsed time: 102.568273 msecs"
"Elapsed time: 71.150593 msecs"
"Elapsed time: 72.217418 msecs"
"Elapsed time: 70.127489 msecs"
nil
```

But as far as I know `destructure` still emits `(get x :k)`; let's check:

```
user=> (let [x (Foo. 1 2)] (dotimes [_ 5] (time (dotimes [_ 1e7] (let [{:keys [a b]} x] [a b])))))
"Elapsed time: 968.616612 msecs"
"Elapsed time: 945.704133 msecs"
"Elapsed time: 911.290751 msecs"
"Elapsed time: 927.658125 msecs"
"Elapsed time: 916.796408 msecs"
nil
```

Actually it’s slightly slower than lookup on small maps:

```
user=> (let [x {:a 1 :b 2}] (dotimes [_ 5] (time (dotimes [_ 1e7] (let [{:keys [a b]} x] [a b])))))
"Elapsed time: 866.942377 msecs"
"Elapsed time: 746.273968 msecs"
"Elapsed time: 734.366239 msecs"
"Elapsed time: 729.346188 msecs"
"Elapsed time: 746.96393 msecs"
nil
```

### One patch later

I patched `destructure` to emit keyword-lookups:

```
user=> (let [x (Foo. 1 2)] (dotimes [_ 5] (time (dotimes [_ 1e7] (let [{:keys [a b]} x] [a b])))))
"Elapsed time: 479.911064 msecs"
"Elapsed time: 488.895167 msecs"
"Elapsed time: 464.617431 msecs"
"Elapsed time: 480.944575 msecs"
"Elapsed time: 464.969779 msecs"
nil
```

It’s better but still slower than without destructuring. Let's see what slows us down:

```
user=> (macroexpand-1 '(let [{:keys [a b]} x] [a b]))
(let* [map__3838 x map__3838 (if (clojure.core/seq? map__3838) (clojure.core/apply clojure.core/hash-map map__3838) map__3838) b (:b map__3838) a (:a map__3838)] [a b])
```

My bet is on the `if`+`seq?` test, so I remove it:

```
user=> (let [x (Foo. 1 2)] (dotimes [_ 5] (time (dotimes [_ 1e7] (let* [map__3838 x map__3838 map__3838 b (:b map__3838) a (:a map__3838)] [a b])))))
"Elapsed time: 125.188397 msecs"
"Elapsed time: 103.041099 msecs"
"Elapsed time: 70.061558 msecs"
"Elapsed time: 70.793984 msecs"
"Elapsed time: 69.759146 msecs"
nil
```

This `if`+`seq?` test is what allows for named arguments. I wonder whether this behaviour should be an option of map-destructuring (e.g. `{:keys [a b] :named-args true}`). Anyway, I had in mind that such a dispatch could be optimized with protocols.
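To make the trade-off concrete, here is what that test buys: map destructuring also accepts a key-value sequence, which is exactly how keyword (named) arguments reach a variadic function:

```clojure
;; the rest args arrive as a seq; destructure's (apply hash-map ...)
;; turns them into a map before the keyword lookups
(defn configure [& {:keys [host port]}]
  [host port])

(configure :host "localhost" :port 8080) ; => ["localhost" 8080]

;; the same destructuring works on an explicit kv seq...
(let [{:keys [a b]} (list :a 1 :b 2)] [a b]) ; => [1 2]
;; ...and, of course, on a plain map
(let [{:keys [a b]} {:a 1 :b 2}] [a b]) ; => [1 2]
```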

### Optimizing dispatch with protocols

```
user=> (defprotocol Seq (my-seq [this]) (my-seq? [this]))
Seq
user=> (extend-protocol Seq
         clojure.lang.ISeq
         (my-seq [this] (.seq this))
         (my-seq? [this] true)
         Object
         (my-seq [this] (clojure.lang.RT/seq this))
         (my-seq? [this] false))
nil
```

Let's see how `my-seq?` compares to `seq?`:

```
user=> (let [x (Foo. 1 2)] (dotimes [_ 5] (time (dotimes [_ 1e7] (let* [map__3838 x map__3838 (if (my-seq? map__3838) (clojure.core/apply clojure.core/hash-map map__3838) map__3838) b (:b map__3838) a (:a map__3838)] [a b])))))
"Elapsed time: 179.282982 msecs"
"Elapsed time: 161.781526 msecs"
"Elapsed time: 154.307042 msecs"
"Elapsed time: 155.567677 msecs"
"Elapsed time: 153.716604 msecs"
nil
```

Hence `my-seq?` is 3x faster than `seq?` which means that protocols are indeed speedy: yet another incredible piece of work by Rich Hickey!

I’m still exploring how low-level protocols fns behave in a concurrent setting, stay tuned!

Meanwhile don’t forget to register for the next European Clojure training session taught by Lau and me.

### Graph Structured Stacks in Clojure

pondering — cgrand, 16 January 2010 @ 17 h 41 min

I am currently pondering how best to represent Graph Structured Stacks in Clojure. My first idea was to use maps.

So this GSS would be represented as:

```
{7 {3 {1 {0 {}}},
    4 {1 {0 {}}},
    5 {2 {0 {}}}},
 8 {6 {2 {0 {}}}}}
```

Please note that repeated maps in this representation would be shared by construction.

It is all well and good that this representation lets me easily perform stack operations, but I also want to be able to traverse the stacks from the bottom. That means making the graph cyclic, so I thought I needed an adjacency list or, more realistically, two indexed versions (maps of sets) of this list to be able to traverse the GSS in either direction. Alternatively: the map-based representation for direct traversal plus an adjacency map for reverse traversal.

To me the most tedious part of these adjacency lists is that I have to name the GSS nodes. Wait! There are natural identifiers for these nodes: their key-value pairs in the map-based representation! `[2 {0 {}}]` is a natural identifier for node #2, and `[7 {3 {1 {0 {}}}, 4 {1 {0 {}}}, 5 {2 {0 {}}}}]` the one for node #7.
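This works because in Clojure a map entry compares equal to the corresponding `[key value]` vector, so the entries of the nested maps can be used directly as identifiers:

```clojure
;; a MapEntry is equal to the [key value] vector it represents
(first {2 {0 {}}})                ; => [2 {0 {}}]
(= (first {2 {0 {}}}) [2 {0 {}}]) ; => true
```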

Now I’m able to represent the GSS as:

```
[;; direct representation
 {7 {3 {1 {0 {}}},
     4 {1 {0 {}}},
     5 {2 {0 {}}}},
  8 {6 {2 {0 {}}}}}
 ;; adjacency map for reverse traversal
 {nil #{[0 {}]},
  [0 {}] #{[1 {0 {}}]
           [2 {0 {}}]},
  [1 {0 {}}] #{[3 {1 {0 {}}}]
               [4 {1 {0 {}}}]},
  [2 {0 {}}] #{[5 {2 {0 {}}}]
               [6 {2 {0 {}}}]},
  [3 {1 {0 {}}}] #{[7 {3 {1 {0 {}}},
                       4 {1 {0 {}}},
                       5 {2 {0 {}}}}]},
  [4 {1 {0 {}}}] #{[7 {3 {1 {0 {}}},
                       4 {1 {0 {}}},
                       5 {2 {0 {}}}}]},
  [5 {2 {0 {}}}] #{[7 {3 {1 {0 {}}},
                       4 {1 {0 {}}},
                       5 {2 {0 {}}}}]},
  [6 {2 {0 {}}}] #{[8 {6 {2 {0 {}}}}]}}]
```

Granted, the serialization is verbose, but all nodes are shared in memory by construction and I find this model rather simple.
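For what it's worth, the reverse-adjacency map need not be maintained by hand: it can be derived mechanically from the direct representation. A sketch, using today's core functions (`reduce-kv` and `update`, which postdate this post); the helper name is mine:

```clojure
;; walk the nested-map representation and, for every node [k v],
;; register it as a parent of each of its entries (its children toward
;; the bottom); bottom nodes (empty value map) are registered under nil
(defn reverse-adjacency [gss]
  (letfn [(walk [acc m]
            (reduce-kv
              (fn [acc k v]
                (let [node [k v]
                      acc  (if (seq v)
                             (reduce #(update %1 %2 (fnil conj #{}) node)
                                     acc (seq v))
                             (update acc nil (fnil conj #{}) node))]
                  (walk acc v)))
              acc m))]
    (walk {} gss)))

(reverse-adjacency {7 {3 {1 {0 {}}},
                       4 {1 {0 {}}},
                       5 {2 {0 {}}}},
                    8 {6 {2 {0 {}}}}})
;; => the adjacency map for reverse traversal shown above
```

Shared sub-maps are visited more than once, but since the parents accumulate in sets the result is unaffected.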

Does anyone have other ideas on how to functionally implement such a structure?