WideFinder 2 in Clojure (naive port from Ruby)

wide finder 2 — cgrand, 13 June 2008 @ 17 h 31 min

I ported the reference implementation of Wide Finder 2 from Ruby to Clojure nearly line by line.
On my box, this code is more than 25% faster than the original Ruby when processing 10M lines (2’45” versus 3’45”), but Ruby is faster on inputs of up to 100k lines.

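;; Accumulators; they are rebound by the `binding` form further down.
;; Recent Clojure versions require ^:dynamic on vars that are rebound and set!.
(def ^:dynamic u-hits)
(def ^:dynamic u-bytes)
(def ^:dynamic s404s)
(def ^:dynamic clients)
(def ^:dynamic refs)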

(defmacro acc [h k v]
  `(set! ~h (assoc ~h ~k (+ (get ~h ~k 0) ~v))))

(defn top [n h]
  ;; the n entries with the largest values, in descending order
  (take n (sort-by val > h)))

(defn record [client u bytes ref]
  (acc u-bytes u bytes)
  (when (re-matches #"^/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+$" u)
    (acc u-hits u 1)
    (acc clients client 1)
    (when-not (or (= ref "\"-\"") (re-find #"^\"http://www.tbray.org/ongoing/" ref))
      (acc refs (subs ref 1 (dec (count ref))) 1)))) ; lose the quotes

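;; printf that always formats with the English locale.
;; On recent Clojure this shadows clojure.core/printf (with a warning).
(defn printf [^String fmt & args]
  (let [f (java.util.Formatter. *out*)]
     (.format f (. java.util.Locale ENGLISH) fmt (to-array args))))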

(defn report 
  ([label hash] (report label hash false))
  ([label hash shrink]
    (println (str "Top " label ":"))
    (let [fmt (if shrink " %9.1fM: %s\n" " %10d: %s\n")]
      (doseq [[key val] (top 10 hash)]
        (let [key (if (< 60 (count key)) (str (subs key 0 60) "...") key)
              val (if shrink (/ val 1024.0 1024.0) val)] ; floating-point division so %f can format it
          (printf fmt val key))))))

(binding [u-hits {} u-bytes {} s404s {} clients {} refs {}]
  (doseq [line (-> (. System in) (java.io.InputStreamReader. "US-ASCII") java.io.BufferedReader. line-seq)]
    (let [f (.split #"\s+" line)]
      (when (= "\"GET" (get f 5))
        (let [[client u status bytes ref] (map #(get f %) [0 6 8 9 10])]
          (cond
            (= "200" status) (record client u (Integer/parseInt bytes) ref)
            (= "304" status) (record client u 0 ref)
            (= "404" status) (acc s404s u 1))))))

  (print (count u-hits) "resources," (count s404s) "404s," (count clients) "clients\n\n")

  (report "URIs by hit" u-hits)
  (report "URIs by bytes" u-bytes true)
  (report "404s" s404s)
  (report "client addresses" clients)
  (report "referrers" refs))
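
To see why field positions 0, 5, 6, 8, 9 and 10 are the ones the main loop looks at, here is a small sketch that splits a single line the same way. The log line is made up for illustration (it follows the Apache combined log format), and the names `sample-line`, `fields`, `picked` and `article?` are mine, not part of the port.

```clojure
;; A made-up log line (hypothetical data, Apache combined log format).
(def sample-line
  (str "10.0.0.1 - - [13/Jun/2008:17:31:00 -0700] "
       "\"GET /ongoing/When/200x/2008/06/13/Example HTTP/1.1\" "
       "200 1234 \"http://example.com/\" \"Mozilla/5.0\""))

;; Split on runs of whitespace, as the main loop does.
(def fields (vec (.split #"\s+" sample-line)))

;; Pick out the same positions: client, URI, status, bytes, referrer.
(def picked (map #(get fields %) [0 6 8 9 10]))

;; The URI test used by `record`: does this look like an article?
(def article?
  (boolean (re-matches #"^/ongoing/When/\d\d\dx/\d\d\d\d/\d\d/\d\d/[^ .]+$"
                       (get fields 6))))

(println (get fields 5)) ; the request method, still carrying its opening quote
(println picked)
(println article?)
```

Because the split is on whitespace only, the method keeps the opening quote of the request string and the referrer keeps both of its quotes, which is why the code compares against `"\"GET"` and strips the quotes from the referrer before counting it.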

My next post will show how to achieve some parallelization without changing the logic much. Only the start of the main loop changes: doseq becomes pdoseq and gains a vector pairing each accumulator with (merge-with +):

(pdoseq line (-> (. System in) (java.io.InputStreamReader. "US-ASCII") java.io.BufferedReader. line-seq)
    [u-hits (merge-with +), u-bytes (merge-with +), s404s (merge-with +), clients (merge-with +), refs (merge-with +)]
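
The `(merge-with +)` entries suggest that each worker builds its own counts and that the partial maps are then summed key by key. The sketch below illustrates only that merging idea, using `pmap`, `frequencies` and `merge-with` from clojure.core. It is not the pdoseq macro, and `count-chunk`, `parallel-count` and the chunk size are hypothetical names and values chosen for the example.

```clojure
;; Count occurrences of each item within one chunk (hypothetical helper).
(defn count-chunk [items]
  (frequencies items))

;; Count the chunks in parallel, then sum the partial maps key by key.
(defn parallel-count [chunk-size items]
  (->> (partition-all chunk-size items)
       (pmap count-chunk)
       (apply merge-with +)))

(println (parallel-count 2 ["/a" "/b" "/a" "/c" "/a"]))
```

Since addition is associative and commutative, the order in which the partial maps are merged does not affect the totals. When run as a standalone script, a call to `(shutdown-agents)` at the very end lets the JVM exit promptly after `pmap` has been used.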
