Optimal justification of text with Dynamic Programming

unsorted — cgrand, 29 August 2014 @ 11 h 46 min

This is an exercise in dynamic programming I found on /ftp.

The goal is to tightly arrange a given sequence of n words within page margins, maximizing overall neatness. To be more precise, we wish to minimize the sum, over all lines except the last, of the cubes of the number of blank characters at the end of each line. See the comments in the code for more details.

The reference algorithm has O(n^2) time and space complexity.

Below is my take on it; I believe its complexity to be O(n*width). I achieve linearity in the word count by laying out the text back to front: the key insight is that the optimal layout of a suffix of the words depends only on the amount of space left on the current line, not on the layout of the words before it.

(defn layout [words width]
  (let [cat (fn [word {u :ugliness [l & ls] :lines}] ; adds the word to the start of the first line
              {:ugliness u :lines (cons (cons word l) ls)})
        br (fn [rem {u :ugliness ls :lines}] ; adds a break (creates a new line)
              {:ugliness (+ u (* rem rem rem)) :lines (cons () ls)})
        layout ; layout words, knowing that the first line shouldn't have more than rem characters
          (fn [layout words rem]
            (if-let [[word & ws] (seq words)]
              (cond
                (= rem width) ; blank line ahead
                  (cat word (layout ws (- rem (count word))))
                (< (count word) rem) ; enough room for a space and the current word
                  (cat word
                    (min-key :ugliness
                      (layout ws (- rem (count word) 1))
                      (br rem (layout ws width))))
                :else (br rem (layout words width)))
              {:ugliness 0 :lines (list ())}))
        mlayout (memoize layout)
        layout (fn self [words rem] (mlayout self words rem))]
    (:lines (layout words width))))
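
For example (a quick check of mine, not from the exercise statement), four words laid out in a width of 10:

=> (layout ["The" "quick" "brown" "fox"] 10)
(("The" "quick") ("brown" "fox"))

“The quick” leaves a single trailing blank (ugliness 1) and “brown” would not fit on the first line, so this is the optimal layout.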

This is an exercise I prepared for a Lambda Next workshop but that we never used.

Macros, closures and unexpected object retention

unsorted — cgrand, 11 September 2013 @ 12 h 37 min

The common advice about macros is that they should emit as little code as possible and delegate to ancillary functions as soon as possible. Here is an example from clojure.java.jdbc:

(defmacro transaction
  [& body]
  `(transaction* (fn [] ~@body)))

I still think this is good advice but it has unintended consequences. The problem with this piece of code is that all closed-over objects in body are going to be retained longer than expected, longer than they would have been retained if the macro had emitted all the logic implemented in transaction* instead of delegating to it. (See this discussion as an example of issues created by such code.)

The closure object has references to all closed-over objects and, since a closure can be called many times, it can’t get rid of them. So the only time they become garbage collectible is once the closure itself becomes collectible… and a closure can’t be collected while it’s executing.

Helpfully, there’s a low-level feature for that:

(defmacro transaction
  [& body]
  `(transaction* (^:once fn* [] ~@body)))

It instructs the compiler that the closure is one-shot and that closed-over references should be cleared, thus allowing referenced objects to be garbage collected before the closure returns.
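
To make the retention concrete, here is a sketch of my own (not from clojure.java.jdbc; the exact range length at which the first form blows up depends on your heap settings):

;; the closure's field keeps the head of xs alive for the whole reduce,
;; so every cons cell is retained at once, likely exhausting the heap
(let [xs (map identity (range 1e8))]
  ((fn [] (reduce + xs))))

;; with ^:once the compiler clears the closed-over field, so the seq
;; can be garbage collected while it is being reduced
(let [xs (map identity (range 1e8))]
  ((^:once fn* [] (reduce + xs))))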

This problem is not specific to macros but can easily be solved in most cases: the closure is an implementation detail and the macro writer knows enough about its life-cycle to fix it. However, any regular closure (fn or reify) is going to prevent its closed-over objects from being garbage collected while one of its (Java) methods is running, because the closure is referenced by the stack.

During the last LambdaNext workshop a delegate stumbled on such a case while experimenting with reducers (and incidentally it made me understand a memory issue I worked around last year):

=> (time (reduce + 0 (map identity (range 1e8))))
"Elapsed time: 5729.579 msecs"
4999999950000000
=> (time (reduce + 0 (r/map identity (range 1e8))))
;; Interrupting...
Expression was interrupted: null

(Depending on your memory settings, you may have to tweak the length of the range to exhibit the problem; more details here)

If one modifies reducers to not use (java) methods but external extensions:

(in-ns 'clojure.core.reducers)

(defrecord Folder [coll xf])

(defn folder
  "Given a foldable collection, and a transformation function xf,
  returns a foldable collection, where any supplied reducing
  fn will be transformed by xf. xf is a function of reducing fn to
  reducing fn."
  {:added "1.5"}
  ([coll xf]
     (Folder. coll xf)))

(extend-type Folder
      clojure.core.protocols/CollReduce
      (coll-reduce [fldr f1]
                   (clojure.core.protocols/coll-reduce (:coll fldr) ((:xf fldr) f1) (f1)))
      (coll-reduce [fldr f1 init]
                   (clojure.core.protocols/coll-reduce (:coll fldr) ((:xf fldr) f1) init))

      CollFold
      (coll-fold [fldr n combinef reducef]
                 (coll-fold (:coll fldr) n combinef ((:xf fldr) reducef))))

Then the problem disappears:

(in-ns 'user)
=> (time (reduce + 0 (r/map identity (range 1e8))))
"Elapsed time: 4437.012 msecs"
4999999950000000

This is because the protocol method is no longer a Java method of the reducer object, so the reducer can be reclaimed while the (protocol) method is executing.

So next time you have a memory issue, look for closures tying the life-cycle of their closed-overs to theirs!

Tarjan’s strongly connected components algorithm

unsorted — cgrand, 18 March 2013 @ 19 h 48 min

I dislike algorithms that are full of indices and mutations. Not because they are bad but because I always have the feeling that the core idea is buried. As such, Tarjan’s SCC algorithm irked me.

So I took the traditional algorithm, implemented it in Clojure with explicit environment passing, then I replaced indices with explicit stacks (thanks to persistence!) and, after some tweaks, I realized I had gone full circle and could switch to stack lengths instead of the stacks themselves and get rid of the loop. However, the whole process made the code cleaner to my eye. You can look at the whole history.

Here is the resulting code:

(defn tarjan 
  "Returns the strongly connected components of a graph specified by its nodes
   and a successor function succs from node to nodes.
   The used algorithm is Tarjan's one."
  [nodes succs]
  (letfn [(sc [env node]
            ; env is a map from nodes to stack length or nil,
            ; nil means the node is known to belong to another SCC
            ; there are two special keys: ::stack for the current stack 
            ; and ::sccs for the current set of SCCs
            (if (contains? env node)
              env
              (let [stack (::stack env)
                    n (count stack)
                    env (assoc env node n ::stack (conj stack node))
                    env (reduce (fn [env succ]
                                  (let [env (sc env succ)]
                                    (assoc env node (min (or (env succ) n) (env node)))))
                          env (succs node))]
                (if (= n (env node)) ; no link below us in the stack, call it a SCC
                  (let [nodes (::stack env)
                        scc (set (take (- (count nodes) n) nodes))
                        ; clear all stack lengths for these nodes since this SCC is done
                        env (reduce #(assoc %1 %2 nil) env scc)]
                    (assoc env ::stack stack ::sccs (conj (::sccs env) scc)))
                  env))))]
    (::sccs (reduce sc {::stack () ::sccs #{}} nodes))))
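
A quick example of my own (not from the original post), a three-node cycle plus a node pointing into it:

=> (tarjan #{1 2 3 4} {1 #{2} 2 #{3} 3 #{1} 4 #{2}})
#{#{1 2 3} #{4}}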

As always, if you need some short-term help with Clojure (code review, consulting, training etc.), contact me!

Decaying lists: log scale for lists

pondering — cgrand, 12 February 2013 @ 13 h 44 min

The concept of exponentially decaying lists by David Barbour piqued my interest and I implemented it. It can be summed up as: log scale for lists!

As with a log scale, decaying lists make it possible to manage a large range of values and thus to get a better understanding of the data.

So a decaying list grows logarithmically with the number of conjed items. It follows that some items are dropped when others are inserted. Let’s see:

=> (seq (into (decaying-list) (range 100)))
(99 98 97 96 95 94 92 91 90 89 88 87 86 84 83 80 70 65 61 59 57 49 44 30 29 27 25 23 22 18 12 4 0)

So most of the recent items are there but the older elements have a greater probability of having been dropped.

A decaying list is parametrized by its half-life. The default half-life is 20: it means that an item in the list has one chance in two of “surviving” the insertion of 20 more items.

=> (seq (into (decaying-list 5) (range 100))) ; a half-life of 5 conjs
(99 98 97 96 95 94 93 91 83 81 79 78 68 61 57 54 52 31 15)
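
Numerically, the half-life works out as follows (a sketch of mine, assuming pure exponential decay):

(defn survival-prob
  "Probability that an item survives the insertion of k more items."
  [half-life k]
  (Math/pow 0.5 (/ k half-life)))

(survival-prob 20 20) ; => 0.5, the definition of half-life
(survival-prob 5 20)  ; => 0.0625, why the half-life-5 list above is so much shorter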

You can find out where the next decay (if any; nil is returned when there is no pending decay) is going to happen:

=> (def dl (into (decaying-list 5) (range 100)))
#'net.cgrand.decay/dl
=> (decay-loc dl)
[(93 95 97 98 99) (92 91 89 87 85 81 77 72 71 63 54 52 46 31 20 16)]

So here the decay is going to happen between 93 and 92 (the first items of the “left” and “right” sequences). By default the most recent one is kept.

=> (seq (conj dl 100))
(100 99 98 97 95 93 91 89 87 85 81 77 72 71 63 54 52 46 31 20 16)

Indeed 92 has been dropped. (Repeated calls to decay-loc always yield the same result.)

However we may elect to synthesize a new value rather than just keep one of the two candidates to decay. For example we can compute their average:

=> (seq (into (decaying-list 5 
                :collapse (fn [dl]
                            (let [[[l] [r]] (decay-loc dl)] 
                              (/ (+ l r) 2.0))))
          (range 100)))
(99 98 97 95.5 93.5 91.5 90 89 87.125 82.75 79.5 72.875 64.8125 60.875 58 50.732421875 27.0 17.75 11.25 2.8125)

Or something a bit more elaborate (and precise):

=> (defn stats [x]
     (if (map? x)
       x
       {:min x :max x :avg x :n 1}))
#'net.cgrand.decay/stats
=> (defn merge-stats [{mina :min maxa :max avga :avg na :n}
                      {minb :min maxb :max avgb :avg nb :n}]
     {:min (min mina minb)
      :max (max maxa maxb)
      :avg (/ (+ (* na avga) (* nb avgb)) (double (+ na nb)))
      :n (+ na nb)})
#'net.cgrand.decay/merge-stats
=> (seq (into (decaying-list 5 
                :collapse (fn [dl]
                            (let [[[l] [r]] (decay-loc dl)] 
                              (merge-stats (stats l) (stats r)))))
          (range 100)))
(99 98 97 {:min 92, :max 96, :avg 94.0, :n 5} 91 {:min 87, :max 90, :avg 88.5, :n 4} 86 85 84 83 {:min 80, :max 82, :avg 81.0, :n 3} {:min 78, :max 79, :avg 78.5, :n 2} 77 {:min 70, :max 76, :avg 73.0, :n 7} 69 68 {:min 49, :max 67, :avg 58.0, :n 19} {:min 41, :max 48, :avg 44.5, :n 8} {:min 36, :max 40, :avg 38.0, :n 5} 35 {:min 32, :max 34, :avg 33.0, :n 3} {:min 10, :max 31, :avg 20.5, :n 22} {:min 3, :max 9, :avg 6.0, :n 7} {:min 0, :max 2, :avg 1.0, :n 3})

Decaying lists can also maintain a state, updated after each conj or decay. Below the implementation, there is an example of state management.

You can also cap the length of the decaying list by using the :capacity option. A quick note on capacity: increasing the capacity by the half-life value (eg going from 1000 to 1050 when the half-life is 50) doubles the range the decaying list remembers: since the list grows roughly like half-life * log2(insertions), each extra half-life worth of slots doubles the remembered range.

From lazy seqs to reducers and back

unsorted — cgrand, 11 February 2013 @ 17 h 27 min

Sometimes in a seq pipeline, you know that some intermediate results are, well, intermediate and as such don’t need to be persistent but, on the whole, you still need the laziness.

You can’t always opt for reducers: while they are non-persistent (and thus faster), they imply giving up laziness.

So you can’t mix and match transformations on sequences and on reducers when you care about laziness. Or, wait! Maybe you can.

At their core, reducers are just a clever way to compose big functions while giving the user the illusion of manipulating collections. So we should be able to apply a reducer pipeline if we get hold of the composed function. This is exactly what the seq-seq function below does:

(defn reverse-conses 
  ([s tail] 
    (if (identical? (rest s) tail)
      s
      (reverse-conses s tail tail)))
  ([s from-tail to-tail]
    (loop [f s b to-tail]
      (if (identical? f from-tail)
        b
        (recur (rest f) (cons (first f) b))))))

(defn seq-seq [f s]
  (let [f1 (reduce #(cons %2 %1) nil 
              (f (reify clojure.core.protocols.CollReduce
                   (coll-reduce [this f1 init]
                     f1))))]
    ((fn this [s]
       (lazy-seq 
         (when-let [s (seq s)]
           (let [more (this (rest s)) 
                 x (f1 more (first s))]
             (if (reduced? x)
               (reverse-conses @x more nil)
               (reverse-conses x more)))))) s)))

(defmacro seq->> [s & forms]
  `(seq-seq (fn [n#] (->> n# ~@forms)) ~s))

Note that the captured function (f1) may be impure, so don’t share it!

Now one can specify non-persistent lazy seq pipelines:

=> (seq->> (range) (r/map str) (r/take 25) (r/drop 5))
("5" "6" "7" "8" "9" "10" "11" "12" "13" "14" "15" "16" 
 "17" "18" "19" "20" "21" "22" "23" "24")

To prove laziness is at play here:

=> (take 2 (seq->> (range) (r/map #(str (doto % prn))) (r/take 25) (r/drop 5)))
(0
1
2
3
4
5
6
"5" "6")

Of course seq-seq could be made to support chunked sequences and, if you try to play with it, beware of r/mapcat.

Update:

I added reverse-conses to account for the fact that when several items are added during a single call to f1 (eg during a r/mapcat “step”) they would otherwise end up in the wrong order (if only everything was more set-like).

Clojure unconference, somewhere in France, early Spring.

unsorted — cgrand, 21 December 2012 @ 14 h 25 min

Some people are willing to set up a Clojure unconference in France in early Spring. Follow the clojure-fr mailing list for further discussion.

Extensible macros (A flatter cond follow-up)

unsorted — cgrand, 21 September 2012 @ 8 h 51 min

I pushed Utils out. It features reduce-by but also variations on cond (the ones introduced in the comments of A Flatter Cond) and if-let.

cond supports the following operators: :let, :when and :when-let. A vector in test position is treated as the binding form of an if-let, so its bindings are in scope for the “then” expression but not for the rest of the cond.

if-let supports :let to introduce new bindings without testing their expression for truth.

One notable twist to these macros is that, thanks to net.cgrand.xmacros, they are extensible: new operators can safely be contributed to the macros (proper namespacing is enforced). See the cond implementation, abbreviated below:

(defmacro cond
 "An extensible variation on cond. Trailing else is supported.
keywords in test position are treated as operators.

Operator impls have the signature [rhs else] where rhs is the \"then\"
expression next to the operator."
  [& clauses]
  (when-let [[op rhs & more-clauses] (seq clauses)]
    (if (next clauses)
      (x/expand-op op rhs `(cond ~@more-clauses))
      op))) ; a single trailing expression is the else clause

(x/defdefault-op cond [test-expr then else]
  (if (vector? test-expr)
    `(if-let ~test-expr ~then ~else)
    `(if ~test-expr ~then ~else)))

(x/defop cond :let
  "Introduces local bindings."
  [bindings cont]
  `(let ~bindings ~cont))

(x/defop cond :when
  "Short-circuits the rest of the cond if false"
  [test-expr cont]
  `(when ~test-expr ~cont))

(x/defop cond :when-let
 "Short-circuits the rest of the cond if false and introduces local
bindings."
  [bindings cont]
  `(when-let ~bindings ~cont))
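
For illustration, here is a hypothetical use of the extended cond (assuming net.cgrand.xmacros and the operators above are loaded; classify and m are made-up names):

(defn classify [m x]
  (cond
    (neg? x) :negative
    :let [y (* x x)]           ; y is in scope for the remaining clauses
    (> y 100) :big-square
    [v (get m x)] [:found v]   ; if-let-style vector test, v is scoped to the "then" only
    :small))                   ; trailing else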

Follow-up: A world in a ref

unsorted — cgrand, @ 8 h 35 min

I recently published Megaref, a refined (and working) version of the idea described in A World in a Ref.

New and noteworthy:

  • actual working code,
  • the concurrency (number of guards) can be tuned at runtime,
  • tuning happens through a uniform interface (get-options, set-option!),
  • subrefs (and backward-compatible replacements for the usual STM functions) make it easy to migrate “polyref” code to megaref (see the conversion of Rich Hickey’s ant demo — mind the typo in the commit message).

Feedback welcome!

Fair conjunction: status report

unsorted — cgrand, 6 April 2012 @ 15 h 54 min
Shameless plugs:
Clojure Programming is out! Available now as an ebook and in a few days in print!
Come to my Clojure training in Geneva; it will be in a mixed-language (French/English) setting. For an English-only training, stay tuned!

Making conjunction fair is the reason why I messed with core.logic, and this post documents the approach I explore in the fair-conj2 branch.

In The Reasoned Scheduler I said that conjunction is easy: you wait for the first stream to provide an answer that you pass to the next goal. It looks easy but it suffers from being ordered: when the first goal diverges (produces an infinite stream of results or never halts) and the next goal is known to fail, the conjunction diverges too! This is unfair conjunction.

Changing the order of the goals would fix this particular example but, as noted by William Byrd in Relational Programming in miniKanren: Techniques, Applications, and Implementations (pages 53 and 54):

However, reordering goals has its disadvantages. For many programs, no ordering of goals will result in finite failure (see the remaining example in this chapter). Also, by committing to a certain goal ordering we are giving up on the declarative nature of relational programming: we are specifying how the program computes, rather than only what it computes. For these reasons, we should consider alternative solutions.

[T]he goal ordering we picked might diverge when we ask for a second answer, while the other ordering may fail finitely after producing a single answer.

So, to make conjunction fair (commutative), one has to not commit to an ordering of the goals, which means that both goals must be executed in parallel (interleaved execution or even true parallelism). As soon as one of the goals produces an answer, this substitution has to be communicated to the other goal so as to narrow its search space.

As a consequence of the disappearance of ordering, the distinction between a goal and a stream disappears too. I introduced a Search abstraction. A Search may yield a substitution (or nil) and progress with step. It sounds awfully like first and next.
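
A minimal sketch of what I mean (the actual code in the fair-conj2 branch differs; these names are for illustration only):

(defprotocol Search
  (yield [search] "Returns a substitution produced at this point, or nil.")
  (step [search] "Returns the search advanced by one step."))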

Let’s name plus, join and narrow respectively the disjunction, conjunction and narrowing operators. Conjunction is distributive over disjunction as usual and narrowing is distributive over both.

Here is how the key expression is derived:

(join a b)
(join (plus (yield a) (step a)) b) ; decompose a 
(plus (join (yield a) b) (join (step a) b)) ; distribute join
(plus (narrow b (yield a)) (join (step a) b)) ; joining with a substitution (or nil) is narrowing
(plus (narrow b (yield a)) (join b (step a))) ; commute the operands of join for interleaving

This expression can be found nearly as-is in the code.

Narrowing is performed lazily: when one steps a suspended narrowing (a Narrow instance), the narrowing is propagated (thanks to its distributivity) to the child searches of its argument.

To make narrowing efficient, I introduced the min-yield function which, given a Search, returns the minimal substitution yielded by this search. That is: if this Search ever yields results, they will be extensions of the min-yield substitution. This makes it possible to discard a Search in its entirety when the narrowing substitution and the min-yield substitution are not compatible!

To compute min-yield substitutions, the narrowing process leaves nodes (Scope instances) in the computation tree labeled with the narrowing substitution. By default min-yield returns empty-s, the empty substitution.

This is the gist of my approach to fair conjunction. In a subsequent post, I’ll cover heuristics I use to prioritize execution of Searches.

Practically, what does it mean? It means that logic programs are easier to write; it also means that, at the moment, a program written with unfair conjunction in mind runs slower with fair conjunction. Interestingly, under fair conjunction, the naive pluso (p. 63) seems refutationally complete.

I intend to explore how constraints can be added to this branch and how it will impact (positively) performance.

If you feel like dabbling with this branch, I’ll be more than happy to hear your feedback. Thanks!

A poor man’s interval tree

unsorted — cgrand, 16 March 2012 @ 22 h 49 min

Sorted collections can be put to good (ab)use; here I define interval-lt, a partial order on intervals and points — where an interval is a vector [from to] (from inclusive, to exclusive; either can be nil to denote infinity) and a point is [n n].

(defn interval-lt
  [[a b] [c d]]
  (boolean (and b c 
                (if (= a b)
                  (neg? (compare b c))
                  (<= (compare b c) 0)))))
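
A few quick checks of my own to get a feel for the ordering (a point is [n n], an interval [from to)):

(interval-lt [1 1] [1 5]) ; => false, the point 1 belongs to [1 5)
(interval-lt [1 5] [5 5]) ; => true, [1 5) lies entirely before the point 5
(interval-lt [1 5] [3 8]) ; => false, overlapping intervals are not ordered
(interval-lt [3 8] [1 5]) ; => false in either direction, so they compare as equal keys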

Using this partial order and a sorted-map, I can build an interval tree:

(def empty-interval-map 
  (sorted-map-by interval-lt [nil nil] #{}))

(defn- isplit-at [interval-map x]
  (if x
    (let [[[a b :as k] vs] (find interval-map [x x])]
      (if (or (= a x) (= b x))
        interval-map
        (-> interval-map (dissoc k) (assoc [a x] vs [x b] vs))))
    interval-map))

(defn- ialter [interval-map from to f & args]
  (let [interval-map (-> interval-map (isplit-at from) (isplit-at to))
        kvs (for [[r vs] 
                  (cond 
                    (and from to)
                    (subseq interval-map >= [from from] < [to to])
                    from
                    (subseq interval-map >= [from from])
                    to
                    (subseq interval-map < [to to])
                    :else
                    interval-map)]
              [r (apply f vs args)])]
    (into interval-map kvs)))

(defn iassoc [interval-map from to v]
  (ialter interval-map from to conj v))

(defn idissoc [interval-map from to v]
  (ialter interval-map from to disj v))

(defn iget [interval-map x]
  (get interval-map [x x]))

Demo: let’s state who is present and at which times:

(-> empty-interval-map
  (iassoc 9 17 "Ethel")
  (iassoc 7 12 "Lucy")
  (iassoc 11 20 "Fred"))
; {[nil 7] #{}, [7 9] #{"Lucy"}, 
;  [9 11] #{"Ethel" "Lucy"}, 
;  [11 12] #{"Ethel" "Fred" "Lucy"}, 
;  [12 17] #{"Ethel" "Fred"}, 
;  [17 20] #{"Fred"}, 
;  [20 nil] #{}}

And now one can query by time:

(iget *1 8)
; #{"Lucy"}
(iget *2 15)
; #{"Ethel" "Fred"}
(iget *3 20)
; #{}

Sure, this is not the best interval tree implementation ever: it suffers from severe fragmentation and its update complexity is not ideal. However, if lookup is all that matters, it works nicely in O(log n) — each iassoc adds at most two entries to the map, so for n intervals you have at most 2n+1 entries, and lookup is O(log (2n+1)), that is O(log n).
