Regex Optimization in Clojure

Earlier this week, after responding to a post on ClojureVerse, I got curious about re-implementing the basics of Emacs’s regexp-opt function in Clojure. I thought it would be a fun little coding exercise so I decided to take a stab at it during a few spare minutes in my day and was very pleased with the concision and clarity of the result.

Emacs’s regexp-opt takes a list of strings and produces an optimized regular expression matching all of the passed strings. A non-optimized solution to this problem would be to simply combine all of the possibilities into a regex alternation:

(def sample-opts ["AB" "ABOUT" "APPLE" "BOTTLE"])

;; (Extracting this fn to re-use in a later example.)
(defn- re-alt-str
  "Build a string representing regex alternation."
  [opts]
  (->> opts
       (str/join "|")
       (format "(?:%s)")))

(defn naive-re-opt [opts]
  (re-pattern (re-alt-str opts)))

(naive-re-opt sample-opts)
;; => #"(?:AB|ABOUT|APPLE|BOTTLE)"

(re-find (naive-re-opt sample-opts)
         "PLEASE GIVE ME AN APPLE")
;; => "APPLE"

This approach works perfectly fine, but isn’t optimal because in searching for a string like “APPLE”, the regex engine initially finds a match against the A in “AB”, then realizes the B doesn’t match before moving on to “ABOUT” and doing the same partial match dance again before trying “APPLE”1.

Instead, we can compute an optimized regular expression where the A is only matched once and only then do we continue on to explore the possible suffixes following A (viz. “B”, “BOUT”, “PPLE”). We also want to repeat this logic through the rest of the pattern text, only matching the B once.

How can we use Clojure to tackle this problem?

The algorithm is actually quite simple:

First we build a graph (tree) where each edge represents successive letters in the strings of opts:

-> A -> B
          -> O -> U -> T
     -> P -> P -> L -> E
-> B -> O -> T -> T -> L -> E

(Newlines represent terminators here, which need to be explicitly modeled in the graph so that “AB” is considered a match and not just a prefix on the way to matching “ABOUT”.)

Then we simply collapse the tails into literal strings, and the branches into alternations:


(defn better-re-opt [opts]
  (->> opts
       ;; Build graph
       (reduce (fn [graph s]
                 (assoc-in graph (conj (vec s) nil) nil))
               {})
       ;; Reduce graph to string
       (walk/postwalk
        (fn [x]
          (cond (not (map? x))  x
                (= 1 (count x)) (apply str (first x))
                :else           (re-alt-str
                                 (map #(apply str %) x)))))
       re-pattern))

(better-re-opt sample-opts)
;; => #"(?:A(?:B(?:|OUT)|PPLE)|BOTTLE)"

(Excuse the dense indentation, I’m trying to squeeze this into 70 columns to avoid horizontal scrolling.)

That’s it! About ten lines of code to solve this whole problem, and maybe 20 minutes of work for me from idea to finished feature.

My favorite thing about this implementation is how easy Clojure makes it to build the graph. I initially went into this thinking of manually consing up the graph before realizing assoc-in could do all of the work for me. Essentially we’re just doing something like this:

(-> {}
    (assoc-in [\A \B nil] nil)
    (assoc-in [\A \B \O \U \T nil] nil)
    (assoc-in [\A \P \P \L \E nil] nil)
    (assoc-in [\B \O \T \T \L \E nil] nil))
;; =>
{\A {\B {nil nil,
         \O {\U {\T {nil nil}}}},
     \P {\P {\L {\E {nil nil}}}}},
 \B {\O {\T {\T {\L {\E {nil nil}}}}}}}

Notice the similarity to the graph we drew above?

Those nil terminators may look funky, but I chose that approach intentionally because (str nil) ;=> "", which makes the code to reduce the graph to a string very compact.

To do that reduction, we perform a post-order traversal of the nodes of the graph (convenient as all get-out with clojure.walk/postwalk), returning non-map values unchanged and only operating when we’re looking at a map. For maps with a single key-value pair, we use str to concatenate the key (the head character) and the value (the reduced tail string). Referring to this operation as f:

(f {nil nil}) ; <=> (str nil nil) => ""
(f {\A "BC"}) ; <=> (str \A "BC") => "ABC"

When we have multiple key-value pairs, we reduce each pair in the same fashion and then combine them into a regex alternation just like in the naive version.

The reduction (here, for a smaller graph), proceeds something like this:

{\A {\B {\C {nil nil},
         \U {\T {nil nil}}}}}

{\A {\B {\C {nil nil},
         \U {\T ""}}}}

{\A {\B {\C {nil nil},
         \U "T"}}}

{\A {\B {\C "",
         \U "T"}}}

{\A {\B "(?:C|UT)"}}

{\A "B(?:C|UT)"}

"AB(?:C|UT)"

This is the kind of scenario where all the features of Clojure come together so elegantly to make a succinct, beautiful solution.


Full code listing for ease of copy / paste:

(ns camdez.re-opt
  (:require [clojure.walk :as walk]
            [clojure.string :as str]))

(defn- re-alt-str [opts]
  (if (next opts)
    (str "(?:" (str/join "|" opts) ")")
    (first opts)))

(defn re-opt
  "Build an optimized regular expression to match any of the literal
  strings in `opts` by unifying common prefixes."
  [opts]
  (->> opts
       (reduce (fn [graph s]
                 (assoc-in graph (conj (vec s) nil) nil))
               {})
       (walk/postwalk (fn [x]
                        (if (map? x)
                          (re-alt-str (map #(apply str %) x))
                          x)))
       re-pattern))

  1. In practice, our regex engine may be smart enough to avoid these operations through internal optimization, but that’s why this is just a coding exercise. ↩︎