Regex Optimization in Clojure
Earlier this week, after responding to a post on
ClojureVerse, I got curious about re-implementing the basics
of Emacs’s regexp-opt
function in Clojure. I thought it would be a
fun little coding exercise so I decided to take a stab at it during a
few spare minutes in my day and was very pleased with the concision
and clarity of the result.
Emacs’s regexp-opt
takes a list of strings and produces an optimized
regular expression matching all of the passed strings. A
non-optimized solution to this problem would be to simply combine
all of the possibilities into a regex alternation:
(def sample-opts ["AB" "ABOUT" "APPLE" "BOTTLE"])
(defn- re-alt-str
"Build a string representing regex alternation."
[opts]
(if (next opts)
(str "(?:" (str/join "|" opts) ")")
(first opts)))
(defn naive-re-opt [opts]
(re-pattern (re-alt-str opts)))
(naive-re-opt sample-opts)
;; => #"(?:AB|ABOUT|APPLE|BOTTLE)"
(re-find (naive-re-opt sample-opts)
"PLEASE GIVE ME AN APPLE")
;; => "APPLE"
This approach works perfectly fine, but isn’t optimal because in searching for a string like “APPLE”, the regex engine initially finds a match against the A in “AB”, then realizes the B doesn’t match before moving on to “ABOUT” and doing the same partial match dance again before trying “APPLE”1.
Instead, we can compute an optimized regular expression where the A is only matched once and only then do we continue on to explore the possible suffixes following A (viz. “B”, “BOUT”, “PPLE”). We also want to repeat this logic through the rest of the pattern text, only matching the B once.
How can we use Clojure to tackle this problem?
The algorithm is actually quite simple:
First we build a graph (tree) where each edge represents successive
letters in the strings of opts
:
Note: terminators need to be explicitly modeled in the graph so that “AB” is considered a match and not just a prefix on the way to matching “ABOUT”.
Then we simply collapse the tails into literal strings, and the branches into alternations:
(defn better-re-opt [opts]
(->> opts
;; Build graph
(reduce (fn [graph s]
(assoc-in graph (conj (vec s) nil) nil))
{})
;; Reduce graph to string
(walk/postwalk (fn [x]
(if (map? x)
(re-alt-str (map #(apply str %) x))
x)))
re-pattern))
(better-re-opt sample-opts)
;; => #"(?:A(?:B(?:|OUT)|PPLE)|BOTTLE)"
(Excuse the dense indentation, I’m trying to squeeze this into 70 columns to avoid horizontal scrolling.)
That’s it! About ten lines of code to solve this whole problem, and maybe 20 minutes of work for me from idea to finished feature.
My favorite thing about this implementation is how easy Clojure makes
it to build the graph. I initially went into this thinking of
manually cons
ing up the graph before realizing assoc-in
could do
all of the work for me. Essentially we’re just doing something like
this:
(-> {}
(assoc-in [\A \B nil] nil)
(assoc-in [\A \B \O \U \T nil] nil)
(assoc-in [\A \P \P \L \E nil] nil)
(assoc-in [\B \O \T \T \L \E nil] nil))
;; =>
{\A {\B {nil nil,
\O {\U {\T {nil nil}}}},
\P {\P {\L {\E {nil nil}}}}},
\B {\O {\T {\T {\L {\E {nil nil}}}}}}}
Notice the similarity to the graph we drew above?
Those nil
terminators may look funky, but I chose that approach
intentionally because (str nil) ;=> ""
, which makes the code to
reduce the graph to a string very compact.
To do that reduction, we perform a post-order traversal of the nodes
of the graph (convenient as all get-out with clojure.walk/postwalk
),
returning non-map values unchanged and only operating when we’re
looking at a map. For maps with a single key-value pair, we use str
to concatenate the key (the head character) and the value (the reduced
tail string). Referring to this operation as f
:
(f {nil nil}) ; <=> (str nil nil) => ""
(f {\A "BC"}) ; <=> (str \A "BC") => "ABC"
When we have multiple key-value pairs, we reduce each pair in the same fashion and then combine them into a regex alternation just like in the naive version.
The reduction (here, for a smaller graph), proceeds something like this:
{\A {\B {\C {nil nil},
\U {\T {nil nil}}}}}
{\A {\B {\C {nil nil},
\U {\T ""}}}}
{\A {\B {\C {nil nil},
\U "T"}}}
{\A {\B {\C "",
\U "T"}}}
{\A {\B "(?:C|UT)"}}
{\A "B(?:C|UT)"}
"AB(?:C|UT)"
This is the kind of scenario where all the features of Clojure come together so elegantly to make a succinct, beautiful solution.
Full code listing for ease of copy / paste:
(ns camdez.re-opt
(:require [clojure.walk :as walk]
[clojure.string :as str]))
(defn- re-alt-str
"Build a string representing regex alternation."
[opts]
(if (next opts)
(str "(?:" (str/join "|" opts) ")")
(first opts)))
(defn re-opt
"Build an optimized regular expression to match any of the literal
strings in `opts` by unifying common prefixes."
[opts]
(->> opts
(reduce (fn [graph s]
(assoc-in graph (conj (vec s) nil) nil))
{})
(walk/postwalk (fn [x]
(if (map? x)
(re-alt-str (map #(apply str %) x))
x)))
re-pattern))
In practice, our regex engine may be smart enough to avoid these operations through internal optimization, but that’s why this is just a coding exercise. ↩︎