Practical Data Coercion with Prismatic/schema

Thu, Aug 27, 2015 · Tagged Clojure · 10 minute read

If you follow me on Twitter, you probably know I’m a big fan of Prismatic’s schema library, which gives us a convenient way to validate and enforce the format of data in our Clojure applications. I use schema extensively both to provide some of the comfort / confirmation of a static type system, and to enforce run-time contracts for data coming off the wire.

But a problem quickly arises when we’re enforcing contracts over data drawn from an external format like JSON: the range of types available in JSON is limited compared to what we’re used to in Clojure, and might not include some of the types we’ve used in our schemas, leaving our schemas impossible to satisfy. Note that I’m not necessarily talking about anything exotic; simple things like sets, keywords, and dates are missing. The situation is even worse if we’re validating command line parameters, where everything is a string regardless of whether it logically represents a number, an enumeration value, or a URL.
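To make the mismatch concrete, here is a small sketch of a round trip through JSON using clojure.data.json (the same library used later in this post). The names `original` and `round-tripped` are mine, purely for illustration.

```clojure
(require '[clojure.data.json :as json])

;; A config built from ordinary Clojure types: a set and a keyword.
(def original
  {:users  #{"horse_ebooks"}
   :format :txt})

;; Serialize to JSON and read it straight back.
(def round-tripped
  (json/read-str (json/write-str original) :key-fn keyword))

round-tripped
;; => {:users ["horse_ebooks"], :format "txt"}
```

The set came back as a vector and the keyword as a string, so the data no longer equals what we started with.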

What are we to do? Try to walk this data of unknown format, which is perhaps nested with optional components, transforming certain bits, and then run the result through our schema validation? That sounds ugly. And what do those error messages look like when it doesn’t match? Or we could validate that our (say) “date” parameters are present and are strings in a format that looks parseable, then transform the data (which is at least in a known format now), and then validate it again. That’s less than ideal too: we’re going to end up with a proliferation of schemas which differ only in predictable ways, e.g. “params come in as a hash of two date-like strings, then get transformed to a hash of two dates”.

Fortunately for us, the fine folks at Prismatic must have run into this before we did, and thus they provided a fine solution in the form of schema-driven data transformations, which allow us to say “here are all the (consistent, well-defined) tricks you can use to beat this data into the right format—could you make it validate? And what did that resultant, valid data look like?” The official docs are good, and this blog post contains a wealth of information, but I found myself struggling to understand certain parts of the documentation until I’d read some of the implementation details and struggled through some coercion code of my own¹. My goal here is to provide a practical example of how to use schema’s coercions so you can hit the ground running.

An Illustrative Example

Pretend we’re writing a command-line tool to download users’ tweets and output them to a local archive in either plain text or JSON format. We’ll also allow setting a date indicating the earliest tweets we want to fetch in case we don’t need every tweet the users have ever written. Oh, and Ops wants our configuration to be done via JSON. Something, something, Docker.

Configuring the application is pretty simple: we’ll need a set of usernames (strings), a date, and a keyword indicating the output format:²

(ns camdez.blog.coerce
  (:require [clojure.data.json :as json]
            [clojure.instant :refer [read-instant-date]]
            [clojure.java.io :as io]
            [schema.coerce :as coerce]
            [schema.core :as s]
            [schema.utils :as s-utils])
  (:import java.util.Date))

(def Config
  {:users  #{s/Str}
   :after  (s/maybe Date)
   :format (s/enum :txt :json)})
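As a quick sanity check, a value built from native Clojure types should satisfy this schema. The sketch below repeats the Config definition so it can be pasted into a REPL on its own; the sample values are made up.

```clojure
(require '[schema.core :as s])
(import 'java.util.Date)

;; Same schema as above, repeated so this snippet stands alone.
(def Config
  {:users  #{s/Str}
   :after  (s/maybe Date)
   :format (s/enum :txt :json)})

;; s/validate returns the value itself when it conforms.
(s/validate Config {:users  #{"horse_ebooks"}
                    :after  (Date.)
                    :format :json})

;; s/maybe means nil is also acceptable for :after.
(s/validate Config {:users #{} :after nil :format :txt})
;; => {:users #{}, :after nil, :format :txt}
```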

If it seemed odd earlier when I said we might use types in our schema that JSON doesn’t support—why not just not do that?—hopefully this clears things up; everything used above is pretty standard Clojure data modeling—and none of it works in JSON. To illustrate, let’s try loading our config from a JSON-format file.

Here’s a fairly natural JSON representation of our configuration:

{
  "users":  ["horse_ebooks", "swiftonsecurity"],
  "after":  "2015-01-01T00:00:00.000Z",
  "format": "txt"
}

Let’s drop that into a config.json file, and then add a quick function to encapsulate the repetitive elements of what we’re going to cover:

(def config-file-name "config.json")

(defn load-config-file []
  ;; with-open closes the reader once the JSON has been parsed
  (with-open [rdr (io/reader config-file-name)]
    (json/read rdr :key-fn keyword)))
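It helps to see exactly what the parser hands us before any schema gets involved. This sketch inlines the same JSON as a string (so no file is needed) and reads it with the same :key-fn option; `config-json` and `raw-config` are names I've picked for the example.

```clojure
(require '[clojure.data.json :as json])

;; The same JSON as config.json, inlined so no file is needed.
(def config-json
  "{\"users\":  [\"horse_ebooks\", \"swiftonsecurity\"],
    \"after\":  \"2015-01-01T00:00:00.000Z\",
    \"format\": \"txt\"}")

(def raw-config
  (json/read-str config-json :key-fn keyword))

raw-config
;; => {:users ["horse_ebooks" "swiftonsecurity"],
;;     :after "2015-01-01T00:00:00.000Z",
;;     :format "txt"}
```

The keys are keywords thanks to :key-fn, but every value is still a plain JSON type: a vector and two strings.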

A First Attempt, Sans Coercion

Now what we want to do is to load our config from JSON, enforcing our Config schema. Here’s a first cut–and where we’ll see what the problem is:

(->> (load-config-file)
     (s/validate Config))

;; Value does not match schema: {:users (not (set?
;; a-clojure.lang.PersistentVector)), :after (not (instance?
;; java.util.Date a-java.lang.String)), :format (not (#{:txt :json}
;; "txt"))}

Oh, snap. Literally nothing about that worked. It seemed so simple, but none of our three map entries are valid. But note that all three of the validation errors are quite similar: :users is not a set because JSON doesn’t have them, :after is not a date because JSON doesn’t have them, and :format is not a keyword (belonging to the set we specified) because JSON doesn’t have them. JSON simply isn’t expressive enough to represent our config. What’s a dev to do?
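If you'd rather inspect those mismatches as data than catch an exception, schema.core/check returns them directly. A minimal sketch, with the schema and the parsed data repeated so it runs on its own (`raw-config` and `problems` are illustrative names):

```clojure
(require '[schema.core :as s])
(import 'java.util.Date)

(def Config
  {:users  #{s/Str}
   :after  (s/maybe Date)
   :format (s/enum :txt :json)})

;; The data as it comes back from the JSON parser.
(def raw-config
  {:users  ["horse_ebooks" "swiftonsecurity"]
   :after  "2015-01-01T00:00:00.000Z"
   :format "txt"})

;; s/check returns nil for valid data, or a structure describing
;; the mismatches; here, one entry per offending key.
(def problems (s/check Config raw-config))

(set (keys problems))
;; => #{:users :after :format}
```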

Let’s Get Coercive

This is where coercion comes in–we want to automatically transform the data based on the expectations of the schema. Logically we know that this process is going to require three things: (1) a schema, (2) a specification for how to transform, and (3) the data itself. Keep that in mind as you read the following code snippet:

(defn coerce-and-validate [schema matcher data]
  (let [coercer (coerce/coercer schema matcher)
        result  (coercer data)]
    (if (s-utils/error? result)
      (throw (Exception. (format "Value does not match schema: %s"
                                 (s-utils/error-val result))))
      result)))

(->> (load-config-file)
     (coerce-and-validate Config coerce/json-coercion-matcher))

(Don’t worry too much about the if expression: unlike schema.core/validate, the function returned by schema.coerce/coercer doesn’t throw exceptions but returns an error value, so I’ve built a quick recreation of that functionality to maintain parity with the first example.)

Notice that we’re now using schema.coerce/json-coercion-matcher, which gets passed to schema.coerce/coercer along with our Config schema. What we get back is a function we can apply to a piece of data to transform that data to match the schema, or to return an error if it can’t find a way to fulfill the transformation.
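To see that return-an-error behavior in isolation, here is a small sketch using a one-node schema. The name `coerce-format` is just for this example.

```clojure
(require '[schema.coerce :as coerce]
         '[schema.core :as s]
         '[schema.utils :as s-utils])

;; Build the coercion function once; it can be reused for many inputs.
(def coerce-format
  (coerce/coercer (s/enum :txt :json) coerce/json-coercion-matcher))

(coerce-format "txt")
;; => :txt

;; An unsupported value comes back wrapped in an error container
;; instead of raising an exception.
(s-utils/error? (coerce-format "xml"))
;; => true
```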

For the moment, just regard json-coercion-matcher as a magical black box of goodness (we’ll dive into matchers in the next section), but the important thing to understand is that it contains instructions for transforming data. This particular matcher is provided with the schema library, and it encapsulates several common JSON to Clojure transformations.

Now when we try to load the config:

;; Value does not match schema: {:after (not (instance? java.util.Date
;; a-java.lang.String))}

This means that the json-coercion-matcher knew how to transform :users’s [Str] to a #{Str}, and :format’s Str to a keyword in the enumeration just based on the expectations laid out by our schema, and without us saying “get the value at this key and transform it in that way”. Awesome.
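If you want to poke at the black box a little, you can call the matcher directly with a piece of schema and see whether it offers a coercion at all. This is only a sketch of that experiment; `coerce-users` is a name I've made up.

```clojure
(require '[schema.coerce :as coerce]
         '[schema.core :as s])
(import 'java.util.Date)

;; Given a schema node, the matcher returns a coercion function or nil.
(some? (coerce/json-coercion-matcher #{s/Str}))             ;; => true
(some? (coerce/json-coercion-matcher (s/enum :txt :json)))  ;; => true
(coerce/json-coercion-matcher Date)                         ;; => nil

;; The set coercion on its own: a JSON array becomes a Clojure set.
(def coerce-users
  (coerce/coercer #{s/Str} coerce/json-coercion-matcher))

(coerce-users ["horse_ebooks" "swiftonsecurity" "horse_ebooks"])
;; => #{"horse_ebooks" "swiftonsecurity"}
```

The nil for Date is exactly why :after is still failing.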

Two down, one to go.

Match Me Another

Since json-coercion-matcher didn’t magically transform that date string into a Date for us, it’s time to crack open the black box and learn how to write a matcher of our own. They’re really not that complicated. Fundamentally, a matcher is a piece of code that’s handed a single node in the tree of input data and the corresponding node in the schema.

A matcher used by schema.coerce/coercer will be applied to every node in the input, resulting in one of three possible outcomes:

1. The matcher signals that it can’t be used here, based on the schema alone. In this case it fails fast, without even looking at the input data. (e.g. if a matcher only transforms to Dates, there’s no need to run the matcher unless the schema says we’re looking for a Date.)
2. The matcher returns transformed data–it knew how to transform the input data, so it did so.
3. The matcher returns the input data unchanged, effectively signaling that it doesn’t know how to transform the input data to the desired format. (Technically this is a subcase of transformation where the transformation is the identity function, but logically it’s a separate case.)

Keep those cases in mind as we look at some code:

(def datetime-regex #"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z")

(defn datetime-matcher [schema]
  (when (= Date schema)
    (coerce/safe
      (fn [x]
        (if (and (string? x) (re-matches datetime-regex x))
          (read-instant-date x)
          x)))))

Let’s break that down:

We’re actually expected to return either nil–this transformation doesn’t apply to this schema node (case #1 above)–or a closure (read: function) that attempts to transform the data. The when line says: if you’re not looking for a Date, this matcher can’t help you.
The if line checks the input value to make sure it’s something we can transform. We check that it’s a string, and not just any string, but a string in a format that we think we can parse. There’s no sense throwing every single string at read-instant-date–we just want to try to parse the ones that look like dates. If it looks like a date, we parse it (case #2). If it doesn’t, we return it unchanged (case #3).
Just because a string matches our datetime-regex, that doesn’t necessarily mean it can be parsed by read-instant-date³, and in these cases read-instant-date will throw; that’s what coerce/safe is there for. This handy little utility will catch any exceptions and return the original input value unchanged (i.e. “it couldn’t be transformed”, case #3).
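Here is a sketch that exercises each of the three cases at the REPL. The matcher is repeated so the snippet stands alone; `->date` is a throwaway name for the coercion function it returns.

```clojure
(require '[clojure.instant :refer [read-instant-date]]
         '[schema.coerce :as coerce]
         '[schema.core :as s])
(import 'java.util.Date)

;; Repeated from above so the snippet stands alone.
(def datetime-regex #"\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}\.\d{3}Z")

(defn datetime-matcher [schema]
  (when (= Date schema)
    (coerce/safe
      (fn [x]
        (if (and (string? x) (re-matches datetime-regex x))
          (read-instant-date x)
          x)))))

;; Case 1: the schema isn't Date, so no coercion function is returned.
(datetime-matcher s/Str)
;; => nil

(def ->date (datetime-matcher Date))

;; Case 2: a parseable string is transformed.
(->date "2015-01-01T00:00:00.000Z")
;; => #inst "2015-01-01T00:00:00.000-00:00"

;; Case 3: a string that fails the regex comes back unchanged...
(->date "yesterday")
;; => "yesterday"

;; ...and so does one that passes the regex but makes the parser throw.
(->date "2015-99-99T00:00:00.000Z")
;; => "2015-99-99T00:00:00.000Z"
```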

Not too bad, right? Cool. Now we can start putting it all together.

Keep in mind that we can’t just replace the original use of json-coercion-matcher with datetime-matcher or we’ll break the other two coercion cases we already fixed, so we’ll need to combine the two matchers:

(def config-matcher
  (coerce/first-matcher [datetime-matcher coerce/json-coercion-matcher]))

coerce/first-matcher is a matcher combinator: given a sequence of matchers, it returns a new matcher that applies the first one that reports it matches. Keep in mind this is based on that initial, schema-only, sans-data check (case #1). Once we find a matcher that says it can produce the desired output type, we apply it and live with whatever we get back. This is sufficient for the majority of cases where you want to apply multiple matchers.
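The "live with whatever we get back" part means order matters. The following sketch uses a deliberately useless, hypothetical matcher (`noop-int-matcher`, my invention) that claims s/Int but never changes anything, to show how it can shadow the built-in coercion when listed first.

```clojure
(require '[schema.coerce :as coerce]
         '[schema.core :as s]
         '[schema.utils :as s-utils])

;; Hypothetical matcher: claims s/Int but returns its input untouched.
(defn noop-int-matcher [schema]
  (when (= s/Int schema)
    identity))

;; Built-in matcher first: its s/Int coercion wins.
(def json-first
  (coerce/coercer
    s/Int
    (coerce/first-matcher [coerce/json-coercion-matcher noop-int-matcher])))

;; No-op matcher first: it wins, and the built-in one is never consulted.
(def noop-first
  (coerce/coercer
    s/Int
    (coerce/first-matcher [noop-int-matcher coerce/json-coercion-matcher])))

(json-first 1.0)
;; => 1

(s-utils/error? (noop-first 1.0))
;; => true
```

In our config-matcher the two matchers handle disjoint schema nodes, so this ordering concern doesn't bite, but it's worth knowing before you stack up more of them.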

Finally, keep in mind that while I’ve named this config-matcher, there’s nothing about it that is specific to the particular Config schema that we’re using. It represents a generic set of rules about how to transform JSON (or other input) into Clojure data, and we might well apply it to all JSON our application handles.

Ok, let ’er rip!

(->> (load-config-file)
     (coerce-and-validate Config config-matcher))

;; {:users #{"swiftonsecurity" "horse_ebooks"}
;;  :after #inst "2015-01-01T00:00:00.000-00:00"
;;  :format :txt}

Bingo! No validation errors, and all the configuration data we need with no manual data munging code.

Closing Remarks

Schema-driven transformations are super cool because they spare us from writing a whole bunch of fiddly, error-prone, repetitive code. They allow us to establish consistent data transformation rules that we can apply as narrowly or widely as we like, and provide schema-based mismatch errors that represent the end-to-end totality of data transformation, unlike a validation at the border and another after a manual transformation step.

Keep in mind, this definitely isn’t just for config files. I think that is a useful, real-world scenario, but consider this approach any time you need to transform data to a well-structured format–a problem that nearly always arises when crossing boundaries from one data format (JSON, XML, CSV, YAML, edn, CLI params, envvars, DB data, etc.) to another. A particularly powerful case to consider is transforming web API parameters to application domain objects. That’s definitely a usage that I will be exploring more.
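As a taste of the all-strings case (CLI or query parameters), schema also ships a string-coercion-matcher. This is only a sketch under my own assumptions: the `PageParams` schema and the sample input are invented for illustration and aren't from any real API.

```clojure
(require '[schema.coerce :as coerce]
         '[schema.core :as s])

;; Hypothetical schema for a pair of query-string parameters.
(def PageParams
  {:page s/Int
   :sort (s/enum :asc :desc)})

(def coerce-params
  (coerce/coercer PageParams coerce/string-coercion-matcher))

;; Everything arrives as a string; the schema says what it should become.
(coerce-params {:page "2" :sort "desc"})
;; => {:page 2, :sort :desc}
```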

Thanks for reading! If you have any questions or great ideas, please feel free to leave a comment or hit me up on Twitter.

¹ In particular, while the docs discuss the basics of coercion, as well as the details of writing walkers, it wasn’t immediately obvious how to add support for a non-core type, or how to augment the existing json-coercion-matcher with custom extensions.
² I’m also pulling in all of the dependencies we’ll need for the rest of the post so there’s no need to fuss with that later.
³ Especially as I’ve been fairly lax about my regex. Consider cases like "2015-99-99T00:00:00.000Z".