I've been working on a Cascalog testing suite these past few weeks, an extension to Brian Marick's Midje, that eases much of the pain of testing MapReduce workflows. I think a lot of the dull work we see in the Hadoop community is a direct result of fear. Without proper tests, Hadoop developers can't help but be scared of making changes to production code. When creativity might bring down a workflow, it's easiest to get it working once and leave it alone.
The antidote to all of this fear is a functional testing suite. As I discussed in Getting Creative with MapReduce, Hadoop workflows are difficult to test at all; testing application logic in isolation from data storage is impossible.
Cascalog is free of this weakness. midje-cascalog allows you to test Cascalog queries as pure functions, both in isolation and as components of more complicated workflows. The resulting tests are truly beautiful.
I'll start by introducing midje-cascalog's testing operators, then move on to a Cascalog implementation of Word Count, tests included. You can find all source code from this post on github.
In this section, I'll discuss midje-cascalog's testing operators: fact?- and fact?<-. (The syntax mirrors ?- and ?<-, Cascalog's query execution operators.) These operators provide the abstractions necessary for testing complex Cascalog workflows. Add them to your namespace by including (:use midje.cascalog) in the namespace header.
Let's begin by defining a function to test:
    (defn mk-inc-query
      [src]
      (<- [?a ?b]
          (src ?a)
          (inc ?a :> ?b)))
mk-inc-query accepts a source of 1-tuples and returns a query that generates 2-tuples. To test that mk-inc-query actually does this, you need to supply mk-inc-query with tuples and check the tuples produced by the query it returns.
Each of the following forms uses the fact?- operator to state a distinct "fact" about our query. fact?- expects a sequence of result tuples followed by the query tasked with producing them.
These two facts about mk-inc-query are true, and pass:
    ;; The query returned by (mk-inc-query [[1]]),
    ;; when executed,
    ;; returns a single tuple: [1 2]
    (fact?- [[1 2]] (mk-inc-query [[1]])) ;; fact is true!

    ;; The query returned by (mk-inc-query [[1] [10]]),
    ;; when executed,
    ;; returns two tuples: [10 11] and [1 2]
    (fact?- [[10 11] [1 2]]
            (mk-inc-query [[1] [10]])) ;; fact is true!
This fact is false, and fails:
    ;; The query returned by (mk-inc-query [[1]]),
    ;; when executed,
    ;; returns a single tuple: ["fail!" 10].
    (fact?- [["fail!" 10]] (mk-inc-query [[1]])) ;; fact is FALSE!
fact?- can take multiple pairs of result tuples and queries:
    ;; Same as two true facts above.
    (fact?- [[1 2]] (mk-inc-query [[1]])
            [[10 11] [1 2]] (mk-inc-query [[1] [10]])) ;; both facts are true!
Strings are ignored wherever they appear, so feel free to pepper your facts with comments.
    (fact?- "These results:" [[1 2]]
            "Are produced by this query:" (mk-inc-query [[1]])) ;; true
Note that facts don't have to be top level forms. It's perfectly acceptable to wrap facts in let, if it makes the test clearer:
    (let [src     [[1]]
          results [[1 2]]]
      (fact?- results (mk-inc-query src))) ;; true
Cascalog pipes quite a bit of logging to stdout. Facts suppress this logging by default, only showing entries with a FATAL log level.
If you want to see more information on fact execution, you can customize the log level by placing a keyword at the beginning of your fact:
(fact?- :info [[1 2]] (mk-inc-query [[1]])) ;; true
As of version 0.2.1, midje-cascalog supports the following log-level keywords, and defaults to :fatal:
:off :fatal :warn :info :debug
The fact?<- operator allows you to define and test a query within the same form. The following two facts are equivalent:
    (let [src [[1]]]
      (fact?- [[1 2]]
              (<- [?a ?b]
                  (src ?a)
                  (inc ?a :> ?b)))) ;; true

    (let [src [[1]]]
      (fact?<- [[1 2]]
               [?a ?b]
               (src ?a)
               (inc ?a :> ?b))) ;; true
Where fact?- is useful for testing full queries and workflows, I find fact?<- useful mostly for testing how def*op functions behave inside of queries.
If you want to stub out an unfinished test and prevent it from throwing errors, you can use future-fact?-, like so:
    (future-fact?-
     "unwritten-query will convert input integer tuples to strings."
     [["one"] ["two"]]
     (unwritten-query [[1] [2]]))

    (let [src [[1] [2]]]
      (future-fact?<-
       "num->string is unwritten."
       [["one"] ["two"]]
       [?string]
       (src ?num)
       (num->string ?string)))
future-fact?- and future-fact?<- prevent their forms from being evaluated.
If you include a string at the beginning of a stubbed fact, it shows up in Midje's test report looking like this:
    WORK TO DO: unwritten-query will convert input integer tuples to strings.
    WORK TO DO: num->string is unwritten.
The fact?- and fact?<- operators provide the tools necessary to test complex MapReduce workflows as pure functions. Let's expand on these concepts by creating a small project with Cascalog code we'd like to test.
To add midje-cascalog support to your own project, add these entries to the :dev-dependencies vector within project.clj:
    [lein-midje "1.0.3"]
    [midje-cascalog "0.2.1"]
And add (:use [midje sweet cascalog]) to the namespace declaration of each of your testing namespaces.
Let's begin with an implementation of word count, the typical "Hello World!" of MapReduce. A word counting application must be able to read in any number of text files and generate tuples of the form [word, count] for each distinct word across all files.
The following code accomplishes this nicely. (Bear with me! A detailed discussion follows the code block.)
    (ns cascalog.testing-demo.core
      (:use cascalog.api)
      (:require [cascalog.ops :as c])
      (:gen-class))

    (defmapcatop split
      "Accepts a sentence 1-tuple, splits that sentence on whitespace,
      and emits a single 1-tuple for each word."
      [^String sentence]
      (seq (.split sentence "\\s+")))

    (defn wc-query
      "Returns a subquery that generates counts for every word in the
      text-files located at `text-path`."
      [text-path]
      (let [src (hfs-textline text-path)]
        (<- [?word ?count]
            (src ?textline)
            (split ?textline :> ?word)
            (c/count ?count))))

    (defn -main
      "Accepts the following arguments:

      - text-path (path to a textfile, or directory with textfiles)
      - results-path (location of textfile containing results)

      And prints lines of the form \"word count\" to a textfile at
      results-path. Each distinct word in the textfiles at text-path
      gets a count."
      [text-path results-path]
      (?- (hfs-textline results-path)
          (wc-query text-path)))
The -main function is the entry point to the word counting program. -main passes text-path on to wc-query, and writes all tuples generated by the returned query to a text file at results-path.
All of our program's application logic occurs in the query returned by wc-query; this is the most important function to test. Let's discuss how wc-query works:
wc-query is a function that returns a subquery.
It uses hfs-textline internally to generate a source of ?textline tuples.
Each line of text is passed to split, a Cascalog function that creates words from sentences, like this:
    (let [sentence [["two words"]]
          words    [["two"] ["words"]]]
      (fact?<- "split converts a sentence into words."
               words
               [?word]
               (sentence ?sentence)
               (split ?sentence :> ?word)))
The cascalog.ops/count function then counts the occurrences of each word, and the query emits one [?word ?count] pair per distinct word.
This logic looks right, but the only way to tell is to write a series of facts and see if they're true.
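Before writing facts, it can help to work out the expected tuples independently of Hadoop. The following is a minimal plain-Clojure sketch of the same logic, assuming whitespace splitting as in split above. split-words and word-counts are hypothetical helper names introduced for illustration; they are not part of Cascalog or midje-cascalog.

```clojure
(require '[clojure.string :as str])

(defn split-words
  "Hypothetical plain-Clojure stand-in for the split operation:
  splits a sentence on runs of whitespace."
  [sentence]
  (str/split sentence #"\s+"))

(defn word-counts
  "Hypothetical model of wc-query: takes a seq of 1-tuples of text
  and returns a set of [word count] tuples."
  [sentence-tuples]
  (->> sentence-tuples
       (mapcat (fn [[sentence]] (split-words sentence)))
       frequencies
       (map (fn [[word n]] [word n]))
       set))

;; Compare against a set, since tuple order is not significant here.
(= #{["word" 1] ["another" 2]}
   (word-counts [["another another word"]])) ;; true
```

A model like this is only a cross-check for hand-written expectations; the facts themselves still need to run against the real query.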
Let's put our tests in ./test/cascalog/testing_demo/core_test.clj (mirroring core.clj, with _test tacked on):
    (ns cascalog.testing-demo.core-test
      (:use cascalog.testing-demo.core
            cascalog.api
            [midje sweet cascalog])
      (:require [cascalog.ops :as c]))
Here's an initial try at a test of wc-query using fact?-:
    ;; /path/to/textfile points to a textfile with a single line:
    ;; "another another word"
    (fact?- "wc-query should count words from all lines of text at /path/to/textfile."
            [["word" 1] ["another" 2]]
            (wc-query "/path/to/textfile")) ;; FALSE!
This fact fails. Here are a few of its problems:
It depends on a text file actually existing at /path/to/textfile; without that file, the fact fails even when wc-query is correct.
It depends on hfs-textline. If hfs-textline fails, our fact fails.
Testing wc-query in isolation is difficult! How can one test the logic of wc-query without regard to how lines of text are stored?
The solution lies in Midje's ability to mock out a function's return values. Midje can hijack hfs-textline and force it to return anything you choose inside the body of a fact.
Using Midje's provided form, the above fact passes:
    (fact?- "wc-query should count words from all input sentences."
            [["word" 1] ["another" 2]]
            (wc-query :path)
            (provided
              (hfs-textline :path) => [["another another word"]])) ;; true
This fact states that when wc-query is called with :path, it will produce ["word" 1] and ["another" 2], provided (hfs-textline :path) produces a single tuple: ["another another word"].
Here's another true fact about wc-query that uses multiple input sentences:
    (def short-sentences
      [["this is a sentence sentence"]
       ["sentence with this is repeated"]])

    (def short-wordcounts
      [["sentence" 3]
       ["repeated" 1]
       ["is" 2]
       ["a" 1]
       ["this" 2]
       ["with" 1]])

    ;; when wc-query is called with :text-path
    ;; it will produce short-wordcounts,
    ;; provided (hfs-textline :text-path) produces short-sentences.
    (fact?- short-wordcounts
            (wc-query :text-path)
            (provided
              (hfs-textline :text-path) => short-sentences)) ;; true
A provided form only applies to the result-query pair directly above. The first fact is false, while the second fact is true:
    (let [sentence [["two words"]]
          results  [["two" 1] ["words" 1]]]
      (fact?- "provided form won't apply here!"
              results (wc-query :path) ;; false

              "provided applies here."
              results (wc-query :path) ;; true
              (provided
                (hfs-textline :path) => sentence)))
In the above facts, I used keywords (:path) as mocking arguments. Any form that evaluates to itself can be used as a mocking argument. In vanilla Clojure, this includes strings, numbers and keywords. Midje adds any symbol surrounded by dots (..path.., .path., etc.) to this mix.
These facts about wc-query are all true, and equivalent:
    (fact?- "Mocking with keywords,"
            [["one" 1]] (wc-query :path)
            (provided (hfs-textline :path) => [["one"]])

            "strings,"
            [["one" 1]] (wc-query "path")
            (provided (hfs-textline "path") => [["one"]])

            "numbers,"
            [["one" 1]] (wc-query 100)
            (provided (hfs-textline 100) => [["one"]])

            "and Midje dotted symbols."
            [["one" 1]] (wc-query ..path..)
            (provided (hfs-textline ..path..) => [["one"]]))
As discussed, the provided form only applies to the result-query pair directly above. This limitation can make for repetitive facts when each fact depends on a mocked result:
    (defn text->words
      [path]
      (let [src (hfs-textline path)]
        (<- [?word]
            (src ?sentence)
            (split ?sentence :> ?word)
            (:distinct false))))

    (let [sentence [["two two"]]]
      (fact?- "text->words cuts text into words."
              [["two"] ["two"]] (text->words :path)
              (provided (hfs-textline :path) => sentence)

              "wc-query counts the words in a sentence."
              [["two" 2]] (wc-query :path)
              (provided (hfs-textline :path) => sentence)))
Midje allows facts to share mocked functions with against-background. An against-background form placed anywhere inside the body of fact?- will apply to all facts inside the form:
    (let [sentence [["two two"]]]
      (fact?- "text->words cuts text into words."
              [["two"] ["two"]] (text->words :path)

              "wc-query counts the words in a sentence."
              [["two" 2]] (wc-query :path)

              "wc-query fact with different inputs."
              [["what" 1] ["a" 1] ["world!" 1]] (wc-query :path)
              (provided (hfs-textline :path) => [["what a world!"]])

              (against-background
                (hfs-textline :path) => sentence)))
Note that the third of the three above facts used its own provided form. When the two forms are mixed, provided takes precedence, shadowing against-background if need be (as above).
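One way to picture this precedence is as a map merge in which later entries win. This is only a mental model written in plain Clojure, not a description of how Midje implements mocking; effective-mocks is a hypothetical name, and the quoted call forms simply stand in for mocked calls.

```clojure
(defn effective-mocks
  "Hypothetical model: combines fact-wide background mocks with one
  fact's own provided mocks. Entries in `provided` shadow entries in
  `background` that share the same call form."
  [background provided]
  (merge background provided))

(def background
  {'(hfs-textline :path) [["two two"]]})

;; A fact without its own provided form sees the background value.
(get (effective-mocks background {})
     '(hfs-textline :path))
;; [["two two"]]

;; A fact with a provided form for the same call sees its own value.
(get (effective-mocks background {'(hfs-textline :path) [["what a world!"]]})
     '(hfs-textline :path))
;; [["what a world!"]]
```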
For the next set of facts, let's introduce a larger set of input sentences:
    (def longer-sentences
      [["Call me Ishmael. Some years ago -- never mind how long"]
       ["precisely -- having little or no money in my purse, and"]
       ["nothing particular to interest me on shore, I thought I"]
       ["would sail about a little and see the watery part of the world."]])
One issue with the above facts is that they use very small input sentences. wc-query will produce a rather large sequence of <word, count> pairs for a moderate number of input sentences. Facts like this are overwhelming:
    (fact?- [["Ishmael." 1]
             ["Some" 1]
             ["a" 1]
             ["about" 1]
             ["ago" 1]
             ;; and on and on...
             ]
            (wc-query :path)
            (provided (hfs-textline :path) => longer-sentences))
To solve this, Midje provides a number of collection checkers that provide you with finer control over how queries are compared with result sequences.
just is the default checker for fact?- and fact?<-; bare vectors of tuples resolve to (just result-vec :in-any-order). The following three facts are equivalent:
    (let [src   [[1] [2]]
          query (<- [?a ?b]
                    (src ?a)
                    (inc ?a :> ?b))]
      (fact?- "Just form, fully qualified."
              (just [[2 3] [1 2]] :in-any-order) query ;; true

              "Wrapping tuples in a set is identical to including the :in-any-order modifier."
              (just #{[2 3] [1 2]}) query ;; true

              "midje-cascalog lets us drop these wrappers."
              [[2 3] [1 2]] query)) ;; true
Each of these facts checks that its subquery returns [2 3] and [1 2] exclusively, in any order. Any missing or extra tuples in the result vector will cause a failure.
Note that dropping the :in-any-order modifier (or the set wrapper) will cause facts to fail if ordering doesn't match. This makes sense sometimes when checking against top-n queries, as noted in the discussion below on has-prefix.
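The comparison implied by a bare vector of tuples can be modeled in a few lines of plain Clojure. This is a sketch of the semantics described above (same tuples, any order, nothing missing or extra), not Midje's actual implementation; same-tuples? is a hypothetical name.

```clojure
(defn same-tuples?
  "Hypothetical model of an exact, order-insensitive match: true when
  both collections hold the same tuples the same number of times."
  [expected actual]
  (= (frequencies expected) (frequencies actual)))

(same-tuples? [[2 3] [1 2]] [[1 2] [2 3]])        ;; order differs: true
(same-tuples? [[2 3] [1 2]] [[1 2]])              ;; missing tuple: false
(same-tuples? [[2 3] [1 2]] [[1 2] [2 3] [3 4]])  ;; extra tuple: false
```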
The contains form allows facts to check against a subset of query tuples. By default, contains requires result tuples to be contiguous and ordered: [1 2] within [3 4 1 2 1], for example.
These restrictions are quite limiting for most Cascalog queries. The following two facts avoid both restrictions:
    (fact?- (contains #{["sail" 1] ["Ishmael." 1]} :gaps-ok)
            (wc-query :path) ;; true

            (contains [["sail" 1] ["Ishmael." 1]] :gaps-ok :in-any-order)
            (wc-query :path) ;; true

            (against-background
              (hfs-textline :path) => longer-sentences))
The above facts test that both ["sail" 1] and ["Ishmael." 1] appear somewhere in the results, in any order.
The :in-any-order keyword relaxes the ordering restriction.
The :gaps-ok keyword relaxes the restriction that tuples must be contiguous.
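The difference between the default behavior and the relaxed one can be sketched in plain Clojure. Both functions below are hypothetical models of the semantics described above, not Midje's implementation.

```clojure
(defn contains-contiguous?
  "Hypothetical model of the default: `expected` must appear in
  `actual` as an ordered run with no gaps."
  [expected actual]
  (boolean (some #(= (seq expected) %)
                 (partition (count expected) 1 actual))))

(defn contains-anywhere?
  "Hypothetical model of :gaps-ok plus :in-any-order: every expected
  tuple must appear somewhere in `actual`."
  [expected actual]
  (let [have (frequencies actual)]
    (every? (fn [[tuple n]] (<= n (get have tuple 0)))
            (frequencies expected))))

(contains-contiguous? [1 2] [3 4 1 2 1]) ;; true
(contains-contiguous? [1 2] [1 3 2])     ;; false, there is a gap
(contains-anywhere? [["sail" 1] ["Ishmael." 1]]
                    [["Ishmael." 1] ["a" 1] ["sail" 1]]) ;; true
```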
has-prefix checks that the supplied tuple sequence appears at the beginning of the query's results. has-prefix only makes sense with queries that return sorted tuples.
The following fact states that ["--" 2], ["I" 2] and ["and" 2], in order, are the three most common words across all words in longer-sentences:
    (fact?- (has-prefix [["--" 2] ["I" 2] ["and" 2]])
            (-> (wc-query :path)
                (c/first-n 10 :sort ["?count"] :reverse true))
            (provided (hfs-textline :path) => longer-sentences)) ;; true
has-suffix checks that the supplied tuple sequence appears at the end of the query's results.
The following fact states that ["world." 1], ["would" 1] and ["years" 1], in order, are the last three words (by alphabetical order) across all words in longer-sentences:
    (fact?- (has-suffix [["world." 1] ["would" 1] ["years" 1]])
            (-> (wc-query :text-path)
                (c/first-n 100 :sort ["?word"]))
            (provided (hfs-textline :text-path) => longer-sentences)) ;; true
As with has-prefix, facts making use of has-suffix only make sense when specifically testing tuple ordering.
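The prefix and suffix checks are easy to picture on an already-sorted sequence. The sketch below models them in plain Clojure on a small hand-sorted sample; prefix? and suffix? are hypothetical names, and this is not how Midje implements has-prefix or has-suffix.

```clojure
(defn prefix?
  "Hypothetical model of has-prefix: `expected` matches the first
  tuples of `actual`, in order."
  [expected actual]
  (= (seq expected) (seq (take (count expected) actual))))

(defn suffix?
  "Hypothetical model of has-suffix: `expected` matches the last
  tuples of `actual`, in order."
  [expected actual]
  (= (seq expected) (seq (take-last (count expected) actual))))

;; A few word counts, sorted alphabetically by word.
(def by-word
  (sort-by first [["would" 1] ["years" 1] ["a" 1] ["world." 1]]))

(prefix? [["a" 1]] by-word)                                 ;; true
(suffix? [["world." 1] ["would" 1] ["years" 1]] by-word)    ;; true
(suffix? [["would" 1] ["world." 1] ["years" 1]] by-word)    ;; false, order matters
```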
In certain cases, you might like to test a single query against a wide range of inputs and outputs. This quickly grows repetitive:
    (fact?- [["mock" 1] ["it" 1] ["out!" 1]]
            (wc-query :path)
            (provided (hfs-textline :path) => [["mock it out!"]]) ;; true

            [["two" 3]]
            (wc-query :path)
            (provided (hfs-textline :path) => [["two two two"]]) ;; true

            [["M.M" 1] ["nathan" 1]]
            (wc-query :path)
            (provided (hfs-textline :path) => [["nathan M.M"]])) ;; true
Gah! against-background doesn't work here, since these facts mock against different sentences each time.
Midje's tabular form provides an elegant way to collapse this repetition:
    (tabular
     (fact?- "Tabular generates lots of facts, one for each set of substitutions in the table below."
             ?results
             (wc-query :path)
             (provided (hfs-textline :path) => [[?sentence]]))
     ?sentence       ?results
     "mock it out!"  [["mock" 1] ["it" 1] ["out!" 1]]
     "two two two"   [["two" 3]]
     "nathan M.M"    [["M.M" 1] ["nathan" 1]]) ;; 3 true facts
(This one's a little involved, but the results are really beautiful.)
tabular accepts three types of arguments: a fact?- or fact?<- templating form, a header row of variables prefixed by ? (?sentence and ?results, in the above fact), and any number of rows of substitution values. It generates a separate fact for every substitution row. It does this by substituting each value into the templating form in place of the header variable at the top of its column.
The first fact generated by the above tabular fact looks like this:
    (tabular
     ;; Tabular takes this templating form:
     (fact?- "Tabular generates lots of facts, one for each set of substitutions in the table below."
             ?results
             (wc-query :path)
             (provided (hfs-textline :path) => [[?sentence]]))
     ;; and substitutes these variables:
     ?sentence       ?results
     "mock it out!"  [["mock" 1] ["it" 1] ["out!" 1]]) ;; true

    ;; to produce this fact:
    (fact?- [["mock" 1] ["it" 1] ["out!" 1]]
            (wc-query :path)
            (provided (hfs-textline :path) => [["mock it out!"]])) ;; true
Any variable prefixed by ? that appears inside both the fact template AND the header row is earmarked for substitution. Cascalog variables that don't appear in the header row (such as ?word inside a query) are left alone, so they are totally safe and play well with tabular.
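The substitution step can be sketched with clojure.walk on quoted forms. This is a plain-Clojure illustration of the idea, not Midje's implementation of tabular; expand-table and template are hypothetical names, and nothing here executes a fact.

```clojure
(require '[clojure.walk :as walk])

(defn expand-table
  "Hypothetical model of tabular: returns one copy of `template` per
  row, with each header symbol replaced by that row's value. Symbols
  that are not in `headers` (such as ?word) are left untouched."
  [template headers rows]
  (for [row rows]
    (walk/postwalk-replace (zipmap headers row) template)))

(def template
  '(fact?<- ?results
            [?word ?count]
            ((hfs-textline :path) ?sentence)
            (count-words ?sentence :> ?word ?count)
            (provided (hfs-textline :path) => [[?text]])))

;; Only ?text and ?results are header variables here, so the query's
;; own ?word, ?count and ?sentence variables survive substitution.
(expand-table template
              '[?text ?results]
              '[["two two two" [["two" 3]]]
                ["nathan M.M"  [["M.M" 1] ["nathan" 1]]]])
```

count-words in the template is a made-up operation name used only to give the quoted form some query-like shape.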
Once you write facts within a project, you can use lein-midje to run them all and generate a summary like this:
    Checking function: (midje.sweet/just [["Ishmael." 1] ["Some" 1] ["a" 1] ["about" 1] ["ago" 1]] :in-any-order)
    The checker said this about the reason:
        Expected five elements. There were thirty-nine.
    FAILURE: 6 facts were not confirmed. (But 37 were.)
If you're using the Leiningen build manager, follow these steps:
Add [lein-midje "1.0.7"] to the :dev-dependencies entry in your project.clj.
Run lein midje at the command line in your project's root directory.
This command runs all facts and tests in the project and prints a summary of all results to stdout.
If you're using Cake, follow the steps on the Midje wiki for installing and running cake midje.
If you currently write deftest style tests using clojure.test, check out Midje's tips on integration. The two modes work very well together. lein midje and cake midje will evaluate all deftest forms inside of a project and include the results in their reports.
I believe that midje-cascalog is the most advanced MapReduce testing suite available today. The primitives discussed here make testing Cascalog queries a joy; the confidence that comes from fully tested components is a prerequisite for creative work at large scale.
Please let me know what you think of the project! I'm happy to extend midje-cascalog in any way that helps the cause. Have fun testing!