One problem with many existing MapReduce abstraction layers is the utter difficulty of testing queries and workflows. End-to-end tests are maddening to craft in vanilla Hadoop and frustrating at best in Pig and Hive. The difficulty of testing MapReduce workflows makes it scary to change code, and destroys your desire to be creative. A proper testing suite is an absolute prerequisite to doing creative work in big data.
In this blog post, I aim to show how most of the difficulty of writing and testing MapReduce queries stems from the fact that Hadoop confounds application logic with decisions about data storage. These problems are the result of poorly implemented abstractions over the primitives of MapReduce, not problems with the core MapReduce algorithms.
The Cascalog abstraction layer fixes this issue by separating logic from data, allowing you to play creatively at massive scale. (If you're not familiar with Cascalog, know that it's by far the most expressive MapReduce DSL in existence. We make heavy use of Cascalog here at Twitter. I recommend getting started with this introduction, then moving on to the wiki.)
Cascalog and its testing suite, midje-cascalog, allow us to test application logic in isolation; the resulting tests are truly beautiful.
Say you've implemented wordcount in MapReduce, and are looking to write unit tests against your workflow. The Hadoop consulting giant Cloudera defines current best practices in this article:
> The current state of the art often involves writing a set of tests that each create a JobConf object, which is configured to use a mapper and reducer, and then set to use the LocalJobRunner (via `JobConf.set("mapred.job.tracker", "local")`). A MapReduce job will then run in a single thread, reading its input from test files stored on the local filesystem and writing its output to another local directory.
>
> This process provides a solid mechanism for end-to-end testing, but has several drawbacks. Developing new tests requires adding test inputs to files that are stored alongside one's program. Validating correct output also requires filesystem access and parsing of the emitted data files. This involves writing a great deal of test harness code, which itself may contain subtle bugs. Finally, this process is slow. Each test requires several seconds to run. Users often find themselves aggregating several unrelated inputs into a single test (violating a unit testing principle of isolating unrelated tests) or performing less exhaustive testing due to the high barriers to test authorship.
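To make that friction concrete, here's a minimal sketch of the harness Cloudera describes, written with Clojure's Java interop against Hadoop's old `org.apache.hadoop.mapred` API. The `IdentityMapper` and `IdentityReducer` are stand-ins so the sketch is self-contained; a real test would wire in its own classes and then parse the output directory by hand:

```clojure
(import '[org.apache.hadoop.fs Path]
        '[org.apache.hadoop.mapred JobConf JobClient
          FileInputFormat FileOutputFormat]
        '[org.apache.hadoop.mapred.lib IdentityMapper IdentityReducer])

(defn run-local-job
  "Runs a MapReduce job in a single local thread, reading fixture files
   from in-dir and writing output files under out-dir."
  [in-dir out-dir]
  (let [conf (doto (JobConf.)
               (.set "mapred.job.tracker" "local")  ; run via LocalJobRunner
               (.set "fs.default.name" "file:///")  ; local FS, not HDFS
               (.setMapperClass IdentityMapper)     ; stand-in for a real mapper
               (.setReducerClass IdentityReducer))] ; stand-in for a real reducer
    (FileInputFormat/setInputPaths conf in-dir)     ; comma-separated path string
    (FileOutputFormat/setOutputPath conf (Path. out-dir))
    (JobClient/runJob conf)))

;; Every test built on this still has to create fixture files under in-dir,
;; then reopen and parse the part-* files under out-dir to check results.
```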
This is insane. Written this way, unit tests meant to target application logic are dwarfed by the maintenance they impose.
Most real world MapReduce applications involve chained sequences of many queries; these flows are a nightmare to coordinate and test using the above methods. Perhaps for this reason, Cloudera recommends testing mappers and reducers separately:
> your `map()` and `reduce()` calls should still be tested individually, as the composition of separate classes may cause unintended bugs to surface.
Again, this is totally batshit crazy. MapReduce queries are functions that operate on datasets. Imagine writing a Clojure function and testing its lines one at a time, in isolation; the approach makes no sense.
A well designed testing suite will examine each MapReduce query as the function that it is. Queries should be tested against a range of inputs, both in isolation and as components of more complicated workflows. It shouldn't matter how one's data is stored; query logic is independent of data storage, and should be tested as such.
I'd like to offer a saner approach to MapReduce testing. In the following example, I'll write a MapReduce query with Cascalog and test it using midje-cascalog, a thin wrapper I wrote over the Midje testing DSL.
Let's say you want to test a Cascalog workflow that examines your user datastore and returns the user with the greatest number of followers. Your workflow's top level query will generate a single tuple containing that user's name and follower-count. Here's the code:
```clojure
(defn max-followers-query [datastore-path]
  (let [src (name-vars (complex-subquery datastore-path)
                       ["?user" "?follower-count"])]
    (cascalog.ops/first-n src 1
                          :sort ["?follower-count"]
                          :reverse true)))
```
`max-followers-query` is a function that returns a Cascalog subquery. It works like this:

- `max-followers-query` accepts a path (`datastore-path`) and passes it into a function called `complex-subquery`.
- `complex-subquery` returns a subquery that generates 2-tuples; this subquery is passed into `name-vars`.
- `name-vars` binds this subquery to `src` after naming its output variables `?user` and `?follower-count`.
- `first-n` returns a subquery that sorts the tuples of `src` in reverse order by follower count and keeps only the top tuple (sketched in plain Clojure below).
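For intuition, here's what that last step boils down to in plain Clojure. This is a sketch of the logic only, not how Cascalog implements `first-n`, and `top-by-followers` is a name invented for illustration:

```clojure
;; Sort tuples descending by their second field (the follower count)
;; and keep only the top tuple.
(defn top-by-followers [tuples]
  (take 1 (sort-by second > tuples)))

(top-by-followers [["sritchie09" 180] ["richhickey" 2961]])
;; => (["richhickey" 2961])
```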
At a high level, the subquery returned by `max-followers-query` is responsible for a single piece of application logic: picking the tuple with the highest `?follower-count` from the tuples returned by `(complex-subquery datastore-path)`.
A correct test of `max-followers-query` will test this piece of logic in isolation.
If you were to follow Cloudera's advice on how to test this query, you would have to:

- dig into the implementation of `complex-subquery` and figure out what data it depends on to produce its tuples;
- fake that data and write it to the local filesystem in properly formatted files;
- run the subquery returned by `max-followers-query` into a temporary SequenceFile; and
- read that SequenceFile back off disk and parse its tuples to validate them against expected results.
This series of steps adds tremendous friction to the testing process and obscures the purpose of the test. Moreover, the results of any test of `max-followers-query` depend on any number of subqueries invoked by the call to `complex-subquery`.
If your tests are this difficult to write, you're not going to write very many tests. You need tests to be creative; without them, you won't change production queries for fear of introducing some bug you don't understand.
Midje circumvents all of this complexity by mocking out the result of `(complex-subquery datastore-path)` and forcing it to return a specific Clojure sequence of `[?user ?follower-count]` tuples.
The following form is a Midje test, or "fact":
```clojure
(fact?- "Query should return a single tuple containing [most-popular-user, follower-count]."
  [["richhickey" 2961]]
  (max-followers-query :path)
  (provided
    (complex-subquery :path) => [["sritchie09" 180]
                                 ["richhickey" 2961]]))
```
Facts make statements about queries. A fact passes if its statements are true and fails otherwise. The above fact states that:

- when `max-followers-query` is called with the argument `:path`,
- it produces a single tuple, `[["richhickey" 2961]]`,
- provided `(complex-subquery :path)` produces `[["sritchie09" 180] ["richhickey" 2961]]`.
Fact-based testing separates application logic from the way data is stored. By mocking out `complex-subquery`, our fact tests `max-followers-query` in isolation and proves it correct for all expected inputs.
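Because the mocked data is just a Clojure sequence, covering more inputs means stating more facts; no fixture files required. Here's one more fact in the same style (the extra usernames and follower counts are made up for illustration):

```clojure
(fact?- "Query should pick the most-followed of three users."
  [["clojure_fan" 505]]
  (max-followers-query :path)
  (provided
    (complex-subquery :path) => [["sritchie09" 180]
                                 ["clojure_fan" 505]
                                 ["hadoopist" 23]]))
```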
This approach is not just better than the "state of the art" of MapReduce testing as defined by Cloudera; it completely obliterates the old way of thinking, and makes it possible to build very complex workflows with a minimum of uncertainty.
Fact-based tests are the building blocks of rock-solid production workflows.
If you're interested in how to construct fact-based tests with midje-cascalog, I go into great detail on functional MapReduce testing idioms and methods in Testing Cascalog with Midje.