Ensuring API Parity with Laboratory

The Setup

At Dimagi we’ve been suffering under a Jython server for years. When we originally launched an experimental web client for our Java Phone and Android applications, Jython seemed like the logical bridge between our Java core libraries and Python-based Django web service. Jython allowed us to hook into our Java library directly and leverage our server team’s Python knowledge. We could avoid implementing an un-cool Java server and instead use a bleeding-edge technology – what could go wrong?

As our web client (codenamed CloudCare) grew from an experimental tool to one of our core offerings Jython’s shortcomings were exposed. Interaction with other Java libraries was difficult – getting Jython to play nicely with Postgres through JDBC was an exercise in trial and error, and patience. Debugging Java issues was a nightmare as we couldn’t step through any of the Java code from the Python environment. Dev tools like code completion and Find Usages were non-existent. Stack traces are mostly an illegible mess of interweaving Java and Python methods. And perhaps most difficult was the lack of thorough documentation for understanding the often very complicated interaction of Python and Java code.

Thus in the Fall of 2015 we made the decision to re-implement the entire server in Spring. This blog will cover more about this migration (which consumed the majority of my 2016) in later posts, but we’ll step over that in a sentence for now.

With the server up and running with partial functionality we successfully ran QA under a feature flag and migrated some internal domains onto the new backend. However, we still could not make ourselves fully confident that the new backend had full parity, and for good reason. The problem space for the server is simply massive.

Our product, CommCare, allows users to build large forms that can track and share cases and manage supply inventories. These forms can have about thirty types of questions, repeat groups, complex branching logic, multiple languages, multimedia, and umpteen other customizations. The server would be responsible for managing every aspect of using these forms – displaying the forms, performing the form and case transactions, managing the user’s database, etc.

All of this means that the complexity and variability of the inputs and outputs is massive and fully comprehensive testing is impossible. Ideally, we would test the feature with real data at product scale with no impact on the user experience. Enter Laboratory.

The Idea

When migrating from an old code path to a new one we add a toggle that allow us to route the function calls to the new path for QA and test projects without impacting live projects at all. Once we’re reasonably assured about the stability and correctness of the new path we can begin gradually migrating domains onto the new one – keeping an eye out for errors – until the old path is completely deprecated.

In Laboratory instead of choosing between the old and new code and the join point we run both in an ‘experiment’ abstraction and then compare the results for parity and timing. Then we return the result of the old path (including any exceptions thrown) so that the user experience is unchanged.


Our code was well suited for a Laboratory experiment. All of our calls to the old backend were routed through a single function. So, inserting an experiment here was quite straightforward:

def perform_experiment(d, auth, content_type):
    experiment = FormplayerExperiment(name=d["action"], context={'request': d})
    with experiment.control() as c:
        c.record(post_data_helper(d, auth, content_type, settings.XFORMS_PLAYER_URL))
    with experiment.candidate() as c:
        # If we should already have a session, look up its ID in the experiment mapping. it better be there.
        if "session_id" in d:
            d["session_id"] = FormplayerExperiment.session_id_mapping.get(d["session_id"])
        c.record(post_data_helper(d, auth, content_type, settings.FORMPLAYER_URL + "/" + d["action"]))
    objects = experiment.run()
    return objects

First we create an instance of our laboratory.Experiment subclass (more on that later). Next, we make the post_data call to our old backend as before:

c.record(post_data_helper(d, auth, content_type, settings.XFORMS_PLAYER_URL))

As the experiment control. Next, we run the new call as the experiment candidate

c.record(post_data_helper(d, auth, content_type, settings.FORMPLAYER_URL + "/" + d["action"]))

(The slightly different URL syntax is due to a difference in our old and new servers)

Finally, we “run” the experiment which performs the comparison and calls back to a publish() function that we define in our Experiment subclass.

The lines that I skipped above hint at one of the two major problems we had to solve in using Laboratory. First, during form entry we use session tokens to track the user’s state on the backend to avoid sending the entire form state on each request. This obviously presents a major difficulty in comparing the calls: namely, each browser request would contain a session key corresponding to some session object that exists on the old server, but not the new one where the record would have a totally different key. We resolve this issue by storing a mapping from the ‘control’ session ID to the corresponding ‘candidate’ session ID each time we create a new form:

# if we're starting a new form, we need to store the mapping between session_ids so we can use later
if (self.name == "new-form"):
    control_session_id = json.loads(result.control.value)["session_id"]
    candidate_session_id = json.loads(result.observations[0].value)["session_id"]
    FormplayerExperiment.session_id_mapping[control_session_id] = candidate_session_id

Then, every time we’re going to make a request to the candidate server we replace the old session_id key with its corresponding new one.

if "session_id" in d:
d["session_id"] = FormplayerExperiment.session_id_mapping.get(d["session_id"])

This allows us to manipulate the two sessions in parallel and continually compare the result. The final component of our experiment is the actual Experiment subclass:

class FormplayerExperiment(laboratory.Experiment):
    session_id_mapping = {}

    def publish(self, result):
        # if we're starting a new form, we need to store the mapping between session_ids so we can use later
if (self.name == "new-form"):
    control_session_id = json.loads(result.control.value)["session_id"]
    candidate_session_id = json.loads(result.observations[0].value)["session_id"]
    FormplayerExperiment.session_id_mapping[control_session_id] = candidate_session_id

In our case, we’re overriding the publish() function which is called when we have the result of our experiment and want to determine what information to store and, in our case, adding new session ID mappings if necessary. In order we are:

  • Adding a new session ID mapping when we’ve performed a new-form action
  • Logging the timing of the control and candidate
  • Comparing the values of the control and candidate and logging the difference if one exists

This last point ended up being the second tricky part of this process. We ended up with a handful of small edge cases that broke the equality checking. In some cases this was a difference in platforms  – booleans ended up as ‘1’ and 0′ in JSON output from Python, while they were ‘True’ and ‘False’ in Java. Further, we needed to ignore the SessionID and some other outputs like random ID’s from forms that would be necessary different. Finally all these values were inside a large JSON tree that we wanted to ensure were the same structure in both cases.



When we turned this on we immediately got actionable data. For example, one request generated the diff:

2016-06-08 22:17:53,675, new-form, {u'lang': u'en', u'session-id': None, u'domain': u'test2', u'session-data': {u'username': u'test', u'additional_filters': {u'footprint': True}, u'domain': u'test2', u'user_id': u'aa8c9268917a7c48fee7d30f3b12df92', u'app_id': u'8686d27973591f96f0ea0da8cd891323', u'user_data': …}

2016-06-08 22:17:53,674 INFO Mismatch with key ix between control 0J and candidate 0_-10

2016-06-08 22:17:53,674 INFO Mismatch with key relevant between control 1 and candidate 0

2016-06-08 22:17:53,674 INFO Mismatch with key header between control question1 and candidate None

2016-06-08 22:17:53,675 INFO Mismatch with key add-choice between control None - Add question1 and candidate None

The first line tells us this was a new-form request; the last three lines tell us that we weren’t generating a repeated question properly in this form. The second line tells us why – we weren’t properly encoding the index of this question in the form. This issue arose only arose under a specific ordering of question groups so we missed it during QA and likely would have introduced this bug “in the wild” if not for Laboratory.


For each call we also generated a comma separated list of values: the time of the request, the request type, and the control and experiment execution times:

2016-05-24 10:08:41,182, answer, 0:00:00.052370, 0:00:00.024141
2016-05-24 12:32:55,501, get-instance, 0:00:00.031898, 0:00:00.019106
2016-05-24 12:33:14,427, answer, 0:00:00.067201, 0:00:00.028408
2016-05-24 12:33:15,003, get-instance, 0:00:00.026185, 0:00:00.019381
2016-05-24 12:39:09,586, new-form, 0:00:03.156963, 0:00:02.310486
2016-05-24 12:39:36,425, get-instance, 0:00:00.022434, 0:00:00.016840
2016-05-24 12:47:26,084, evaluate-xpath, 0:00:05.037198, 0:00:00.016030
2016-05-24 12:51:47,146, evaluate-xpath, 0:00:00.045509, 0:00:00.018033
2016-05-24 12:52:02,801, evaluate-xpath, 0:00:00.033657, 0:00:00.023034
2016-05-24 15:55:59,949, new-form, 0:00:05.362653, 0:00:02.199996
2016-05-24 16:11:53,493, answer, 0:00:00.052868, 0:00:05.020280
2016-05-24 16:11:54,067, get-instance, 0:00:00.023180, 0:00:00.017108
2016-05-24 16:12:01,222, get-instance, 0:00:00.023897, 0:00:00.011998

Pulling this data into a (very rudimentary) graph we were able to validate the performance gains we’d expected from moving to Spring:



We viewed our experiment with Laboratory as a huge success. We not only caught bugs missed by QA that we likely would have released otherwise but also validated the performance gains we’d noticed anecdotally previously. Even if we’d found the equality comparison too unwieldy to implement, logging only the timing with a stripped down experiment would have had huge value as well.

One major advantage we had was that all calls were routed through the extant post_data method so that we could test all functions with minimal impact on the code. The largest obstacle was the parallel sessions, though the solution ended up being quite elegant.

Laboratory itself was very usable – the concept and implementation is straightforward and plugging in code where we needed to overwrite the default behavior was easy. Our biggest feature request would be the ability to make the candidate request and publishing asynchronously from the user-visible control call and return. Fears that blocking on our candidate function to return would cause performance issues led us to limit the rollout at first. We’ve now implemented our second experiment and will continue look for opportunities to experiment further.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s