Java String hashCode() Collision

Another week on interrupt duty, another Javarosa engine deep dive.

This morning we received a ticket reporting that a user’s form was pre-populating some questions with seemingly random answers. Further, some answer sequences would cause the form to skip entire question groups – even those that didn’t have display logic. My first thought was that the form was constructed improperly, but inspecting the XML everything seemed hunky dory. Still, I was able to reproduce the bug immediately using our command line tool, indicating a bug in our core Javarosa library (rather than in our Android wrapper).

Some background: CommCare’s forms are written to the XForm spec. These forms work by defining and populating an XML data model with logic and UI elements defined in the form’s data bindings. In CommCare, users construct this forms in our Form Builder web application and then fill out the forms on our mobile application, submitting the XML blocks back to the server. When users fill out forms, you can view this XML instance being populated. In the user’s form definition the (blank) data instance looks like this:

<H_food_type_repeat jr:template="">
  <H_calculate_food_type_id/>
  <calculate_food_type/>
  <H2a/>
  <H2b/>
  <H2c/>
  <H3A/>
  <H3B/>
</H_food_type_repeat>

But when I inspected the data instance from within the application the XML appeared differently:

 <H_food_type_repeat>
    <H_calculate_food_type_id>1</H_calculate_food_type_id>
    <calculate_food_type>Cabbage</calculate_food_type>
    <H2a/>
    <H2b/>
    <H2c/>
    <H3A/>
    <H2a/>
  </H_food_type_repeat>

Notice that the node appears twice; not only different from the form definition, but clearly illegal XML. Even more strangely, when I stepped into the code that builds that constructs this instance the XML was generated correctly – with the node intact!

At this point I began getting flashbacks to the infinity bug and accordingly suspected caching. Our engine is chock-full of caching and interning optimizations that greatly improve our performance but can lead to some gnarly bugs. “H2a” and H3B” also seem like dangerously similar variables names – a hash collision waiting to happen.

Sure enough the hashCode() function for these nodes returns identical values. This caused a collision at a critical junction when we were retrieving these objects so that we’d retrieve the node “H2a” when attempting to retrieve “H3B”. This caused us to replace H3B with H2a in the instance structure as well as to insert the answer to H2a into the node H3B.

At this point the fix was trivial (in this case we simply removed the optimization and performed a full equality check instead of just comparing hash values). However we were curious how this happened given that the probability of a hash collision with a good function across a 32-bit hash space should be exceedingly small.

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

One thing immediately jumps out – this function looks way too simple! Summarily, the character at position 0 is added directly, the character at position 1 is multiplied by 31, the character at position 2 by 31^2, etc. This can result in strings having equal hash codes fairly easily, particularly for shorter strings.

For example, if given two Strings X and Y of equal length and the ASCII values of the characters at position N differ by 1 (for example the characters ‘a’ and ‘b’ or, in this case, ‘2’ and ‘3’) then the ASCII values of the characters at position N+1 need only differ by 31 (in this case ‘a’ and ‘B’) in order for the hashCodes two strings to be completely identical! this is more common than you might expect because the ASCII Values of the upper and lower case versions of a letter differ by 32; so, int(‘A’) = 65 and int(‘a’) = 97. Given this its surprising a multiplier of 31 was chosen; however, some research reveals some interesting reasoning behind this:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

For large strings such a collision won’t be likely – most of our user’s questions have ID’s such as “mother_name” or “hiv_status” that won’t likely have collisions. However, longer clinical forms very often have question series named A1, A2… G12, etc. Given this, I’m actually surprised we hadn’t run into this issue before.

Again harking back to the Infinity bug, the core issue here was our mis-using canned Java functionality. The Java String hashcode() function was designed with HashMap keys in mind. In this application the hashing function’s speed is the priority and collisions aren’t a huge concern. However, in our setup hash collisions caused catastrophic failure. We should have used a hashing function with a more reliable distribution if we’d been intent on this setup.

Infinity and Beyond

One of the most interesting (read: frustrating) parts about managing an Android deployment are the myriad bugs you encounter that are particular to certain devices or OS versions. Most often these are bugs are hardware related; for example, external SD card reads won’t work on a certain device. Also common are software issues, where a device won’t respect an OS contract properly. Least common, in my experience, are true Android OS bugs that manifest as reliable hard crashes. These bugs are likely to be fixed and the devices quickly updated to newer, bug-fixed versions. However, in the international development world we often have to make do with older devices that cannot update their OS due either to hardware or network constraints. In these deployments obscure, old bugs linger for years – this is the story of one of those bugs.

On Friday September 16 (never accept bugs on Fridays) we received an internal bug report from a developer on one of our biggest deployments reporting that one device was consistently failing to look-up a case in its case database. All other devices could retrieve the case using the same model and query. The suspect device could look up other cases successfully, but not this one. The device wouldn’t crash but would just appear as though this case did not exist. Perhaps significantly (or perhaps not) this device was the only one running Android 4.X.

At this point I began a diagnostic process

  1. Examined the user’s restore payload which contains the user’s assigned cases. The “missing” case was present in the restore payload and formatted correctly to boot.
  2. Stepped through the code that parses the restore payload into database entries. By all appearances the device reads the XML and writes a database entry for the case.
  3. Examined the application’s SQLite database directly. The case appeared there and the index looked correct!

At this point I was baffled. The database looked entirely correct, I could perform the query myself and retrieve the case, and the form had no problems – everything worked for almost all devices! Dauntingly, all of this indicated that the bug was somewhere deep in the abyss of our XForm parsing code.

Some historical context. CommCare forms follow the XForm spec and our transactions like restoring users and form submissions are done with XML payloads in the same spirit. XML is decidedly un-hip now but these specs provide us with a robust framework for defining applications and user data.

CommCare was also originally written in J2ME for Java phones and so anticipated running on devices with memory counted in low megabytes. This means our core code is chock-full of techniques for reducing our memory overhead such as caching and interning. This is the world I’d now have to step into!

I could approach the problem from two places, the original write of the case or the lookup before form entry. I decided to start with the read since by all appearances the entry in the case database was completely correctly. Fortunately at this time I discovered conditional breakpoints, an amazing tool that allowed me to step into the code at exactly the point where we began attempting to search for the case by having the condition test whether the case’s index was being used.

Once I started looking around this code I noticed that at some point variables which should have contained the case id instead were set to Double.POSITIVE_INFINITY. Triangulating this eventually lead me to the function here where we attempt to cast the String look up key (in this case the case’s id) to a Double if possible since this gives us more performant equality and comparison operations.  If we determine that the String is a double we also add it to a parsing cache that we use for subsequent lookups. To determine if the String is a double we use the built in Java function:

Double ret = Double.parseDouble(attrValue);

Which ended up being the source of the bug. This function was accepting a hexadecimal string:

1d9098a0c23a0c83740547dd946e3783

And saying this was a valid double – Infinity, to be precise! The trouble, as always, lies in scientific notation. Quoting from my own pull request:

Unfortunately, until version 4.4 Android had a bug with the parseDouble function that you can read about here: https://android-review.googlesource.com/#/c/102376/. Because Java doubles use the exponential format MeP the “e” character could be in a valid double in certain circumstances. In the bug above case, the fixture ID was 1d9098a0c23a0c83740547dd946e378

So, the bug would parse the trailing 946e3783 as a valid exponential with mantissa even though the rest of the string was complete garbage. This would evaluate to infinity (or Double.POSITIVE_INFINITY to be precise) and then be stored in the cache table as a valid double mapping.

So, we stored this key’s value as being Infinity and thus later lookups for this string literal would fail. Worse still, CommCare had no idea anything had failed – from the perspective of the application all had gone well and we’d successfully cached a double, saving precious memory. Our solution was to simply check for scientific notation before parsing the double – functionality we already had, since XPath doesn’t recognize scientific notation.

A few lessons learned. Clearly, using the built in parseDouble function was a mistake since Java’s and XPath’s ideas of “what is a Double” diverge in this respect. We were using a function that would aggressively attempt to parse a String to a double, when really we wanted a function that would be much stricter since a false positive would be much more costly than a false negative. We also learned that bugs can exist and lie dormant for years before rearing their heads.

However, my primary takeaway from this was laughter since the bug was both hilariously obscure and adorably typical. Android developers are used to encountering strange, device specific bugs since every device and OS version diverge are special snowflakes. This one was so emblematic that I could only laugh.

Ensuring API Parity with Laboratory

The Setup

At Dimagi we’ve been suffering under a Jython server for years. When we originally launched an experimental web client for our Java Phone and Android applications, Jython seemed like the logical bridge between our Java core libraries and Python-based Django web service. Jython allowed us to hook into our Java library directly and leverage our server team’s Python knowledge. We could avoid implementing an un-cool Java server and instead use a bleeding-edge technology – what could go wrong?

As our web client (codenamed CloudCare) grew from an experimental tool to one of our core offerings Jython’s shortcomings were exposed. Interaction with other Java libraries was difficult – getting Jython to play nicely with Postgres through JDBC was an exercise in trial and error, and patience. Debugging Java issues was a nightmare as we couldn’t step through any of the Java code from the Python environment. Dev tools like code completion and Find Usages were non-existent. Stack traces are mostly an illegible mess of interweaving Java and Python methods. And perhaps most difficult was the lack of thorough documentation for understanding the often very complicated interaction of Python and Java code.

Thus in the Fall of 2015 we made the decision to re-implement the entire server in Spring. This blog will cover more about this migration (which consumed the majority of my 2016) in later posts, but we’ll step over that in a sentence for now.

With the server up and running with partial functionality we successfully ran QA under a feature flag and migrated some internal domains onto the new backend. However, we still could not make ourselves fully confident that the new backend had full parity, and for good reason. The problem space for the server is simply massive.

Our product, CommCare, allows users to build large forms that can track and share cases and manage supply inventories. These forms can have about thirty types of questions, repeat groups, complex branching logic, multiple languages, multimedia, and umpteen other customizations. The server would be responsible for managing every aspect of using these forms – displaying the forms, performing the form and case transactions, managing the user’s database, etc.

All of this means that the complexity and variability of the inputs and outputs is massive and fully comprehensive testing is impossible. Ideally, we would test the feature with real data at product scale with no impact on the user experience. Enter Laboratory.

The Idea

When migrating from an old code path to a new one we add a toggle that allow us to route the function calls to the new path for QA and test projects without impacting live projects at all. Once we’re reasonably assured about the stability and correctness of the new path we can begin gradually migrating domains onto the new one – keeping an eye out for errors – until the old path is completely deprecated.

In Laboratory instead of choosing between the old and new code and the join point we run both in an ‘experiment’ abstraction and then compare the results for parity and timing. Then we return the result of the old path (including any exceptions thrown) so that the user experience is unchanged.

Implementation

Our code was well suited for a Laboratory experiment. All of our calls to the old backend were routed through a single function. So, inserting an experiment here was quite straightforward:

def perform_experiment(d, auth, content_type):
    experiment = FormplayerExperiment(name=d["action"], context={'request': d})
    with experiment.control() as c:
        c.record(post_data_helper(d, auth, content_type, settings.XFORMS_PLAYER_URL))
    with experiment.candidate() as c:
        # If we should already have a session, look up its ID in the experiment mapping. it better be there.
        if "session_id" in d:
            d["session_id"] = FormplayerExperiment.session_id_mapping.get(d["session_id"])
        c.record(post_data_helper(d, auth, content_type, settings.FORMPLAYER_URL + "/" + d["action"]))
    objects = experiment.run()
    return objects

First we create an instance of our laboratory.Experiment subclass (more on that later). Next, we make the post_data call to our old backend as before:

c.record(post_data_helper(d, auth, content_type, settings.XFORMS_PLAYER_URL))

As the experiment control. Next, we run the new call as the experiment candidate

c.record(post_data_helper(d, auth, content_type, settings.FORMPLAYER_URL + &amp;amp;amp;amp;amp;amp;quot;/&amp;amp;amp;amp;amp;amp;quot; + d[&amp;amp;amp;amp;amp;amp;quot;action&amp;amp;amp;amp;amp;amp;quot;]))

(The slightly different URL syntax is due to a difference in our old and new servers)

Finally, we “run” the experiment which performs the comparison and calls back to a publish() function that we define in our Experiment subclass.

The lines that I skipped above hint at one of the two major problems we had to solve in using Laboratory. First, during form entry we use session tokens to track the user’s state on the backend to avoid sending the entire form state on each request. This obviously presents a major difficulty in comparing the calls: namely, each browser request would contain a session key corresponding to some session object that exists on the old server, but not the new one where the record would have a totally different key. We resolve this issue by storing a mapping from the ‘control’ session ID to the corresponding ‘candidate’ session ID each time we create a new form:


# if we're starting a new form, we need to store the mapping between session_ids so we can use later
if (self.name == "new-form"):
    control_session_id = json.loads(result.control.value)["session_id"]
    candidate_session_id = json.loads(result.observations[0].value)["session_id"]
    FormplayerExperiment.session_id_mapping[control_session_id] = candidate_session_id

Then, every time we’re going to make a request to the candidate server we replace the old session_id key with its corresponding new one.

if "session_id" in d:
d["session_id"] = FormplayerExperiment.session_id_mapping.get(d["session_id"])

This allows us to manipulate the two sessions in parallel and continually compare the result. The final component of our experiment is the actual Experiment subclass:

class FormplayerExperiment(laboratory.Experiment):
    session_id_mapping = {}

    def publish(self, result):
        # if we're starting a new form, we need to store the mapping between session_ids so we can use later
if (self.name == "new-form"):
    control_session_id = json.loads(result.control.value)["session_id"]
    candidate_session_id = json.loads(result.observations[0].value)["session_id"]
    FormplayerExperiment.session_id_mapping[control_session_id] = candidate_session_id

In our case, we’re overriding the publish() function which is called when we have the result of our experiment and want to determine what information to store and, in our case, adding new session ID mappings if necessary. In order we are:

  • Adding a new session ID mapping when we’ve performed a new-form action
  • Logging the timing of the control and candidate
  • Comparing the values of the control and candidate and logging the difference if one exists

This last point ended up being the second tricky part of this process. We ended up with a handful of small edge cases that broke the equality checking. In some cases this was a difference in platforms  – booleans ended up as ‘1’ and 0′ in JSON output from Python, while they were ‘True’ and ‘False’ in Java. Further, we needed to ignore the SessionID and some other outputs like random ID’s from forms that would be necessary different. Finally all these values were inside a large JSON tree that we wanted to ensure were the same structure in both cases.

Output

Parity

When we turned this on we immediately got actionable data. For example, one request generated the diff:

2016-06-08 22:17:53,675, new-form, {u'lang': u'en', u'session-id': None, u'domain': u'test2', u'session-data': {u'username': u'test', u'additional_filters': {u'footprint': True}, u'domain': u'test2', u'user_id': u'aa8c9268917a7c48fee7d30f3b12df92', u'app_id': u'8686d27973591f96f0ea0da8cd891323', u'user_data': …}

2016-06-08 22:17:53,674 INFO Mismatch with key ix between control 0J and candidate 0_-10

2016-06-08 22:17:53,674 INFO Mismatch with key relevant between control 1 and candidate 0

2016-06-08 22:17:53,674 INFO Mismatch with key header between control question1 and candidate None

2016-06-08 22:17:53,675 INFO Mismatch with key add-choice between control None - Add question1 and candidate None

The first line tells us this was a new-form request; the last three lines tell us that we weren’t generating a repeated question properly in this form. The second line tells us why – we weren’t properly encoding the index of this question in the form. This issue arose only arose under a specific ordering of question groups so we missed it during QA and likely would have introduced this bug “in the wild” if not for Laboratory.

Timing

For each call we also generated a comma separated list of values: the time of the request, the request type, and the control and experiment execution times:

2016-05-24 10:08:41,182, answer, 0:00:00.052370, 0:00:00.024141
2016-05-24 12:32:55,501, get-instance, 0:00:00.031898, 0:00:00.019106
2016-05-24 12:33:14,427, answer, 0:00:00.067201, 0:00:00.028408
2016-05-24 12:33:15,003, get-instance, 0:00:00.026185, 0:00:00.019381
2016-05-24 12:39:09,586, new-form, 0:00:03.156963, 0:00:02.310486
2016-05-24 12:39:36,425, get-instance, 0:00:00.022434, 0:00:00.016840
2016-05-24 12:47:26,084, evaluate-xpath, 0:00:05.037198, 0:00:00.016030
2016-05-24 12:51:47,146, evaluate-xpath, 0:00:00.045509, 0:00:00.018033
2016-05-24 12:52:02,801, evaluate-xpath, 0:00:00.033657, 0:00:00.023034
2016-05-24 15:55:59,949, new-form, 0:00:05.362653, 0:00:02.199996
2016-05-24 16:11:53,493, answer, 0:00:00.052868, 0:00:05.020280
2016-05-24 16:11:54,067, get-instance, 0:00:00.023180, 0:00:00.017108
2016-05-24 16:12:01,222, get-instance, 0:00:00.023897, 0:00:00.011998

Pulling this data into a (very rudimentary) graph we were able to validate the performance gains we’d expected from moving to Spring:

lab.png

Takeaways

We viewed our experiment with Laboratory as a huge success. We not only caught bugs missed by QA that we likely would have released otherwise but also validated the performance gains we’d noticed anecdotally previously. Even if we’d found the equality comparison too unwieldy to implement, logging only the timing with a stripped down experiment would have had huge value as well.

One major advantage we had was that all calls were routed through the extant post_data method so that we could test all functions with minimal impact on the code. The largest obstacle was the parallel sessions, though the solution ended up being quite elegant.

Laboratory itself was very usable – the concept and implementation is straightforward and plugging in code where we needed to overwrite the default behavior was easy. Our biggest feature request would be the ability to make the candidate request and publishing asynchronously from the user-visible control call and return. Fears that blocking on our candidate function to return would cause performance issues led us to limit the rollout at first. We’ve now implemented our second experiment and will continue look for opportunities to experiment further.