Java String hashCode() Collision

Another week on interrupt duty, another Javarosa engine deep dive.

This morning we received a ticket reporting that a user’s form was pre-populating some questions with seemingly random answers. Further, some answer sequences would cause the form to skip entire question groups – even those that didn’t have display logic. My first thought was that the form was constructed improperly, but inspecting the XML everything seemed hunky dory. Still, I was able to reproduce the bug immediately using our command line tool, indicating a bug in our core Javarosa library (rather than in our Android wrapper).

Some background: CommCare’s forms are written to the XForm spec. These forms work by defining and populating an XML data model with logic and UI elements defined in the form’s data bindings. In CommCare, users construct this forms in our Form Builder web application and then fill out the forms on our mobile application, submitting the XML blocks back to the server. When users fill out forms, you can view this XML instance being populated. In the user’s form definition the (blank) data instance looks like this:

<H_food_type_repeat jr:template="">
  <H_calculate_food_type_id/>
  <calculate_food_type/>
  <H2a/>
  <H2b/>
  <H2c/>
  <H3A/>
  <H3B/>
</H_food_type_repeat>

But when I inspected the data instance from within the application the XML appeared differently:

 <H_food_type_repeat>
    <H_calculate_food_type_id>1</H_calculate_food_type_id>
    <calculate_food_type>Cabbage</calculate_food_type>
    <H2a/>
    <H2b/>
    <H2c/>
    <H3A/>
    <H2a/>
  </H_food_type_repeat>

Notice that the node appears twice; not only different from the form definition, but clearly illegal XML. Even more strangely, when I stepped into the code that builds that constructs this instance the XML was generated correctly – with the node intact!

At this point I began getting flashbacks to the infinity bug and accordingly suspected caching. Our engine is chock-full of caching and interning optimizations that greatly improve our performance but can lead to some gnarly bugs. “H2a” and H3B” also seem like dangerously similar variables names – a hash collision waiting to happen.

Sure enough the hashCode() function for these nodes returns identical values. This caused a collision at a critical junction when we were retrieving these objects so that we’d retrieve the node “H2a” when attempting to retrieve “H3B”. This caused us to replace H3B with H2a in the instance structure as well as to insert the answer to H2a into the node H3B.

At this point the fix was trivial (in this case we simply removed the optimization and performed a full equality check instead of just comparing hash values). However we were curious how this happened given that the probability of a hash collision with a good function across a 32-bit hash space should be exceedingly small.

public int hashCode() {
    int h = hash;
    if (h == 0 && value.length > 0) {
        char val[] = value;

        for (int i = 0; i < value.length; i++) {
            h = 31 * h + val[i];
        }
        hash = h;
    }
    return h;
}

One thing immediately jumps out – this function looks way too simple! Summarily, the character at position 0 is added directly, the character at position 1 is multiplied by 31, the character at position 2 by 31^2, etc. This can result in strings having equal hash codes fairly easily, particularly for shorter strings.

For example, if given two Strings X and Y of equal length and the ASCII values of the characters at position N differ by 1 (for example the characters ‘a’ and ‘b’ or, in this case, ‘2’ and ‘3’) then the ASCII values of the characters at position N+1 need only differ by 31 (in this case ‘a’ and ‘B’) in order for the hashCodes two strings to be completely identical! this is more common than you might expect because the ASCII Values of the upper and lower case versions of a letter differ by 32; so, int(‘A’) = 65 and int(‘a’) = 97. Given this its surprising a multiplier of 31 was chosen; however, some research reveals some interesting reasoning behind this:

The value 31 was chosen because it is an odd prime. If it were even and the multiplication overflowed, information would be lost, as multiplication by 2 is equivalent to shifting. The advantage of using a prime is less clear, but it is traditional. A nice property of 31 is that the multiplication can be replaced by a shift and a subtraction for better performance: 31 * i == (i << 5) - i. Modern VMs do this sort of optimization automatically.

For large strings such a collision won’t be likely – most of our user’s questions have ID’s such as “mother_name” or “hiv_status” that won’t likely have collisions. However, longer clinical forms very often have question series named A1, A2… G12, etc. Given this, I’m actually surprised we hadn’t run into this issue before.

Again harking back to the Infinity bug, the core issue here was our mis-using canned Java functionality. The Java String hashcode() function was designed with HashMap keys in mind. In this application the hashing function’s speed is the priority and collisions aren’t a huge concern. However, in our setup hash collisions caused catastrophic failure. We should have used a hashing function with a more reliable distribution if we’d been intent on this setup.

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out /  Change )

Twitter picture

You are commenting using your Twitter account. Log Out /  Change )

Facebook photo

You are commenting using your Facebook account. Log Out /  Change )

Connecting to %s