One of the most interesting (read: frustrating) parts about managing an Android deployment is the myriad bugs you encounter that are particular to certain devices or OS versions. Most often these bugs are hardware related; for example, external SD card reads won’t work on a certain device. Also common are software issues, where a device won’t respect an OS contract properly. Least common, in my experience, are true Android OS bugs that manifest as reliable hard crashes. These bugs are likely to be fixed and the devices quickly updated to newer, bug-fixed versions. However, in the international development world we often have to make do with older devices that cannot update their OS due either to hardware or network constraints. In these deployments obscure, old bugs linger for years. This is the story of one of those bugs.
On Friday, September 16 (never accept bugs on Fridays) we received an internal bug report from a developer on one of our biggest deployments: one device was consistently failing to look up a case in its case database. All other devices could retrieve the case using the same model and query. The suspect device could look up other cases successfully, but not this one. The device wouldn’t crash but would just behave as though this case did not exist. Perhaps significantly (or perhaps not) this device was the only one running Android 4.X.
At this point I began a diagnostic process:
- Examined the user’s restore payload which contains the user’s assigned cases. The “missing” case was present in the restore payload and formatted correctly to boot.
- Stepped through the code that parses the restore payload into database entries. By all appearances the device reads the XML and writes a database entry for the case.
- Examined the application’s SQLite database directly. The case appeared there and the index looked correct!
At this point I was baffled. The database looked entirely correct, I could perform the query myself and retrieve the case, and the form had no problems – everything worked for almost all devices! Dauntingly, all of this indicated that the bug was somewhere deep in the abyss of our XForm parsing code.
Some historical context. CommCare forms follow the XForm spec and our transactions like restoring users and form submissions are done with XML payloads in the same spirit. XML is decidedly un-hip now but these specs provide us with a robust framework for defining applications and user data.
CommCare was also originally written in J2ME for Java phones and so anticipated running on devices with memory counted in low megabytes. This means our core code is chock-full of techniques for reducing our memory overhead such as caching and interning. This is the world I’d now have to step into!
I could approach the problem from two places: the original write of the case or the lookup before form entry. I decided to start with the read, since by all appearances the entry in the case database was completely correct. Fortunately, around this time I discovered conditional breakpoints, an amazing tool: by making the condition test whether the case’s index was being used, I could step into the code at exactly the point where we began searching for the case.
Once I started looking around this code I noticed that at some point variables which should have contained the case id were instead set to Double.POSITIVE_INFINITY. Triangulating this eventually led me to the function here where we attempt to cast the String lookup key (in this case the case’s id) to a Double if possible, since this gives us more performant equality and comparison operations. If we determine that the String is a double we also add it to a parsing cache that we use for subsequent lookups. To determine if the String is a double we use the built-in Java function:
Double ret = Double.parseDouble(attrValue);
That call ended up being the source of the bug. The function was accepting a hexadecimal string (the ID quoted below) and saying this was a valid double: Infinity, to be precise! The trouble, as always, lies in scientific notation. Quoting from my own pull request:
Unfortunately, until version 4.4 Android had a bug with the parseDouble function that you can read about here: https://android-review.googlesource.com/#/c/102376/. Because Java doubles use the exponential format MeP, the “e” character could be in a valid double in certain circumstances. In the case of the bug above, the fixture ID was 1d9098a0c23a0c83740547dd946e378
So, the bug would parse the trailing 946e3783 as a valid exponential with mantissa even though the rest of the string was complete garbage. This would evaluate to infinity (or Double.POSITIVE_INFINITY to be precise) and then be stored in the cache table as a valid double mapping.
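To make the two behaviors concrete, here is a small sketch of what a correct Java runtime does with these inputs. The class and method names (SuffixDemo, tryParse) are hypothetical and exist only for illustration. The second input is a shortened hex-like string of my own, not the real ID.

```java
// Hypothetical demo class, not part of CommCare.
final class SuffixDemo {
    /** Parses with the standard library and reports failure as null instead of throwing. */
    static Double tryParse(String s) {
        try {
            return Double.parseDouble(s);
        } catch (NumberFormatException e) {
            // A compliant runtime lands here for strings with stray hex letters.
            return null;
        }
    }
}
```

On a compliant runtime, SuffixDemo.tryParse("946e3783") returns Infinity, because 946 × 10^3783 is far beyond the range of a double, while SuffixDemo.tryParse("dd946e3783") returns null because the leading letters make the whole string invalid. The affected Android versions effectively gave the first answer for the second kind of input.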
So, we stored this key’s value as being Infinity and thus later lookups for this string literal would fail. Worse still, CommCare had no idea anything had failed – from the perspective of the application all had gone well and we’d successfully cached a double, saving precious memory. Our solution was to simply check for scientific notation before parsing the double – functionality we already had, since XPath doesn’t recognize scientific notation.
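A guard of that kind could look roughly like the sketch below. This is not the actual CommCare check; StrictDouble and parseOrNull are hypothetical names, and the pattern is my reading of XPath 1.0 number syntax (an optional minus sign, digits with an optional fraction, and no exponent).

```java
import java.util.regex.Pattern;

// Hypothetical helper, not the actual CommCare implementation.
final class StrictDouble {
    // Plain decimals only: "42", "-3.5", ".5", "12." are accepted; anything with
    // an exponent or stray letters is rejected before parseDouble ever sees it.
    private static final Pattern PLAIN_DECIMAL =
            Pattern.compile("-?(\\d+(\\.\\d*)?|\\.\\d+)");

    /** Returns the parsed value, or null when the input is not a plain decimal. */
    static Double parseOrNull(String s) {
        if (s == null) {
            return null;
        }
        String trimmed = s.trim();
        if (!PLAIN_DECIMAL.matcher(trimmed).matches()) {
            return null;
        }
        return Double.parseDouble(trimmed);
    }
}
```

Because the pattern has to match the entire string, the result no longer depends on how a particular platform's parseDouble treats malformed input.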
A few lessons learned. Clearly, using the built-in parseDouble function was a mistake since Java’s and XPath’s ideas of “what is a Double” diverge in this respect. We were using a function that would aggressively attempt to parse a String to a double, when really we wanted a much stricter function, since a false positive is far more costly than a false negative. We also learned that bugs can lie dormant for years before rearing their heads.
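The asymmetry between false positives and false negatives is easy to see in a toy version of a lookup-key cache. The sketch below is hypothetical (KeyCache and normalize are made-up names, and the parser is passed in as a function purely so both behaviors can be compared); it is not how CommCare's cache is actually written.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.function.Function;

// Hypothetical lookup-key cache, for illustration only.
final class KeyCache {
    private final Map<String, Object> cache = new HashMap<>();
    // Assumed contract: returns null when the key is "not a number".
    private final Function<String, Double> parser;

    KeyCache(Function<String, Double> parser) {
        this.parser = parser;
    }

    /** Returns a Double when the parser accepts the key, otherwise the original String. */
    Object normalize(String key) {
        return cache.computeIfAbsent(key, k -> {
            Double d = parser.apply(k);
            return d != null ? (Object) d : k;
        });
    }
}
```

With an over-eager parser, two different IDs can both normalize to Infinity and become indistinguishable, so a lookup silently matches the wrong thing or nothing at all. With a strict parser, the worst case is that a genuine number stays a String and we lose a small optimization.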
However, my primary takeaway from this was laughter, since the bug was both hilariously obscure and adorably typical. Android developers are used to encountering strange, device-specific bugs since every device and OS version is a special snowflake. This one was so emblematic that I could only laugh.