Shortest Cycle Algorithm

One of the surprises of starting work as a software engineer after college was how little the coding I did professionally resembled the coding I did academically. I can count on one hand the number of times I’ve explicitly used the knowledge about algorithms and logic that I’d spent four hard years acquiring. Most of these “hard” problems are solved for us by libraries, so the closest we usually come is plugging some package in where we need it.

So I was pretty happy when the opportunity recently arose to apply my knowledge of graph algorithms outside of the classroom. At Dimagi we allow users to define the data they want to collect using the XForm specification. Among other things, XForms allow users to define logical dependencies between questions: the value of question A might be a function of the answers to B and C; question D might or might not show up (in the parlance, be relevant) depending on the answer to question E; the answer to question F might be invalid depending on the answer to question G. Because XForms are declarative there is no control flow; all questions are evaluated and re-evaluated until they are in the correct state. This means that cycles in XForms are illegal; they break this evaluation flow.
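For a concrete picture, dependencies like these live on a form's bindings. A hypothetical fragment (the question names here are invented for illustration) might look roughly like:

```xml
<bind nodeset="/data/a" calculate="/data/b + /data/c"/>
<bind nodeset="/data/d" relevant="/data/e = 'yes'"/>
<bind nodeset="/data/f" constraint=". &lt; /data/g"/>
```

Each attribute induces directed edges from the bound question to every question its expression references; it is these edges that can form an illegal cycle.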

Until recently our parser was able to detect cycles and thus declare the XForm illegal at validation time; however, while the algorithm was very fast at detecting a cycle, it could not output the exact path of the cycle, merely report its existence. This was pretty unhelpful to our users, who would sometimes have forms with hundreds of questions and then have to find a needle in a haystack.

After much grumbling from our field team, I decided to tackle this problem at a hackathon. There are plenty of algorithms for accomplishing this, so I chose a simple option found on Stack Overflow. Put simply: perform a depth-first search from every node, keeping a reference to the starting node; if you ever encounter the node you started from, you’ve found a cycle. In our case, nodes would be questions and the directed edges would be the types of references listed above (calculations, relevancy conditions, and validation conditions).
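As a sketch, the approach might look like the following in Python (our real implementation operates on question references as Strings; the dict-based graph representation and node names here are assumptions for illustration):

```python
def find_cycle(graph):
    """Return one cycle as a list of nodes, or None if the graph is acyclic.

    graph: dict mapping each node to an iterable of the nodes it depends on.
    """
    # Run a depth-first search from every node, remembering the start node.
    for start in graph:
        stack = [(start, [start])]  # (current node, path taken to reach it)
        while stack:
            node, path = stack.pop()
            for neighbor in graph.get(node, ()):
                if neighbor == start:
                    # We found our way back to the start: report the full path.
                    return path + [start]
                if neighbor not in path:
                    stack.append((neighbor, path + [neighbor]))
    return None
```

For example, `find_cycle({"a": ["b"], "b": ["c"], "c": ["a"]})` reports the cycle including its exact path, which is precisely what the original existence-only check could not do.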

Easy enough! The algorithm was working fine until recently, when we got a couple of reports of the form-editing page hanging indefinitely. Both forms were very large (over a thousand nodes), and because their questions were clustered in large top-level groups they were highly connected. (This is because if questions B and C are both sub-questions of group A, then their relevancy naturally depends on the relevancy of group A.) We narrowed the culprit down to the validate_form endpoint. When I didn’t find any errors in Sentry, it became clear that the problem was one of efficiency, not incorrectness.

After replicating the bug locally, setting some breakpoints, and observing the logic flow, I quickly homed in on the problem: repeated work. Because questions could be nested in five or ten layers of groups, the connectedness of the graph was massive. Further, if ten questions sat in a sub-group that was itself nested in four or five sub-groups, those ten questions would each compute almost identical dependency graphs. We were repeatedly walking the same walk.

The solution, as always, was caching. When we completed a depth-first search from a node, we saved the set of nodes reached from it – the nodes in that node’s sub-graph. Then, before starting a DFS from a new node, we would check whether we had already cached its reachable set. If so, and that set did not contain the node we were currently searching for cycles from, we knew that a cycle was not possible from this sub-search and we could short-circuit.
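A minimal Python sketch of this cached search (again, the dict-based graph representation and node names are assumptions, not our actual data model):

```python
def find_cycle_cached(graph):
    """Return one cycle as a list of nodes, or None, caching reachability.

    graph: dict mapping each node to an iterable of the nodes it depends on.
    """
    reachable = {}  # node -> full set of nodes reachable from it

    for start in graph:
        visited = set()
        stack = [(start, [start])]
        while stack:
            node, path = stack.pop()
            # Short circuit: if we already cached everything reachable from
            # `node` and `start` is not among it, no cycle through `start`
            # can pass through `node` -- skip re-walking its sub-graph.
            if node in reachable and start not in reachable[node]:
                visited |= reachable[node]
                continue
            for neighbor in graph.get(node, ()):
                if neighbor == start:
                    return path + [start]
                visited.add(neighbor)
                if neighbor not in path:
                    stack.append((neighbor, path + [neighbor]))
        # The search from `start` completed without finding a cycle:
        # cache the set of nodes it reached.
        reachable[start] = visited
    return None
```

The cache is only populated after a search completes without finding a cycle, so each cached set is a complete picture of that node's sub-graph and the short-circuit is sound.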

After deploying this code the issue was immediately resolved. You can see the code here; currently it’s quite coupled to our XForm implementation, but the algorithm itself operates only on Strings, so it would be straightforward to repurpose.

One thought I had while implementing this was about the different emphasis placed on caching academically versus professionally. I remember caching being mentioned mostly as an aside in college, discussed thoroughly only in my Operating Systems class, where it was essential to reasonable performance. For the most part, the algorithms classes I took were interested in the correctness and complexity of an algorithm; by that standard, the original algorithm I’d written was completely correct. In the professional world, however, that algorithm was not even close to correct: its slow performance caused a breaking error in our product. So often good caching is essential to our products working at all, let alone efficiently.

There’s another conversation I’ve been having recently about the divergence of academic and vocational coding that I want to write about soon, but that’s for another day!
