Immutable data patterns

February 16, 2024

Before starting Canvas I never thought much about the data team. Engineers designed the data model to meet our business needs and technical requirements, then data teams tried to reverse engineer the data model to meet theirs. They occasionally asked us to explain some enum or beg us not to drop some column, but otherwise there was practically a Chinese wall between us. Software engineers had the keys and the data team took what they could get.

Now that I'm frequently on the other side of this wall, I self-motivatedly can see how unhelpful this arrangement was. The products created by data teams are often as valuable as those created by product engineers. They're essential to understanding your business. Helping the data team work better is good business practice. But I also think its good engineer practice. Most of the patterns that help data teams build analysis will help software teams improve their auditability and recoverability.

Updating fields

Every table update deletes the previous value for the rest of time. For most columns this is fine, for important fields this should never be done. For example:

  1. If the user increases the number of seats they've purchased for you want the ability to see when this happened as well as the previous value (to prorate payments)
  2. If the user adds to a list of white listed IP addresses, you need to retain the list of old values for compliance purposes
  3. If a user updates their email address, you want to keep a record of the old one since everyone - including you - relies on that email as a unique identifier somewhere

Keeping (1) is just plain useful in your code. Keeping (2) might be useful in your code (if the user wants to revert, for example), but at the very least you need it somewhere persistent for compliance. Keeping (3) might never be helpful, but its better to have and storage is cheap.

What are some of the patterns for preserving old values?

Add a new row

The simplest of all, when you want to update a value, just add an entirely new row, copying all the columns over except the one you updated. Keep a created_at column on each row, and when referencing the row always pull the latest by created_at (and index this column).

The main drawback on this one is that when you create a new row you're have to update all the foreign keys pointing to it as well. You can get around this by not using database constraints or database IDs and instead having the relationship be keyed on a separate unique ID that's repeated between all rows; then you can do the lookup in code for the most recent with that unique ID.

Add a property change event table

Create an audit table of property change events. This should have a few fields: TABLE_NAME, ROW_ID, COLUMN_NAME, PREVIOUS_VALUE, NEW_VALUE. The idea here is that you should be able to reconstruct the row at any point in time by apply the change events up until that time. You can include other information like the user or page that made the change for added auditability.

Audit columns

Add created_at and updated_at columns to all of your tables. Postgres supports created_at automatically. You might need to do some work for updated_at.

Enums

Use strings instead of numbers.

Naming

Incorrectly cased table and column names are a massive, massive pain on the ass on the warehouse side where there aren't ORMs (or really any abstractions) to protect you from manual string escaping.

Naming columns well - erring on the side of verbosity - is helpful for humans and (increasingly) LLMs trying to understand a table from the schema.

Datetimes

Keep it all UTC, please! Timezones are one of the gnarliest persistent head scratchers coders face. This is doubly true in most SQL dialects.