Our first architecture at Canvas consisted of three backend services:
- A general-purpose NodeJS service for handling CRUD operations, serving the GraphQL endpoint, and running background jobs.
- A Rust service for processing data queries efficiently, as well as managing the CRDT and websockets that enable concurrent editing.
- Another NodeJS service dedicated to storing and using external secrets - primarily warehouse keys. This was separate from (1) due to security considerations - we did not want this service exposed to the public internet.
This design meant we'd have a services architecture rather than a monolith; on AWS this meant choosing between ECS Fargate and Kubernetes (EKS).
First we built with ECS, since we didn't have experience with either tool and ECS was simpler. We ran ECS for a while but eventually migrated to Kubernetes (k8s) due to a few shortcomings with ECS (note: this was early 2021):
- Most importantly, setting up service-to-service communication was challenging. Calls from our Rust service to our query service needed to be encrypted, load balanced, and never leave the internal VPC. ECS did not have a story for this, though it looks like they've added it since. Kubernetes supported several service meshes - we used Istio (see the sketch after this list).
- We needed NGINX as a reverse proxy, as well as for consistent hashing. Kubernetes with Helm had the best support for managing this as code.
- Setting up DataDog for monitoring was much easier on k8s (using Helm).
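To make the service-mesh point concrete, here's a minimal sketch of the kind of Istio policy involved - not our exact config, and the namespace name is made up - that forces all pod-to-pod traffic in a namespace onto mutual TLS, so calls between services are encrypted without touching application code:

```yaml
# Hypothetical namespace; with the Istio sidecar injected, this single resource
# rejects any plaintext traffic between pods in that namespace.
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: backend
spec:
  mtls:
    mode: STRICT
```

Load balancing and keeping traffic inside the VPC come along with the sidecar proxies; the point is that it's a few lines of config rather than bespoke TLS plumbing.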
I've now spent over three years running multiple clusters. I'd hardly call myself an expert, but I've learned a lot that I think could be helpful for other founders facing the same decisions I did. In no way is this meant as a comprehensive introduction to k8s and Helm - just some greatest hits from our past three years.
Most importantly, the site stays up. Even when we hit edge cases (such as a user loading a table with a column of 16MB JSON rows) that bring down a pod, k8s brings a replacement right back up. This can even be pernicious - you can get away with memory leaks in your code, because Kubernetes will usually kill and replace the pod before it becomes a problem.
Blue-green deployments are easy to set up with k8s. We've been saved by this guardrail multiple times.
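Both of those behaviors come down to a few lines of Deployment config. This is a sketch rather than a real manifest (the names, image, and health endpoint are placeholders, and it shows a plain rolling update rather than a full blue-green setup): the memory limit is what gets a leaky pod killed and replaced, and the zero-unavailability rollout plus readiness probe is the guardrail that keeps a bad release from ever receiving traffic.

```yaml
# Placeholder names and image; plain rolling update shown in place of a full
# blue-green pipeline.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: api
spec:
  replicas: 4
  selector:
    matchLabels:
      app: api
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxUnavailable: 0   # never drop below full capacity during a deploy
      maxSurge: 1
  template:
    metadata:
      labels:
        app: api
    spec:
      containers:
        - name: api
          image: api:v2             # placeholder image tag
          resources:
            requests:
              memory: "512Mi"
            limits:
              memory: "1Gi"         # exceed this and the pod is OOM-killed, then replaced
          readinessProbe:
            httpGet:
              path: /healthz        # hypothetical health endpoint
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 10
```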
The DataDog-on-EKS story is really smooth. Logs from all of our services appear after installing the DataDog agent in the cluster. The prebuilt dashboards and alerts have almost all the information you need.
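For reference, the values file for the DataDog chart is only a few lines. This is a rough sketch, not our production config - the Secret name is made up, and key names can shift between chart versions:

```yaml
# datadog-values.yaml - rough sketch for the datadog/datadog Helm chart.
datadog:
  apiKeyExistingSecret: datadog-api-key   # hypothetical Secret holding the API key
  logs:
    enabled: true
    containerCollectAll: true             # ship logs from every container in the cluster
```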
Helm is excellent - this is the killer app of k8s. You might've noticed that the three bullets in my list above correspond to three Helm charts: load balancer, service mesh, and monitoring. Instead of installing these with some UI or running console commands, you do the configuration in code. You can also use charts to deploy different versions of your applications for different customers.
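The per-customer trick is just Helm's values mechanism. A hypothetical overrides file (all names invented) might look like this, with the same chart deployed once per customer using a different values file:

```yaml
# values-acme.yaml - hypothetical per-customer overrides for a shared app chart.
image:
  tag: "1.42.0"       # pin this customer to a specific release
replicaCount: 2
config:
  featureFlags:
    newEditor: true   # customer-specific feature gate
```

Then something like `helm upgrade --install app-acme ./chart -f values-acme.yaml` deploys that customer's flavor, and the diff between customers lives in version control.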
"SSHing" into any of your pods (really kubectl exec) is easy. The Kubernetes control plane gives you a lot of information. It's nice being able to debug your infrastructure issues in one place.
The biggest drawback is Kubernetes' aggressive release and deprecation schedule. This isn't like Windows, where you can rely on all your old programs working - k8s releases often involve major breakages. I've done most of my Kubernetes learning while reading the Breaking Changes section of release notes and then inspecting my cluster for what's affected. EKS is pretty generous on long-term support, but they charge generously for their generosity.
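For a sense of what these breakages look like (a well-known example, not necessarily one that hit us): the beta Ingress APIs were removed in Kubernetes 1.22, so manifests that had applied cleanly for years suddenly had to be rewritten against the v1 schema.

```yaml
# The v1 Ingress that replaced extensions/v1beta1 and networking.k8s.io/v1beta1
# (both removed in Kubernetes 1.22). Hostname and service name are placeholders.
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: app-ingress
spec:
  ingressClassName: nginx
  rules:
    - host: app.example.com
      http:
        paths:
          - path: /
            pathType: Prefix      # required in v1
            backend:
              service:            # v1beta1 used serviceName/servicePort here
                name: app
                port:
                  number: 80
```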
These breaking changes were often the first time I had to break through some Kubernetes abstraction - and those abstractions hide some truly incredible depths. For example, if you're using a service mesh with HTTPS then you're running a little certificate authority that issues new certificates to new pods. You probably won't need to know about this until your root cert expires! I also didn't learn about the add-on EKS installs on my nodes until the image I was using stopped releasing new versions.
This class of problems reminds me of dependency management with npm or pip. Occasionally some security update will trigger a cascade of version mismatches between packages so far down the stack that you didn't even know you used them. And the only person facing this exact mismatch is you. k8s/Helm is a bit like that, but for infra - so, a lot scarier.
I don't think the Kubernetes secret management story is great, though maybe secrets are always a pain.
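Part of the issue is that the built-in Secret object is just base64-encoded data (the name and key below are made up); base64 is an encoding, not encryption, and encrypting Secrets at rest in etcd is a separate, opt-in step.

```yaml
# A hypothetical Secret: the value is only base64-encoded, not encrypted.
apiVersion: v1
kind: Secret
metadata:
  name: warehouse-credentials
type: Opaque
data:
  api-key: c3VwZXItc2VjcmV0   # base64 of "super-secret"
```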
Finally, it took a long time for the Kubernetes CLI commands to stick in my brain. Maybe that's a me problem.
Looking back, I think I'd make the Kubernetes trade-off again. Spending time upgrading the cluster every few months was the biggest unexpected drawback, but in terms of availability and reliability it exceeded all expectations. I love having all of our infrastructure in code, and adding new services is trivial. But we'll see how I feel after our next upgrade in November.