Luka Kladaric

Chaos Guru
Consultant
Software Architect

The Platform is Dead; Long Live the Platform

(Updated February 15, 2024)

This essay is also available as a conference recording you can watch here →

If your team is working on just one piece of software, sure, how it lives online is simply part of the deal, right? No one really thinks of it as its own thing. But if you’re cranking out a bunch of different apps or services, like all the cool kids these days, chances are you’ve come up with some standard ways to handle the whole “how does this thing even exist on the internet” problem. And hey, congrats - you’ve basically built your own internal developer platform.

This internal developer platform (IDP) is the shared playground where all your engineering teams host their backend apps, websites, and whatever else they’re building.

Back in 2019, my team and I put together an IDP, and let me tell you: lessons were learned.

The Old Way

My team spent most of 2017 cleaning up a home-grown setup for running about a dozen Python and Java applications, standardizing it, and converting everything to config-as-code so that no tribal knowledge had to be passed down to new teammates. We standardized on Ansible, Jenkins, HAProxy, Nginx, and Icinga.

Ansible was our config-as-code tool of choice. We declaratively described how we wanted our servers configured, which software needed to be installed, and so on. But we also abused it for actual deploys, to get new versions of code out to servers.
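For a sense of what that looked like, here is a minimal sketch of a declarative playbook in that spirit (the role and variable names are hypothetical, not our actual codebase):

```yaml
# site.yml - declare what an app server should look like (hypothetical roles and vars)
- hosts: app_servers
  become: true
  vars:
    app_name: billing-api        # hypothetical application
    app_port: 8080
    python_version: "2.7"        # the platform originally assumed Python 2
  roles:
    - common                     # base packages, users, monitoring agent
    - nginx                      # per-box routing to the right application
    - python_app                 # virtualenv, service unit, code checkout
```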

We used Jenkins for CI/CD, building artifacts and shipping them out, along with configuration changes, using Ansible. The build jobs were defined in code, living in a dedicated Git repository. This allowed us to define a single way of building one type of thing (Android apps, for instance) and then extend from that, overriding only the specifics of a single application. It ensured maximum code reuse and minimal divergence.

Sounds great, right?

I didn’t come up with it, but I contributed a lot and loved it.

The catch

It was indeed great if you were familiar with it. But if you just wanted to jump in to find something and tweak it slightly, it was impossibly complex.

Because we hosted multiple Python apps on a single box, each box also ran nginx to route requests to the correct application. As a result, adding new applications, or new hostnames for existing applications, was an incredibly involved procedure that required reconfiguring both HAProxy and nginx. It was all defined in code, but those deploys were purposefully not automated and were always run manually, step by step.
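To give a feel for why, the routing data lived in (at least) two places, roughly like this (file and variable names are hypothetical):

```yaml
# group_vars/loadbalancers.yml - HAProxy has to know which box owns the hostname
haproxy_backends:
  - hostname: api.example.com
    backend: python-box-1
  - hostname: newapp.example.com     # step 1: point the new hostname at a box
    backend: python-box-2

# group_vars/python-box-2.yml - nginx on that box needs a matching vhost
nginx_vhosts:
  - server_name: newapp.example.com  # step 2: point the hostname at the right app
    upstream_port: 8082
```

Both layers had to be changed and rolled out by hand, in the right order.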

Icinga was our primary internal monitoring solution. It was also auto-configured from Ansible, stayed mostly out of the way, and did the job with very little complaining.

And the platform was good. It did its job and let us grow. We added applications to it mostly without any trouble. There was occasionally some manual work for the ops team, mostly around scaling.

We trucked along happily, without significant hiccups, for about a year, until we got the first request that made us go: “Yeah, we don’t really support that.”

The first signs of trouble

The request was Python 3.

Now, that shouldn’t have been difficult, but it was. We made it happen nonetheless, and a colleague figured out a way for an app to run either Python 2 or Python 3 on the same box with just a few lines of configuration changed.

Then someone asked for Golang, and we were just stumped. Our entire pipeline was built for Python or Java, with the Java pipeline completely different from the Python one. To support Golang, we would have had to build a third, completely different deployment pipeline. And the next day, someone would ask for a fourth thing.

This was also roughly the first time that I heard Kubernetes suggested as a solution, mainly because everyone else was doing it. But I still truly liked what we had, and I didn’t want to throw it out, certainly not for something like Kubernetes, which would require replacing everything we already had with fairly similar components.

Not to mention, we had no experience or expertise with Kubernetes.

Time to evolve

So what actually broke our platform? It wasn’t Python 3, it wasn’t Golang - it was the background workers. It was a perfectly reasonable request from the engineers of an existing Python application.

We didn’t have anything like it, so we spent way too much time building support for it. It broke all of our assumptions about the platform. The platform that had seemed so efficient grew into a monstrosity, and years later, I still hate going into that kind of pipeline and Ansible codebase to do anything. It became apparent that we had reached the end of the road with this platform.

Our mistake was that each thing had its own deploy pipeline. The complexities of each application, each different type, stretched from the source code all the way to the end of the pipeline. So when something was weird in a way the existing pipelines didn’t cover, it needed its own weird pipeline all the way through. We obviously had to find a different approach.

The answer: Containers

Containers let you concentrate complexity handling in the build stage, close to where the complexity originates - in the source code. Once you have an image, the deploy is the same, regardless of the stack. You could be shipping Java, Python, Go, Rust, side by side, because it’s all contained in the container. That’s why it’s called containers, after all.

We also wanted to make it possible for other teams in the company to self-manage most, if not all, of the infrastructure for their applications. If they had an idea they wanted to try out, they could at least spin up a dev instance of it without bothering the ops - and without making us the bottleneck.

I spent a few days trying to cram containers into our HAProxy Jenkins setup. The more I looked at it, the more obvious it became that I had to replace our components with more natural fits.

Jenkins was the first thing to go, replaced with Travis. Since Jenkins is self-hosted, it’s either over-provisioned and wasting money, slammed and blocking productivity, or just another thing you must manage autoscaling for. Travis is a much better fit for high-velocity engineering teams because it handles the scaling for you - when you’re not asking it to do anything, it’s not doing anything, and when you ask it to do many things, it scales up and does many things. Not to mention that a Travis file is a lot easier to understand than a Jenkinsfile, if only for the syntax.

Building container images lets each team capture the complexities of their codebase locally, inside their own repository, and fully control how their applications are built and run. So, we built a container image using Travis - but we still needed a way to run it.

Amazon ECS & Fargate

There are several options for that - the aforementioned Kubernetes, a couple of other bad attempts at solving the problem, then a whole lot of nothing, and then Amazon ECS (Elastic Container Service).

I like to present ECS as Kubernetes, but built with AWS primitives - the services we know and love and have been using for years, like Elastic Load Balancer, Route 53 for DNS, etc. In this setup, ELB routes requests to your containers, and Route 53 handles service discovery. It’s all the things we use for other, normal, legacy projects, just used in a slightly different way.

And then there is Fargate, which, for lack of a better definition, is a serverless flavor of ECS.

It lets you run containers without explicitly declaring how many workers you have, how big they are, how many machines there are. If you say “I want to run containers, here’s an image, give me 10 instances of it”, you get 10 instances of it.

You have no idea where they’re running, which is slightly scary, but it’s also awesome because it’s one less thing to worry about. You get to forget about where it’s going to run, about cluster capacity, you get to spin up 10x and spin down in the middle of the night, and not have to care.
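In CloudFormation terms, “here’s an image, give me 10 instances of it” boils down to roughly this (a trimmed, illustrative sketch - names, sizes, and the cluster reference are placeholders):

```yaml
Resources:
  TaskDefinition:
    Type: AWS::ECS::TaskDefinition
    Properties:
      Family: myapp
      RequiresCompatibilities: [FARGATE]
      NetworkMode: awsvpc
      Cpu: "256"                       # Fargate sizes the workers for you
      Memory: "512"
      ContainerDefinitions:
        - Name: myapp
          Image: 123456789012.dkr.ecr.us-east-1.amazonaws.com/myapp:latest
          PortMappings:
            - ContainerPort: 8080
      # (execution role for pulling from ECR omitted for brevity)

  Service:
    Type: AWS::ECS::Service
    Properties:
      Cluster: !Ref Cluster            # placeholder; a Fargate "cluster" is just a namespace
      LaunchType: FARGATE
      DesiredCount: 10                 # "give me 10 instances of it"
      TaskDefinition: !Ref TaskDefinition
      NetworkConfiguration:
        AwsvpcConfiguration:
          Subnets: [subnet-aaaa1111, subnet-bbbb2222]
```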

I’m not an expert on Kubernetes. From what I know, it makes it easy to start building a massively distributed system and very painful to maintain it in production. I dislike the kitchen sink approach, and from what I’ve heard from people who do it on a massive scale, you need the human resources to run a data center before you consider Kubernetes for production use.

I may be 100% wrong, but so far, I haven’t been successfully convinced.

CloudFormation

I should note that the AWS console for ECS, which supports both classic vanilla ECS and Fargate, is very confusing, and it’s safe to say you cannot successfully iterate using the console.

Luckily, you don’t have to, because you can use Amazon’s CloudFormation instead. CloudFormation is infrastructure as code: declarative creation and management of AWS resources.
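If you haven’t seen it before, a CloudFormation template is just a declarative list of resources you want to exist. A minimal, illustrative example:

```yaml
AWSTemplateFormatVersion: "2010-09-09"
Description: Declare a resource; CloudFormation creates, updates, and tracks it
Resources:
  ArtifactBucket:
    Type: AWS::S3::Bucket
    Properties:
      BucketName: example-build-artifacts   # illustrative; bucket names are globally unique
```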

In our setup, CloudFormation lets us deploy the basic shared stack to all our Amazon accounts. VPCs, subnets, gateways, firewall rules, load balancers, and DNS zones are all in one template. If we decide we need another environment (in addition to our production, test, and dev environments), we just create a fourth account and roll that template out.

Everyone knows they can expect the same network setup, layout, and basic utilities in each account. The CloudFormation templates for deploying applications and databases can rely on that foundation being there.

The app template only describes what that application needs. Because the shared stack is reused, app templates are short and contain only the resources necessary for the application’s operation.

Another great thing about the CloudFormation console is that you can see who deployed what and when things changed. And if something no longer needs to exist, you delete its CloudFormation stack from the console, and that removes it entirely, along with all the resources that were provisioned for it, like databases or buckets.

Even Icinga got replaced, with CloudWatch, although it had served us well in the past. I should note that CloudWatch is inferior in configurability, frequency, and resolution, but it has one big upside - it just works and doesn’t need to be hosted.

But how does it work?

The base networking CloudFormation stack contains:

Each account has its own DNS zone (subdomain): we use dev.company.com, prod.company.com, and test.company.com. Any resources that are deployed automatically get DNS records in that zone, so instead of dealing with ec2-34-205-15-150.compute-1.amazonaws.com, your server can be ec2-name.dev.company.com. Instead of your database being called mydatabasecluster.cluster-c1a2b3d4e5f6.us-west-2.rds.amazonaws.com, you can address it as mydatabasecluster-master.dev.company.com. This approach has made it significantly easier to detect misconfiguration due to copypasta.

To support applications that are only internal, there is a private load balancer available that can only be connected to from the internal network, but most applications will attach to the public one. Some applications will even deploy to both, to support both workloads.
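Heavily condensed, the shared stack looks something like this (illustrative resource names; the real template also defines subnets, gateways, and firewall rules):

```yaml
Resources:
  Vpc:
    Type: AWS::EC2::VPC
    Properties:
      CidrBlock: 10.0.0.0/16

  AccountZone:
    Type: AWS::Route53::HostedZone
    Properties:
      Name: dev.company.com                # each account gets its own zone

  PublicLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Scheme: internet-facing
      Subnets: [!Ref PublicSubnetA, !Ref PublicSubnetB]    # subnet resources omitted here

  PrivateLoadBalancer:
    Type: AWS::ElasticLoadBalancingV2::LoadBalancer
    Properties:
      Scheme: internal                     # reachable only from the internal network
      Subnets: [!Ref PrivateSubnetA, !Ref PrivateSubnetB]

Outputs:                                   # exported so application stacks can rely on them
  VpcId:
    Value: !Ref Vpc
    Export: { Name: vpc-id }
  AccountZoneName:
    Value: dev.company.com
    Export: { Name: account-zone-name }
```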

It’s all deployed from GitHub via Travis to all AWS accounts on each commit. Application development teams are then free to pick their own workflow.

The CloudFormation Stack

The CloudFormation template - the stack for an individual application - defines the task definition, service, CPU, and RAM.

It defines a target group, listener rules, log group, and even task role and policies, which is very cool. Each application gets its own task role, and you can attach policies for what it’s allowed to do within Amazon - so if I need an application to read from a bucket, I can just put it straight in the policy, and that’s it.

Every application gets a DNS record called appname.dev.company.com that points to the load balancer that it is attached to. Just by an application getting deployed, it is fully set up to start accepting traffic, with no other infrastructure changes needed. When an application is shut down and destroyed, there are no stray DNS records to clean up.
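Stripped down, the per-application stack adds roughly these pieces on top of the task definition and service shown earlier (an illustrative sketch, not our exact template):

```yaml
Resources:
  TaskRole:
    Type: AWS::IAM::Role
    Properties:
      AssumeRolePolicyDocument:
        Statement:
          - Effect: Allow
            Principal: { Service: ecs-tasks.amazonaws.com }
            Action: sts:AssumeRole
      Policies:
        - PolicyName: app-permissions
          PolicyDocument:
            Statement:
              - Effect: Allow              # "read from a bucket" goes straight into the policy
                Action: s3:GetObject
                Resource: arn:aws:s3:::my-bucket/*

  TargetGroup:
    Type: AWS::ElasticLoadBalancingV2::TargetGroup
    Properties:
      Port: 8080
      Protocol: HTTP
      TargetType: ip
      VpcId: !ImportValue vpc-id           # from the shared networking stack

  ListenerRule:
    Type: AWS::ElasticLoadBalancingV2::ListenerRule
    Properties:
      ListenerArn: !ImportValue public-alb-listener   # listener exported by the shared stack
      Priority: 10
      Conditions:
        - Field: host-header
          HostHeaderConfig: { Values: [appname.dev.company.com] }
      Actions:
        - Type: forward
          TargetGroupArn: !Ref TargetGroup

  DnsRecord:
    Type: AWS::Route53::RecordSet
    Properties:
      HostedZoneName: dev.company.com.
      Name: appname.dev.company.com.
      Type: CNAME
      TTL: "300"
      ResourceRecords:
        - !ImportValue public-alb-dnsname  # points at the load balancer the app attaches to
```

Deleting the stack removes all of these, including the DNS record, which is why nothing stray is left behind.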

Everything is in a single repo - the actual application code, whether it’s Python, Java, Go, or whatever. That also goes for the Dockerfile that explains how the application is built, the CloudFormation template that describes the application’s infrastructure needs, and the Travis file that explains how all of those link together. Which branches are built, deployed, or tested is also stored there.

It’s easy to follow even if you’re seeing the repo for the first time.

Reaping the benefits

The CI/CD flow goes like this: you push to the repo, Travis builds the image and sends it to AWS ECR (Elastic Container Registry), and the next job is the deploy, which runs a CloudFormation deploy.

CloudFormation determines the differences between the current state and the desired state and then executes changes to make the desired state a reality.
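In .travis.yml terms, the whole flow is roughly this (a sketch; the registry, region, and stack names are placeholders):

```yaml
# .travis.yml - build the image, push it to ECR, then let CloudFormation reconcile the rest
# ECR_REPO and ECR_REGISTRY are placeholder environment variables for your registry
services:
  - docker

jobs:
  include:
    - stage: build
      script:
        - docker build -t "$ECR_REPO:$TRAVIS_COMMIT" .
        - aws ecr get-login-password --region us-east-1 |
            docker login --username AWS --password-stdin "$ECR_REGISTRY"
        - docker push "$ECR_REPO:$TRAVIS_COMMIT"
    - stage: deploy
      if: branch = main
      script:
        - aws cloudformation deploy
            --stack-name myapp-dev
            --template-file cloudformation.yml
            --parameter-overrides ImageTag="$TRAVIS_COMMIT"
            --capabilities CAPABILITY_IAM
```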

To do a tl;dr, if you’re just here for the benefits, here they are:

  1. Introducing weird applications and things to the ecosystem doesn’t make things weird for others, unlike my 2017 example.
  2. There’s no self-managed single point of failure.
  3. Adding new environments is simple - we could easily use this to roll out an environment on a different continent.
  4. You can see your pull requests live before you merge. We can spin up a temporary version as soon as a pull request is made. If you have a QA team that clicks through an application before it goes live, they can do that with your instance.
  5. It’s trivial to auto-scale. You just say an application should run five instances, and if the CPU load on them hits 80%, spin up five more - see the sketch just below this list.
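On that last point, the scaling rule is just a couple more resources in the application stack. A rough sketch, using target tracking on CPU (cluster and service names are placeholders):

```yaml
# added under the application stack's Resources: section
ScalableTarget:
  Type: AWS::ApplicationAutoScaling::ScalableTarget
  Properties:
    ServiceNamespace: ecs
    ScalableDimension: ecs:service:DesiredCount
    ResourceId: service/my-cluster/myapp       # placeholder cluster/service names
    MinCapacity: 5
    MaxCapacity: 10

CpuScalingPolicy:
  Type: AWS::ApplicationAutoScaling::ScalingPolicy
  Properties:
    PolicyName: cpu-target-tracking
    PolicyType: TargetTrackingScaling
    ScalingTargetId: !Ref ScalableTarget
    TargetTrackingScalingPolicyConfiguration:
      TargetValue: 80                          # keep average CPU around 80%
      PredefinedMetricSpecification:
        PredefinedMetricType: ECSServiceAverageCPUUtilization
```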

It’s not perfect, but…

Don’t get me wrong, this setup is not without faults (which one is, really?). CloudFormation is extremely fragile in the beginning. It’s a slightly hostile environment, and it takes a lot of trial and error to get to a place where you’re somewhat productive.

But the biggest hurdle was building legacy applications in Docker. I cannot overstate how frustrating it is to take things built in the Wild West and try to cram them into the neat restrictions of Docker. But this would happen with any containerization tech.

Trust me and start building your applications with Docker, even if it’s only for local use.

You’ll thank me when you move to something like this.

5-year retrospective

This platform has been in production use since 2019, and in that time traffic has grown a full 100x, approaching 100,000 requests per second.

Some choices proved excellent, some needed tweaking, and some I regret.

However, after 5 years without a major incident, I’m happy to call the project a massive success, and I look forward to giving a more thorough retrospective in an upcoming talk and article.