Luka Kladaric

Chaos Guru
Consultant
Software Architect

Shipping quality software in hostile environments

(Updated February 28, 2024)

This essay is also available as a conference talk recording.

I once had the opportunity to work for a startup that had fallen from tech debt into tech bankruptcy. Although we managed to get it back on the right track, it made me rethink the concept of tech debt and how we ship software - especially in hostile environments.

For me, a hostile environment is any place where software engineering is seen as just the workforce that implements ideas coming from outside - with little to no autonomy. Usually, a product team or a similar function owns 100% of engineering time, and you have to claw back time to work on the core stuff - and that includes tackling tech debt.

I’m sure you’re all familiar with the concept of tech debt, but just so we’re on the same page here, let’s use the definition I like:

Tech debt is the implied cost of choosing an easy solution over a slower, better approach, accumulated over time.

Tech debt examples from daily life

Tech debt is an API that returns a list of results without pagination. You started with twenty results and figured you’d never go over a hundred elements, but three years later you have thousands of elements and a 10 MB response.
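For contrast, pagination doesn’t have to be much work. Here’s a minimal sketch of a cursor-style endpoint - the names and page size are hypothetical, just to show the shape of the fix:

```python
# Illustrative sketch only - field names and page size are hypothetical.
from typing import Optional

PAGE_SIZE = 100  # hard upper bound per response, instead of "return everything"

def list_results(all_results: list[dict], cursor: Optional[int] = None) -> dict:
    """Return one page of results plus a cursor for fetching the next page."""
    start = cursor or 0
    page = all_results[start:start + PAGE_SIZE]
    next_cursor = start + PAGE_SIZE if start + PAGE_SIZE < len(all_results) else None
    return {"results": page, "next_cursor": next_cursor}

# A client keeps passing back next_cursor until it comes back as None.
```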

It’s the fragile code that everything runs through. You dare not touch it, afraid of breaking things or introducing unexpected behaviors.

It’s the parts of the codebase that nobody wants to touch. You know which part I’m talking about. Everyone has at least one. If you get a ticket saying something’s wrong with user messaging and everyone backs away from their desk and starts staring at the floor, that’s a pretty clear sign it needs focused effort.

It’s the entire system that has become too complex to change or deprecate, and you’re stuck with it because no reasonable amount of effort would be enough to fix or replace it.

It’s also the broken tools and processes - and, even worse, a lack of confidence in the build and deploy process. Broken builds should not roll out, and good builds should always deploy successfully.

It’s what you wish you could change but can’t afford to for various reasons.

Where does tech debt even come from?

There are a million reasons why tech debt happens and accrues. It comes from insufficient upfront definition, tight coupling of components, lack of attention to the foundations, and evolution over time. When was the last time you had the full requirements for a project before any implementation work began, and they stayed the same until delivery?

It’s when you see a pull request with a hard-coded credential or an API key and ask for no credentials in the code base - and you get pushback because there’s a cron job from four years ago with a hard-coded SQL connection string, password and all.
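For the record, the fix being asked for is tiny - something along these lines, with the secret supplied by the environment instead of living in the repo (a sketch with hypothetical names):

```python
# Illustrative sketch only - the variable name and connection string are hypothetical.
import os

def database_dsn() -> str:
    """Build the connection string from the environment instead of hard-coding it."""
    password = os.environ.get("DB_PASSWORD")
    if not password:
        raise RuntimeError("DB_PASSWORD is not set - refusing to fall back to a hard-coded value")
    return f"postgresql://app:{password}@db.internal:5432/app"
```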

That’s tech debt - people pushing back on good solutions. There are a million excuses for it: “We’re a tiny startup, we can’t afford to have perfect code”; “We’re still trying to prove the product market fit”; or even “When we make some money, things will be different.”

It may even sound reasonable in the beginning, but you don’t want your golden moment, when your business takes off, to also be the moment of complete technical breakdown.

And it’s not just about having bad code, tools, or processes. It’s also about people learning the bad behavior and then passing it on to the new hires. And if the old hires perpetuating the bad ways outnumber the new hires, the newcomers will have no chance to sway the team towards the light.

What’s the actual harm?

Unaddressed tech debt breeds more tech debt. Once some amount of it is allowed to survive and thrive, people are more likely to contribute lower-quality solutions and be more offended when asked to improve them.

That’s just human nature. If you have a bad neighborhood, you chuck your gum or a cigarette butt on the floor because it’s already littered and nobody cares. If you’re on a nice street with no garbage and no gum stuck to the pavement, you’re going to behave. You shouldn’t allow your code base or systems to become a bad neighborhood because if you’re surrounded by bad neighbors, why should you be any different?

Over time, it becomes a vicious, self-perpetuating cycle. Productivity drops, deadlines slip, and the reasons are manifold. It’s difficult to achieve results with confidence. The cognitive load of carefully treading through the existing jungle to make changes is too high. Quality takes a nosedive, and you start to see a clear separation between the new, clean stuff that just shipped and anything six or more months old, which is instantly terrible.

You end up with no clear root cause for issues or outages, and there’s nothing that you can pinpoint and say: “This is what we should have done six months ago to avoid this situation.”

The morale among tech staff tanks because they are demotivated from working on things that are horrible; they start avoiding any unnecessary improvements and abandon the Boy Scout rule of leaving things better than you found them.

It’s a death-by-a-thousand-cuts situation, one big pile of sadness for people who are supposed to champion new and exciting work.

Tech bankruptcy: A true story

Just so you don’t think I’ve been exaggerating, let me give you a real-life example of a regular startup, looking for the right product to sell to the right people, with a few years of unchecked tech debt. Nobody there was particularly clueless or evil.

Within days of joining them, my alarms start going off. Very nice people all around, clearly talented engineers, hiring the best from the top schools. But the tech stack and the tooling around it are so weird I can’t wrap my head around it.

There’s a massive monolithic 10-gigabyte Git repository hosted in the office. It’s very fast for the office folks, since it’s on the local network, but the company also has remote workers. Some person on a shitty DSL halfway around the world is going to have a very bad day if they buy a new laptop and need to re-clone the repo or something.

There’s no concept of “stable”. Touching anything triggers the rollout of everything because there’s no way for the build server to know what’s affected by any change. The safest thing to do is just roll out the universe, which means one commit takes an hour and a half to deploy. Four commits and your deploy pipeline is full for the workday.

Rollbacks take just as long because there are no real rollbacks. You just commit a fix, and it goes through the same queue. Say you have four commits by 10 AM - that’s your pipeline for the day. But at noon something breaks, and you commit a fix immediately. It’s not going out until the end of the day unless you go into the build server and manually kill all the other jobs in the queue. And then you don’t know what else you’re supposed to deploy because you’ve just killed the deploy jobs for the entire universe.

There is also a handcrafted build server, a Jenkins box hosted in the office, with no record of how it’s provisioned or configured. If something were to happen to it, the way you build software would just be lost. Each job on it is subtly different, even for the same tech: there’s one Android codebase that three apps are built from, and each of them builds in a different way.

No local dev environments exist, so everyone works directly on production systems. This is a great way to ensure people don’t experiment because they’ll get into trouble just for working on legitimate stuff.

People have to use the VPN for everything, even non-technical stuff, like support and product. A VPN failure becomes a long coffee break for the entire company.

Code that has just been written is hitting the master database, and there is no database schema versioning. Changes are made directly on the master database, with honor-system accounting, and half the changes just don’t get recorded because people forget. There is no way to tell what the database looked like a month ago, and consequently no way to have a test or staging environment that matches production.

Half the servers are not deployable from scratch. This almost guarantees that servers that should be identical are different, and you don’t know how, because you have no way to enforce that they stay the same. Or even worse, their deployability is unknown or has never been tested, so you might as well assume it doesn’t work. The code review tool is a bug-ridden, unsupported, limited, self-hosted piece of abandonware.

It’s as if every tool people use to develop software imposes its own limits. Outages become a daily occurrence, the list of causes too long to mention. And individual outages are just not worthy of a postmortem because there’s no reasonable expectation of uptime.

Everyone is focused on shipping features. And you get that because you can’t just refactor eight years of bad decisions. You start approaching the point of rewrites, which are almost always a bad idea. And every time you skip refactoring to push out a feature and say “Just this once”, it’s another step in the wrong direction.

How do you even begin to fix this?

For obvious reasons, I call this state tech bankruptcy. It’s the point where people don’t even know how to move forward. Every task is big because it takes hours just to build up the context and figure out how careful you need to be.

At the time, the infrastructure team was staffed with rebels. They were happy to work in the shadows, with the blessing of a small part of leadership, so I joined their team.

It took us over a year and a half to get to the point where it wasn’t completely terrible. We started by writing everything down - every terrible thing became a ticket. It became a hidden project in our task-tracking system called Monsters Under the Bed, and whenever we’d have a few minutes, we’d open the Monsters, contemplate one of them, and find a novel way to kill it.

The team worked tirelessly to unblock software developers and empower them to ship quality software. Most of the work was done in the shadows, with double accounting for time spent.

The build server was rebuilt from scratch with Ansible, in the cloud, so it could easily be scaled up or migrated. We now had a recipe for the build server - one that knows how our software is built and deployed.

Build and deploy jobs were defined in code, with no editing via the web UI whatsoever. Since they were defined in code, there was inheritance: if two builds differed, you extended the base job and defined only the differences.
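To make that concrete, here’s a toy sketch of jobs-as-code in Python - not our actual CI configuration - where a variant job extends a base job and declares only what’s different:

```python
# Toy sketch of "build jobs as code" - repo URL, tasks, and flavors are made up.
from dataclasses import dataclass

@dataclass
class AndroidBuildJob:
    repo: str = "git@example.com:app/android.git"
    gradle_task: str = "assembleRelease"
    flavor: str = "main"

    def steps(self) -> list[str]:
        """The shell steps a CI job would run, derived from the job's fields."""
        return [
            f"git clone {self.repo}",
            f"./gradlew {self.gradle_task} -Pflavor={self.flavor}",
        ]

@dataclass
class WhiteLabelBuildJob(AndroidBuildJob):
    flavor: str = "whitelabel"  # the only difference from the base job

for job in (AndroidBuildJob(), WhiteLabelBuildJob()):
    print(type(job).__name__, job.steps())
```

The differences between builds become explicit and reviewable instead of hiding in a web UI.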

We split the monolithic repo into 40 smaller ones, and even that first iteration was later chopped into even smaller repos. There had been three proposals for killing that repo, each an all-or-nothing approach that would have required us either to pause all development for a week or to cut our losses, lose all history, and start fresh.

Instead, we built an incremental approach: split out a tiny chunk, pause development for a single team for an hour, and move them to a new repo with their history intact. Infrastructure went first, showing the path toward the light to the other teams. We set up a system where a change triggered a build and deploy only for the affected project, and commit-to-live was measured in seconds, not hours.
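The “only build what’s affected” part boils down to mapping changed paths to projects. A rough sketch of the idea, with a hypothetical project layout:

```python
# Rough sketch - the project-to-path mapping is hypothetical.
PROJECTS = {
    "api": "services/api/",
    "android": "apps/android/",
    "infra": "infra/",
}

def affected_projects(changed_paths: list[str]) -> set[str]:
    """Return the set of projects whose build and deploy should be triggered."""
    affected = set()
    for path in changed_paths:
        for project, prefix in PROJECTS.items():
            if path.startswith(prefix):
                affected.add(project)
    return affected

# A commit touching only services/api/handlers.py rebuilds just the API project.
print(affected_projects(["services/api/handlers.py"]))  # {'api'}
```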

Some teams initially opted out and were allowed to stay in the mono repo. They joined the party a few months later, after seeing what the other teams were doing.

All servers were rebuilt and redeployed with Ansible. This used to be some 80 machines with 20 different roles. We did all this under the guise of upgrading the fleet to Ubuntu 16. Nobody understood what that was or asked how long it would take. But whenever someone asked about a server whose name had changed, we would just say: “Oh, it’s the new Ubuntu 16 box”.

In the background, we wrote fresh Ansible to deploy a server that kind of did what we needed it to do and iterated on it until it could actually do what needed to be done. Then we killed the old hand-woven nonsense and replaced it with our Ansible solution.

We migrated to modern code review software and moved from self-hosted Git to GitHub.

The VPN was no longer needed for day-to-day work. You only had to connect to it for the master database, which nobody had write access to anyway - and it’s not useful for dev work either.

We created local dev environments. There was no more reviewing code that didn’t even build because nobody had dared run it, and no more running untested code against production. There was now a code review process for SQL scripts, and a method of deploying them that kept the dev, test, and production databases in sync.
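That method doesn’t have to be fancy. As a sketch of the shape of it - not the exact tooling we used, with SQLite standing in just to keep the example self-contained - every schema change is a numbered SQL file in the repo, and the database itself records which ones have been applied:

```python
# Minimal schema-versioning sketch - paths and names are made up for illustration.
import sqlite3
from pathlib import Path

def migrate(db_path: str, migrations_dir: str = "migrations") -> None:
    """Apply numbered .sql files in order, recording each one so it only runs once."""
    conn = sqlite3.connect(db_path)
    conn.execute("CREATE TABLE IF NOT EXISTS schema_migrations (name TEXT PRIMARY KEY)")
    applied = {row[0] for row in conn.execute("SELECT name FROM schema_migrations")}
    for path in sorted(Path(migrations_dir).glob("*.sql")):
        if path.name in applied:
            continue  # this migration already ran against this database
        conn.executescript(path.read_text())
        conn.execute("INSERT INTO schema_migrations (name) VALUES (?)", (path.name,))
        conn.commit()
    conn.close()
```

With something like this in place, “what did the database look like a month ago” becomes a question you can answer from version control.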

Job well done. Take the rest of the week off, right?

There are two morals to this story.

One is not to wait for permission to do your job - it’s always easier to beg for forgiveness anyway. If you see something broken, fix it. If you don’t have time to fix it, write it down, but come back when you can steal a minute. And even if it takes months to make progress, it’s worth doing.

The team here was well aware of how broken things were, but they thought that was the best they could do - and it wasn’t. If we had pushed the change as a single massive project, one that would take a year and a significant number of full-time engineers to have any measurable impact, it would never have happened. The company simply couldn’t afford that. Instead, we turned a small team into a red team and just did it.

The other moral is that it should never have been like this. This is not a playbook on tackling tech debt - it was a horrible way to do it, even if we did manage in the end.

A team in the company was directly subverting the established processes because the processes were failing them. And the managers were giving us the thumbs up and protecting us while pretending to the rest of the world they didn’t know what we were doing. It’s just not how things should be done.

Why no one wants to tackle tech debt

Situations like that happen because tech debt work is very difficult to sell. It’s an unmeasurable amount of pain that increases in unmeasurable ways, and if you put in some effort to tackle it, you get unmeasurable gains.

Even the name, tech debt, implies we have control over it - that we can choose to pay it down when it suits us, like a credit card. But it’s not like a credit card, where you can make a payment plan and pay it off in percentages. With tech debt, there’s no number. It’s just a pain index with no upper bound, and it can double by the next time you check your balance.

It’s incredibly difficult to schedule work to address tech debt because nobody explicitly asks for it. Everyone wants something visible and measurable to be achieved, and tackling tech debt directly takes time away from that measurable and visible goal - shipping features.

But to quote a cleaning equipment manufacturer: “If you don’t schedule time for maintenance, your equipment will schedule it for you”. Things that are not regularly maintained will break at the most inopportune time possible.

It’s time for a new approach.

Time to ditch the “tech debt” concept

I recently came across an article that changed how I think about this: Sprints, Marathons, and Root Canals by Gojko Adzic. It suggests that software development is neither a sprint nor a marathon - the standard comparisons.

Both sprints and marathons have a clearly defined end state - you run them out, and they’re done. Software development, on the other hand, is never done. You only stop working on it because of a higher-priority project or if a company shuts down.

His point is that you don’t put basic hygiene on your calendar, like showering or brushing your teeth; it just happens. You can skip it once, but you can’t skip it a dozen times, at least not without consequences. Having to schedule it means something went horribly wrong, like going in for a root canal instead of just brushing your teeth every morning.

Translated to software development, this is sustainability work. It’s not paying down tech debt but making software development sustainable, so you can keep delivering a healthy amount of software regularly.

Instead of pushing for tech debt sprints or tech debt days, sustainability work needs to become a first-tier work item. Like brushing your teeth, you can skip it here and there or delay it for a bit, but the more you skip, the more painful the lesson will be.

Agree on a regular budget for sustainability work for every team or every engineer and tweak it over time. It’s a balance, and there’s no magic number. You don’t have to discuss it with people outside the engineering team. Engineers will know which things they keep stumbling over and what their pain points are. Over time, you can discuss the effects and whether they need more time, or if they can give some back.

This approach doesn’t only help improve your code but also your morale - there’s nothing worse than being told you’re not allowed to address something that’s making your life miserable.

So, let’s ditch the term tech debt, because it sends the wrong message. Let’s call it sustainability work, and make sure it’s scheduled as part of the regular dev process - not something that has to push feature work aside just to end up on the schedule.