In a previous post, I covered managing technical debt at early / fast-growing startups. As companies grow – in head count, in users, in infrastructure, in complexity – those same strategies become insufficient. In this post, let's look at how to operationalize the management of technical debt at scale.
I'm using the term "technical debt" very loosely here; in the context of this post, the "baggage" is basically anything that's not directly contributing to the top/bottom line. Here are just some examples of situations you may find yourself in:
- Any number of projects that are important, but not urgent (like investing in that better logging pipeline your engineers have been wanting to do)
- Managing physical infrastructure? The 3-year hardware refresh cycle will force constant service migrations
- If it's not hardware, it might be OS upgrades (anyone else have trauma from Ubuntu 12.04 to Ubuntu 16.04 LTS upgrades?)
- Or it might be hardware bugs (cpu.fail)
- Or firmware bugs
- Or bugs / performance issues anywhere up and down your stack
- Or a legacy codebase undergoing a years-long "rewrite"
- Or a multi-quarter project to move away from a single DB instance to some distributed solution
- Or migrating to a new build / CI / CD system or unit-testing framework
- Or switching to the latest front-end development stack
You get the idea. At scale, you design software / infrastructure / systems with the expectation that failures are the norm, not the exception. In the same way, you must design your organization and processes with the expectation that there is a constant churn of technical-debt-related projects at all times.
So, how do you do it? Here are some suggestions.
Build out a TPM organization
TPM in this context stands for Technical Program Manager. TPMs are usually highly organized, excellent communicators who create organizational leverage by designing processes and running programs. Any large-scale migration / technical-debt project will likely require a lot of communication and coordination across teams and functions, over months or even quarters. Having a dedicated TPM on such projects is crucial for success.
Further reading: check out this TPM learning path on Lynda.
Build Reliability Muscle
Along similar lines, invest in building organizational muscle for Reliability. Specifically:
- A well-designed and practical incident response process. PagerDuty's documentation is a good starting point, as is the "Managing Incidents" chapter from the Google SRE book.
- Regular education about said process, including training for roles such as Incident Manager and for on-call rotations.
- Empower and enable teams to customize and set up their own on-call rotations.
- Educate and align the leadership team on the above processes, especially when you need to mobilize resources from across the company in a pinch. For instance, at many tech companies (e.g. Google, Dropbox), a special "Code Red" is invoked when there's an imminent material risk to the business.
At scale, your company will likely find itself in the middle of several major migrations at any given point in time. It's vital not to make any single migration "a big deal"; instead, invest in making migrations just a normal part of engineering.
- Create consistent narratives: articulate the business impact of the migration. Why does it matter and why does it need to happen now? Make trade-offs explicit.
- Make them easy: invest in documentation, tooling and automation. Migrations are far easier to scale if teams are empowered to do them on their own.
- Create accountability: for complex migrations, make a single team explicitly accountable for driving them to completion. Leverage TPMs for coordination and status updates. One of the longest successful migrations I saw lasted almost two years! Without clear accountability, such long migrations would languish forever.
- Recognize impact: you should not be undertaking any migration whose business impact is unclear. Conversely, reward migrations by celebrating victories (including significant intermediate milestones) and reminding the org of the business impact upon completion. Follow through with recognizing and rewarding this impact in your performance reviews – people notice who gets promoted and if there's even a suspicion that only folks working on shiny new features get promoted, you'll have some perverse incentives to correct.
- Budget for them: make sure your planning processes account for planned and unexpected migrations. Everything from OKRs to headcount planning to budget (capex / opex) should explicitly factor in migration work.
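To make the "tooling" and "accountability" points a bit more concrete, here is a minimal sketch of the kind of lightweight tracker a TPM might use for status updates. It is only an illustration: the `Migration` class, the `status_report` helper, and the example team and service names are all hypothetical, and a real setup would more likely pull this data from your service catalog or ticketing system.

```python
from __future__ import annotations

from dataclasses import dataclass, field


@dataclass
class Migration:
    """One migration, tracked per service. All names here are hypothetical."""
    name: str
    owner_team: str  # the single team accountable for completion
    # service name -> True once that service has been migrated
    services: dict[str, bool] = field(default_factory=dict)

    def progress(self) -> float:
        """Fraction of services migrated, between 0.0 and 1.0."""
        if not self.services:
            return 0.0
        return sum(self.services.values()) / len(self.services)

    def remaining(self) -> list[str]:
        """Services that still need to migrate, in sorted order."""
        return sorted(s for s, done in self.services.items() if not done)


def status_report(migrations: list[Migration]) -> str:
    """Render a short plain-text status update, least complete first."""
    lines = []
    for m in sorted(migrations, key=lambda m: m.progress()):
        lines.append(
            f"{m.name} (owner: {m.owner_team}): {m.progress():.0%} done, "
            f"{len(m.remaining())} service(s) remaining"
        )
    return "\n".join(lines)


# Illustrative data only: a made-up CI migration across four services.
example = Migration(
    name="new-ci-system",
    owner_team="build-infra",
    services={"api": True, "web": False, "billing": False, "search": True},
)
print(status_report([example]))
```

Sorting by progress puts the most-stalled migrations at the top of the report, which is usually where the attention needs to go.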
Further reading: Will Larson's post on Migrations
As the organization grows, the amount of energy devoted to addressing technical debt should hopefully flatten out toward a constant – if you can force the curve down over time, even better!