Tom Barber (00:00)
Welcome to Engineering Evolved, the podcast for companies that have outgrown their startup chaos but aren't quite ready for enterprise bureaucracy. You're too big for duct tape and hope, but you're too small for six-month governance processes. I'm your host, Tom Barber, and here is what I believe. Technical debt isn't a technology problem disguised as a business problem. It's a business problem that happens to live in your code base. And that distinction changes everything about how you solve it.
Today we're talking about three types of technical debt that actually matter to your CFO. Not the code smells that make engineers uncomfortable, not the architectural debates that rage in your pull requests. We're talking about the technical debt that shows up as missed revenue, burned payroll dollars, and existential business risk. If you've ever watched an engineering leader struggle to explain why the team needs six months to modernize a system,
or if you're tired of hearing, we're moving as fast as we can while competitors ship circles around you, this episode is going to change how you think about these conversations. So let's dig in.
So for this episode, I've done a bit of digging and I came up with a number: technical debt is costing American companies $1.5 trillion every year. And most executives, of course, have no idea that this is even happening. To put this into context, that's more than the GDP of all but 10 countries worldwide. $1.5 trillion in technical debt.
And the problem with technical debt, of course, is that by and large it's an invisible concept. C-suite execs and managers inside organizations don't understand what technical debt is or the problems that come with carrying it. But it's hiding in every organization, and one of the places it hides best is inside your engineering payroll.
It's buried in the deals that you're losing to faster competitors. It's lurking in your risk exposure like a ticking time bomb. And we've seen what happens when that bomb goes off. Here are a couple of examples I'll dig into later in the show. Knight Capital, a company that took 17 years to build, lost $440 million in 45 minutes because of some dead code they left in their systems when they deployed an upgrade. 45 minutes,
and the company was effectively dead. Southwest Airlines in 2022 managed to knock their entire customer booking system offline, which cost them well over a billion dollars and stranded two million people over the holiday period. TSB Bank tried to migrate their systems and the whole thing imploded, costing them hundreds of millions of pounds, losing tens of thousands of customers, and putting both the CEO and the CIO out of their jobs.
Now, of course, these weren't flukes. These weren't unavoidable. They weren't bad luck. They were organizations that decided to punt on the technical debt to save money in the short term, and then paid out ten times that amount when everything collapsed later. So here's my position. It's not subtle. It's not surprising. Technical debt isn't an if problem. It's a when problem. And the question isn't
whether it'll bite you. The question is whether you're going to manage it proactively or pay the full price when it finally explodes. So let's get into it.
I'm sure you've all been there: you've been in meetings with C-suite execs where you're trying to explain a technical problem to non-technical people, and you can see their eyes glaze over within 90 seconds of the meeting starting. And it's not, of course, because CFOs are stupid people. It's because, as technologists, we don't speak a language that they understand.
Trying to explain microservice architectures or DevOps enhancements to a CFO is going to get you almost nowhere. It's like trying to sell someone a car by explaining the alloy properties in the engine. I mean, technically it's accurate. In reality, it's completely useless, because your CFO doesn't care about the technical details. They care about the deals they're going to win,
the money being spent on payroll because you keep hiring more people to try to claw back velocity, and whether your systems are going to fail catastrophically and make the lead story on CNBC. I've seen this at NASA. When I was working at NASA and budget cuts came in, these systems that we built, the systems that we spent time curating, managing, enhancing,
some of them around since the seventies, if you think of Voyager and that type of thing that's still running, would come under pressure. And of course it's trying to figure out who's going to get cut, how many people are going to get reallocated or relieved of their duties, and where these systems are going to continue to run. And when funding got tight, security patches would stop, maintenance would stop, the systems would keep running, but the risk kept growing.
And then eventually they stop running, and people are surprised when they stop running. Every budget cycle, we'd fight for extra investment with technical arguments. We'd talk about end-of-life operating systems and unpatched vulnerabilities, because some of this stuff is still running on eighties and nineties hardware. Executives would nod politely and then fund other things inside the organization that had clearer
business value or could secure more grants and funding down the line. We were speaking engineer when we needed to speak executive. So this is what we're going to approach today. I'm going to show you three types of technical debt, and the language that actually gets budget approved, because technical debt absolutely is a business problem. Don't forget that. We just often don't frame it that way, and this is what we, as engineering leaders, need to get better at doing.
So we'll start with an easy one. The first type of technical debt that matters to your business is revenue blockers. It's technical debt that's stopping your business from actually making money. You've got, for example, customers coming in asking for features. You've got customers or potential customers with deals on the table. You've got marketing opportunities staring you in the face, and you can't
build towards any of that because your existing systems can't support the requests that are coming in. And so every day that debt exists, you're watching the revenue basically walk out the door and into your competitors' hands.
So I have actually seen some of this play out in various startups where I've been before. For example, at a startup I worked at previously, we had a beautiful product roadmap, customers were telling us what they wanted, and we had enterprise prospects ready to sign the moment we could support their compliance requirements. But we couldn't ship any of it. Why? Because we built the whole thing during the move-fast-and-break-things phase.
Our core system, which was Spark-based, was heavy on the duct tape and glue that was keeping everything together, and everyone was terrified to touch it. So every time a customer asked for a new feature, it meant diving into this spaghetti code base, absolutely petrified of breaking something an existing customer was already using.
Our deployment process was this multi-hour thing where we built bespoke packages per customer that were just super hard to maintain, and that required coordination across the entire team. And so we went from shipping features, changes, and improvements on a daily basis to shipping maybe once a month. And the cadence kept decreasing, just because of the complexity of the code base and the way that we were dealing with deployment.
The really hard part was that we had a pipeline sitting there, customers asking for specific features or waiting to be onboarded, but our ability to service them was greatly inhibited by the technical debt that we had in the system.
And of course, those deals would have been transformative for a small startup, and we watched them disappear as competitors moved in and took them away. And yes, technical debt remediation would have been painful and expensive. But we spent way more time trying to build on top of this crumbling foundation. In the end, we lost those deals anyway. The savings we got from not modernizing cost us multiples of that in lost revenue. And then we rebuilt it anyway. So there you go.
But this pattern is everywhere in mid-market companies. You're past that startup phase where you can just hack things together. You're not big enough to have a dedicated modernization team you can just throw at the problem. So you're stuck in this middle ground where you've got technical debt from your startup days still in production, you've got customers with real requirements, and you're getting squeezed in every direction.
And the symptoms are always the same: you get harassed by the sales team or senior leadership coming to ask you where the features are, the timelines slip, your product team's backlog is full of things that customers need but engineering says will take quarters to build. And so everyone gets annoyed with each other.
And no one can understand why things are going so slowly, but you end up with more meetings, less velocity, and everyone doing different things. It ends up being this trap that you find yourself in. So a company finds market fit. They raise enough cash to be comfortable and do what they want to do. And then they pour it into hiring more people and shipping more features. But what should they actually do? Well, they should really just take a chunk of that money,
and before they go and hire anybody, take the existing developers who understand the code base, sit down, figure out what the technical debt is, and work out what it would take to fix it up. Because you go from 12 to 45 engineers and expect productivity to triple, but it doesn't. It might go up 40% if you're lucky. It's like you've hired 33 people and gotten maybe five people's worth of additional output,
because you've got all this debt, you're all working on the same code base, it takes ages to onboard people, and people are breaking each other's features. That technical debt was manageable when you had 12 people, but it becomes crippling when you've got 45.
When you've got limited budget and limited time, you can't fix everything. So when should revenue blockers be your priority? Well, when you're losing deals specifically because you can't ship features fast enough, when your time to market is measurably slower than your competitors', when strategic initiatives are getting blocked by technical constraints. And yeah, if you're in a hyper-competitive market where feature velocity determines market share, then this is your problem. But if you're pre product-market fit, don't prioritize this at all.
In fact, why are you listening to me? You're listening to the wrong podcast. But that aside, welcome. Speed of learning matters more than code quality when you're still figuring out what to build. And strategic debt is fine. It's expected. The mistake is keeping that strategic debt around after you've found product-market fit. That's when it turns from an asset into an anchor. And so you need to understand when is the right time to pay down that debt.
So moving on, we're going to talk about operational drain, because this is the type of technical debt that is killing you slowly every single day. This is the tax that you pay just for keeping the lights on. While revenue blockers cost you opportunities you never capture, operational drain is burning your engineering budget on maintenance and firefighting instead of building new value. Stripe did a massive
study across thousands of developers and executives in different countries, and they found that developers spent 42% of their week working on technical debt and maintenance. So once you've dragged yourself up off the floor: you're paying full-time engineers, and nearly half of their time is just keeping things from falling apart rather than moving the business forward.
And there's another study from Scandinavia where researchers tracked developers over seven weeks with regular check-ins, and they found developers waste 23% of their time specifically on technical debt. So it's a bit of a sliding scale, but you can see that regardless of who you ask, people are still spending a large chunk of their work week just dealing with technical debt. Of course,
the thing that should actually terrify an executive is that in a quarter of cases, working around existing technical debt forces you to create new technical debt. And so it becomes a vicious cycle. The more debt you have, the more debt you create just trying to work with it.
Of course, and this is what I said at the start, managers do not necessarily understand what the technical debt is or what technical debt even means. And if you can't quantify it, a manager is never going to understand it. So it's an invisible tax. Your team knows they're drowning. Your engineering managers see the symptoms: missed sprint commitments, code quality issues, and declining velocity.
But the leadership doesn't see it, because nobody's translating "this code is a mess" into "we're setting a third of our payroll on fire and everyone's spending their time just keeping the lights on." You need to be able to translate that into CFO and CEO speak so that they can appreciate how much time and effort it takes just to stand still.
I mean, this is what it might look like in a company your size. You're growing, you're hiring, leadership is excited because you're finally getting the resources to scale. And so you double your engineering team and everyone expects output to double, or close to it at least. But of course it doesn't. For example, a company that I've worked with went through exactly this: doubled the team size, expecting productivity to roughly double. Instead it went up maybe 30 or 40%. Why?
Because all the new people walked into an environment where the deployment was manual and fragile, where there was minimal automated testing, so every change needed extensive manual QA, and where critical systems had knowledge silos: only one or two people understood how they worked, creating bottlenecks, especially when people were on holiday or otherwise out of the office. So we brought all these new engineers in ready to contribute, but six months later, they were spending most of their time
firefighting the build system, trying to figure out why the tests weren't working, and navigating various issues in a code base that everyone was too afraid to touch. I'm sure we've all been there. We had something similar at NASA as well, where things were put together in a rather ad hoc manner and people were just frightened to touch their production systems.
And so, in this case, the company I was working with at the time doubled their engineering expense and got maybe 35% more output. But that's not a scaling problem. That's a technical debt problem, because people didn't know what to do once they arrived, and the people who did know were too busy fighting fires to upskill them properly.
And so if you're trying to sell this to a CFO, let's make this concrete. Say you've got 50 engineers at an average salary of $120,000. That's a $6 million annual engineering budget. If a third of that capacity is being consumed by technical debt maintenance, you're burning $2 million a year on unproductive work until you fix it.
But here's how you should actually think about it. That $2 million isn't just wasted money. It's wasted capacity, because it's the equivalent of having 16 senior engineers on your payroll who produce literally nothing. Dead weight; they might as well not be there. No features, no improvements. They just stop the thing from blowing up. And unlike a bad hire, which you can fix, this doesn't get better on its own. It gets worse. The debt compounds,
the code base gets more complex, especially as people layer patch on top of patch on top of patch, and next year that 33% becomes 40%.
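To make that arithmetic concrete, here's a minimal back-of-the-envelope sketch in Python. It just restates the numbers from this section (50 engineers, $120,000 average salary, a third of capacity lost to debt); every figure is an illustrative assumption to swap for your own, not output from any study.

```python
# Back-of-the-envelope: what operational drain costs in payroll terms.
# All inputs are illustrative assumptions from the episode, not benchmarks.

engineers = 50
avg_salary = 120_000        # base salary only; fully loaded cost would be higher
debt_fraction = 1 / 3       # share of capacity lost to debt maintenance and firefighting

annual_budget = engineers * avg_salary            # $6,000,000
wasted_spend = annual_budget * debt_fraction      # roughly $2,000,000 per year
ghost_headcount = engineers * debt_fraction       # roughly 16 engineers' worth of zero output

print(f"Annual engineering budget:       ${annual_budget:,.0f}")
print(f"Burned on debt and firefighting: ${wasted_spend:,.0f} per year")
print(f"Equivalent unproductive headcount: {ghost_headcount:.1f}")
```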
So if you actually have to pitch this, here's a pitch that works with finance teams, because you can invest in modernization once or you can keep paying the operational tax forever. Let's say the modernization costs a couple of million dollars and takes six to nine months. So yeah, it's painful, it's obviously an investment, but you're currently burning $2 million annually in wasted productivity. So you break even in a year. And after that,
assuming you keep on top of the technical debt, it's pure savings forever. Now, of course, it's unrealistic to think you'll always stay on top of technical debt, but at least you can start from a clean baseline again. And don't frame it as cost savings. Frame it as capacity expansion. So you can say to your CFO, for example, this modernization liberates a third of our engineering capacity.
We're not asking to spend money; we're asking to unlock the equivalent of 16 senior developers without recruiting costs, equity dilution, or management overhead. And that's the pitch that gets this type of budget approved.
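And if you want to put that break-even argument into a simple model, here's a rough sketch. The $2 million modernization cost and the $2 million annual drag are the illustrative figures used above, and the idea that you only recover most of that drag, rather than all of it, is my own hedging assumption.

```python
# Rough break-even model for "modernize once vs pay the operational tax forever".
# Every number here is an illustrative assumption, not a benchmark.

modernization_cost = 2_000_000   # one-off investment, spread over six to nine months
annual_debt_tax = 2_000_000      # productivity currently burned each year
recovered_fraction = 0.8         # assume you claw back most, not all, of that drag

annual_saving = annual_debt_tax * recovered_fraction
breakeven_years = modernization_cost / annual_saving

for year in range(1, 4):
    net_position = annual_saving * year - modernization_cost
    print(f"Year {year}: cumulative net position ${net_position:,.0f}")

print(f"Break-even after roughly {breakeven_years:.1f} years")
```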
So here's something that doesn't get talked about enough, which is that technical debt is a retention issue. There's research showing that over 50% of engineers have left or seriously considered leaving a job specifically because of the technical debt burden. Think about what that means. You're spending six figures per engineer to recruit, onboard, and ramp them up, and they're leaving because the code base is
so painful to work with that it's not worth their salary. So now you're paying recruiting fees, you're dealing with the knowledge loss and you're stuck in this cycle where the debt makes people leave, which creates more knowledge silos, which makes the debt worse. So once again, it's a compounding problem and it's not just an efficiency problem. It's an existential threat to your ability to retain good people. And of course, a lot of hiring goes through networks and referrals and all that type of stuff. So...
If you get a bad rep, it's probably not going to help your ability to hire, especially when everybody knows everybody in this type of industry.
So when do you prioritize operational drain over everything else? Well, when your team is growing but your productivity isn't. When you're doubling headcount and seeing a 40% improvement instead of a doubling in output. When your sprint commitments are constantly missed despite reasonable estimates. And when your developers are explicitly telling you that the code base is killing morale. At that point, you should probably do something about it. And the signal itself is unmistakable: if adding people isn't adding proportional productivity,
Operational drain is your problem. And here's the thing about mid-market companies. You can't just throw bodies at the problem like an enterprise can because you need every engineer producing. You can't afford to have a third of your capacity lost to technical debt. So fix this and suddenly your current team can do 50 % more work. And that's better than any hiring plan out there.
So we'll get onto the third type, which is catastrophic risk. This is technical debt that creates proper existential business risk: your business may not exist if it actually comes to bear. This may be system failures, security breaches, regulatory violations, or operational meltdowns. This is the debt that shows up in SEC filings, regulatory fines,
crisis management war rooms, and probably the front page of the Financial Times. So at the start I mentioned a couple of examples, and we're going to run through these. The first one was Knight Capital. On August 1st, 2012, Knight Capital was the largest equity trader in the US, with 17% of the market share, built over 17 years. At 9:30 in the morning, the market opened.
By 10:15, Knight Capital had effectively ceased to exist. So what happened? Dead code from 2003, an old algorithm called Power Peg, had been left in production for the best part of a decade. When they deployed their updated trading platform to their servers, I think they had eight of them, seven of them got the new code and one of them didn't,
and a repurposed configuration flag accidentally activated the old algorithm. So that one server started executing erroneous trades, buying high and selling low repeatedly, 4 million trades in 45 minutes. The result was a $440 million loss. The stock price dropped around 75% within days. It required a $400 million emergency bailout,
the company was eventually acquired by a rival, ending 17 years of independence, and, to top it off, a $12 million SEC fine. And of course, as you can guess, because this is the episode where we're talking about it, the root cause was pure technical debt: dead code, manual deployment processes, no automated verification, no kill switches, no circuit breakers, and repurposed configuration flags. So it wasn't a technology failure. The technology worked as it was designed to work. It was a debt failure.
They had deferred the investment in proper deployment automation, test coverage, and code hygiene for years. It was fine until it wasn't. And then 45 minutes later, the company was dead.
In December 2022, Southwest Airlines' crew scheduling system, which was built in the 1990s, collapsed under weather disruption. The financial impact of that was $825 million to $1.1 billion in direct losses to Southwest Airlines, a $140 million DOT fine, which was the largest consumer protection penalty in US history, 30 times the previous record, nearly 17,000 flights canceled, 2 million passengers stranded,
and 80,000 customers lost permanently because they'd simply had enough and went elsewhere. Now, the system couldn't handle real-time data integration at the scale that was required. So when the weather disruptions hit on December 21st and 22nd, other airlines recovered in 24 hours. Southwest took a week. And the thing was, Southwest Airlines already knew they had a catastrophic risk issue, because it was flagged internally in 2018. The pilot union
warned in November of that year that the airline was one IT failure away from a complete meltdown. So senior leadership knew it, but they deferred the modernization. They reduced full-time tech workers by 27% in that time frame, and they saved maybe 50 to 100 million in modernization costs. But of course, it cost them roughly $1.2 billion when it failed.
At NASA, when the budget cuts hit, the calculus also became brutal. You have systems that are 40 to 50 years old. Like I mentioned, we were supporting the Voyager platform for a while. Critical to mission success. Running on hardware that hasn't been manufactured in literally decades. But when the funding runs out, or nearly runs out, security patches stop, maintenance stops, documentation stops. The systems keep running, but the risk exposure keeps growing exponentially.
And it's not that anyone actually wants to skip the maintenance, it's that budget realities force impossible choices. And over time, the technical debt accumulates and accumulates until you're one power supply failure away from losing a billion-dollar spacecraft. In 2019, NASA's OIG found 5,406 unresolved security patches, with 86% rated high or critical severity. Not because NASA doesn't care about security, clearly.
I've worked on a number of projects that improve the cybersecurity posture inside NASA. But the systems are so old and complex that patching them is nearly impossible without complete modernization, some of which we were budgeted to do and some of which we were not.
So when should you prioritize catastrophic risk? Well, it should be your number one priority when you're in a regulated industry with significant fine exposure, when you handle sensitive customer data, when you have revenue-critical systems with single points of failure, when you're running end-of-life software with unpatched vulnerabilities, when you're preparing for an acquisition or IPO, or when
you've had a number of near-miss incidents. Don't wait until you're Southwest Airlines. Don't wait until you're a financial services company whose automated trading goes south in minutes. Pay down that catastrophic risk debt before it pays you an unexpected visit.
So you've got three types of technical debt. How do you prioritize? Here's a good little way of doing it: take the business impact, multiply it by the risk reduction, and divide by the remediation cost times the time to complete. Run that calculation for every major piece of technical debt in your portfolio, then stack rank. Generally: catastrophic risk with low remediation cost, fix immediately. Revenue blockers with clear ROI, fix next. And operational drain with high ongoing costs you fix systematically.
But context, of course, also matters here. If you're in a hyper competitive market losing deals daily, then revenue blockers are probably higher on the list of importance. If you're in a regulated industry which has security vulnerabilities that might expose you, catastrophic risk is number one. So don't try and fix everything. Just fix the debt that has the highest business impact per dollar spent.
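To show how that scoring and stack-ranking might look in practice, here's a small Python sketch. The formula is the one described above, business impact times risk reduction divided by remediation cost times time to complete, and every debt item and number in the list is invented purely for illustration.

```python
# Hypothetical technical-debt portfolio, scored and stack-ranked.
# score = (business impact x risk reduction) / (remediation cost x time to complete)
# Every item and figure below is invented purely for illustration.

debt_items = [
    # (name, business impact $/yr, risk reduction 0-1, remediation cost $, months to complete)
    ("Dead code and repurposed config flags", 5_000_000, 0.9, 150_000, 2),
    ("Manual per-customer deployment process", 1_500_000, 0.4, 600_000, 6),
    ("Legacy scheduling system, single point of failure", 4_000_000, 0.8, 2_000_000, 12),
    ("Spaghetti core module slowing feature work", 800_000, 0.2, 400_000, 4),
]

def score(impact, risk_reduction, cost, months):
    return (impact * risk_reduction) / (cost * months)

ranked = sorted(debt_items, key=lambda item: score(*item[1:]), reverse=True)

for name, *figures in ranked:
    print(f"{score(*figures):8.2f}  {name}")
```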
A way to actually make this viable is an implementation strategy called the 20% rule. Allocate 20% of engineering capacity to technical debt reduction, not as debt sprints that compete with features, but built into how you work. Gene Kim of The DevOps Handbook recommends this, McKinsey found that top-performing organizations allocate 10 to 20% as a minimum, and
Shopify uses a 25% rule. And over time you're tripling your feature delivery capacity by systematically paying down debt. But here's the trick. When you're trying to sell it upstairs, when you're trying to get the C-suite execs to sign it off, don't call it technical debt work, because they'll ask you why it's there in the first place. Call it future velocity investment or capacity expansion. Frame it in business terms,
and then build it into feature delivery. When you ship a feature, also modernize the subsystems it touches. This adds 20% to short-term delivery time, but accelerates future work by 40%.
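As a toy illustration of why that trade can pay off, here's a sketch that uses the 20% short-term slowdown and 40% later acceleration mentioned above; the two-quarter paydown window, the eight-quarter horizon, and the flat velocities are all assumptions of mine, so treat it as a sanity check rather than a forecast.

```python
# Toy model using the episode's figures: modernizing as you ship adds ~20% to
# short-term delivery time but accelerates later work by ~40%. The two-quarter
# paydown window and the flat velocities are assumptions, not measurements.

quarters = 8
baseline = 100            # arbitrary feature "units" per quarter
paydown_quarters = 2      # assumed length of the slower period

without_paydown = baseline * quarters

with_paydown = 0
for q in range(quarters):
    if q < paydown_quarters:
        with_paydown += baseline * 0.8   # roughly 20% slower while paying debt down
    else:
        with_paydown += baseline * 1.4   # roughly 40% faster once the debt is paid down

print(f"Without paydown: {without_paydown:.0f} units over {quarters} quarters")
print(f"With paydown:    {with_paydown:.0f} units over {quarters} quarters")
```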
So wrapping this episode up: technical debt is compound interest working against you. Every quarter you defer, the interest payment grows. You have three choices. Choice one: maintain the current course and pay, say, 8.5 million over three years in wasted productivity and opportunity cost, with increasing risk. Choice two: strategic investment, spend 3 million now, generate 4.2 million in net benefit over the next three years, and dramatically reduce risk. Or choice three:
do nothing and hope that you're not the next Southwest Airlines or Knight Capital.
The companies that treat technical debt as a strategic liability requiring active management are the ones that are still competitive five years from now. The companies that don't are the cautionary tales that end up in podcasts like this one. So your technology estate is either appreciating or depreciating. Technical debt is the interest rate. The only question is, are you paying it or is it paying you?
So that's it for this episode of Engineering Evolved. If this episode helped you think differently about technical debt, or if you're going to use any of these frameworks in your next budget conversation, send me a message. I'd love to hear how it goes. And if you know a CTO or a VP of engineering who's struggling to get funding for modernization, share this episode with them, because the language matters. Join me again in the next episode, and until then,
Fly your flag and remember, action trumps perfection. We'll talk soon.