Escaping Technical Purgatory
S1 E1

Tom Barber (00:00)
Welcome. Let me give you a quick scenario. VP of Engineering walks into a decent sized company on a Monday morning and they have a proposal. They've been reading all the blogs, following all the right thought leaders, and now they want to adopt microservices. They want to implement radical agility. They want to copy what Netflix does. They want to copy what Spotify does or Google does.

And here's what happens: six months later, they're deeper in a hole than when they started. The deployment times have got longer, not shorter. Their bug rate has gone up. Half of their engineering team is threatening to quit. And the leadership is asking very pointed questions about why they just spent half a million dollars to make things worse. But this isn't a story about incompetence. This is a story about companies in the missing middle, companies with 200 to 1,000 employees who are trying to follow the playbooks that were never written for them.

And so today, we're going to talk about why you're stuck, why the advice you're getting is wrong, and what you actually need to do about it. Let's get into it.

So welcome. If you're a CTO, VP of engineering or a technical leader at a company with 200 to a thousand employees, this is the podcast for you. Not for the startup founders who are still figuring out product market fit, not for the enterprise architects who are managing teams of 500 engineers. For you, the people stuck in the middle. This is Engineering Evolved. My name is Tom. This is episode number one, and we're going to look at why midsize companies are stuck in technical purgatory and how to escape it.

Now, let me tell you about a challenge that is very real and why it's so hard for people to be able to fight their way out of it. You're probably somewhere between 200 and 1,000 employees and you've got a tech team of maybe 20 to 150 people that you're expected to use to be able to support the users inside of your organization. You've achieved product market fit and you're generating real revenue.

You're not a startup anymore, but you're also not IBM. And the problem is that this ends up being a position where you end up like basically in hell because you've got tech debt that's just suffocating. You've got the MVP that you built three years ago. It's now your core product. Don't ask me how I know this, but it's being held together with basically a load of duct tape, some insufficient documentation and prayers and good thoughts from the engineering team involved.

You've got business logic that's in stored procedures, where it should never have been stored. And only one person is actually thinking about how that operates. And that person is also considering quitting. Then you've got the hiring trap. The hiring trap is also very real. You can't afford 500k a year Google engineers, but you also can't hire the juniors because they don't have enough insight into how your organization runs.

And so you're stuck with people who are okay. The people are pretty good. And they're interested in leaving every 18 months because they find a different role somewhere else that's sort of better suited for where they want to go.

You've got a process that also slows you down. You've got a scrum that was implemented because you're supposed to implement Scrum or some agile methodology. Two-week sprints take three weeks because this is just what happens when you add more things to a sprint. Retros are more like therapy sessions where everyone sits around and commiserates or figures out what they haven't achieved rather than the positives that have come from it. What you actually end up finding is that nothing actually changes.

And so, what you end up with from a management layer is that you've got your engineering managers who end up coming from the best individual contributors. They got promoted because they were good at writing code, but they're not necessarily good managers. And now they're drowning and everyone's making it up as they go along. And so it ends up being a trap. You end up stuck because the company is successful enough that it won't die, but struggling enough that it will never grow.

And so you're stuck. And the reason that you end up in this is because of the decisions that were made back at the start. So we're going to look at how to get yourself out of that and how to get yourself onto the straight and narrow to allow for more seamless development cycles, the ability for you as an organization to grow, expand, and get to that enterprise level without killing yourselves in the process.

So moving on, we're now going to look at why startup methodologies fail at scale. And this is going to be pretty specific, but I want to demonstrate to you why these things don't scale: as your company scales, the startup methodologies hold you back. And the brutal truth is that you can't have your startup speed back. Trying to do this will end up destroying your company.

Startups are designed to break things and fix them quickly. They're designed to test out ideas. You've got a small user base. It doesn't matter too much. And so if you break stuff and you've got 100 users, you fix it, you send out an email apologizing and you move on. If you've got 20,000 users on your system and you're running some SaaS platform in a mid-scale startup, you've got an engineering problem there: if you break things, you're going to break things for a much larger user base. Support ends up overwhelmed and accounts get lost because people get frustrated and leave. And there's nothing worse, especially as you're trying to grow, than Googling your own company name and finding it in a Reddit thread where someone's complaining about the service that they've got whilst using your platform.

And the problem is you can't test your way out of poorly architected systems. At the end of the day, if you have a test suite that's trying to build resilience into your platform, 100% coverage on a house of cards is still a house of cards. And so as you build these systems out, the testing and the design of the system become more and more flaky and it becomes harder to pull yourself back.

So another sort of real example that I've worked through over the years is you end up in an engineering team. You've got maybe 50 engineers. Maybe everyone works remotely. You've got contractors that come and go. You've got this distributed team because this is how tech companies work these days.

The problem with that is you've got hallway conversations that break down completely. The water cooler talk that you would get ordinarily just doesn't exist anymore. So the communication problem has become more and more prevalent over the previous years. And so what you end up with is tribal knowledge. You've got groups of people around your organization who know specific things about specific systems.

Everyone has different understandings about how these things work or how they should work or how they were designed to work. The person that designed it has quit. The people that are maintaining it sort of understand it. So you end up with a lot of sort of abject chaos around the systems that have been built. They're not documented properly and the knowledge has sort of left the organization as the thing has scaled.

Features work in isolation, but they also create chaos when they interact because they've not been scoped out properly and people haven't got together to communicate the issues that they have faced. And so what do you need to be able to deal with this? You need documentation. First and foremost, you need documentation. And it's often forgotten about in the mad dash to be able to get stuff into production. But that's obviously key to the way that organizations should work.

But you also need architectural decision records. And the reason for this is it's very easy to make a decision that impacts a vast swathe of either your users or the architecture that underpins them. But why was that decision made? It doesn't necessarily mean it's a negative decision. You can obviously have positive architectural decisions that are going to change the way that the platform works. But what was the thinking behind that and who made those decisions?

Did anyone else sign off that decision? Was it someone just going rogue? In the future, you need to be able to look back at when these decisions were made and make sure that you can track them. You've also got ownership models. Like we were talking a minute ago about tribal knowledge and understanding how these systems work. As your company scales, you're invariably going to have different groups owning different parts of the architecture.

But you need to know who owns them and what the boundaries are so that then you can implement a methodology that's going to work for you and your organization. If you're looking at that from a team interaction perspective, books like Team Topologies are excellent because they explain not only the interactions between the different systems, but also the interactions between the different teams and the API interface that those teams should have so that they can hand things off more effectively to other people in your organization.

As you scale, this becomes more and more important. But of course, the startup playbook says that documentation is waste. And so this is where the trade-off comes in: you need to catch up on the legacy stuff whilst also implementing things going forward.
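The architectural decision records described above can be as small as a handful of fields: what was decided, why, who decided it and who signed it off. The following is a minimal sketch in Python, not a standard ADR format. The class name `DecisionRecord`, its fields and the sample decision are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass, field
from datetime import date


@dataclass
class DecisionRecord:
    """One architectural decision: what was decided, why, and who signed off."""
    number: int
    title: str
    decided_on: date
    context: str      # the problem that prompted the decision
    decision: str     # what was chosen
    decided_by: str   # who made the call
    approved_by: list[str] = field(default_factory=list)  # who signed it off

    def is_signed_off(self) -> bool:
        # A decision with no approver is the "someone going rogue" case.
        return len(self.approved_by) > 0

    def render(self) -> str:
        approvers = ", ".join(self.approved_by) or "nobody (not signed off)"
        return "\n".join([
            f"ADR-{self.number:04d}: {self.title}",
            f"Date: {self.decided_on.isoformat()}",
            f"Context: {self.context}",
            f"Decision: {self.decision}",
            f"Decided by: {self.decided_by}",
            f"Approved by: {approvers}",
        ])


# Hypothetical example record; none of these values come from the episode.
adr = DecisionRecord(
    number=1,
    title="Use one API style for new services",
    decided_on=date(2024, 1, 15),
    context="Teams picked different API styles in isolation.",
    decision="New services expose REST unless a later record says otherwise.",
    decided_by="platform team lead",
    approved_by=["VP of Engineering"],
)
print(adr.render())
```

Keeping the rendered text in the same repository as the code it affects is one way to make sure the record is still findable when someone later asks why the decision was made.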

Then you've got a dependencies problem, and this is dependencies at team scale, not dependencies inside of your software. And so you end up with a scenario. So you're building out a microservice platform. You've got five to 15 teams all depending on each other. And so you've got teams like A, B and C all waiting for team D to update the API. Team A then builds their stuff one way, team B builds it a different way. With the dependencies between the different teams and groups inside of your organization, it is not clear who's building what, why, and in which way, unless of course you've got dependency management and sequencing that allows for the teams to be able to understand this. Because otherwise nobody knows who's blocked on what.

This brings in sort of more constructive architectural governance and a hierarchy that makes sense as the team grows. Someone needs to say, build a singular platform, not five different versions of things that stand alone. Building that single platform will give you a coherent structure and understanding of which teams do what, how those dependencies between the teams play out and who to go and ask when there's a requirement to be able to change things.
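One way to make "who's blocked on what" visible is to write the team dependencies down as a graph and sort it. The sketch below uses Python's standard `graphlib` module. The team names mirror the A, B, C and D example above, while the `waiting_on` map and the helper function names are assumptions made for illustration.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical map: each team lists the teams it is waiting on.
waiting_on = {
    "team-a": {"team-d"},
    "team-b": {"team-d"},
    "team-c": {"team-d"},
    "team-d": set(),
}


def delivery_order(waiting_on):
    """Return teams ordered so nobody ships before a team they wait on."""
    return list(TopologicalSorter(waiting_on).static_order())


def blocked_by(waiting_on, team):
    """Return the sorted names of teams directly waiting on `team`."""
    return sorted(t for t, deps in waiting_on.items() if team in deps)


def has_cycle(waiting_on):
    """True if teams are waiting on each other in a loop (a deadlock)."""
    try:
        TopologicalSorter(waiting_on).prepare()
    except CycleError:
        return True
    return False


order = delivery_order(waiting_on)
print("ships first:", order[0])
print("blocked on team-d:", blocked_by(waiting_on, "team-d"))
print("circular wait:", has_cycle(waiting_on))
```

The value is less in the code than in the map itself: once the dependencies are written down in one place, sequencing the work becomes a mechanical step rather than a hallway conversation.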

Then from a startup perspective, you're still looking at technical debt. Obviously, technical debt is a big thing, especially when startups are moving forward fast and iterating through stuff. And so you have many different types of debt. Everyone sort of thinks about it as being a code problem. And so you end up with a certain amount of code debt. Every project ends up with code debt. It is a guarantee that you will end up with some and you have to figure out how to go back and deal with it.

And you can't necessarily continue refactoring old code while shipping newer features, but you also can't rewrite the whole thing from scratch. And so there ends up being a trade-off with code debt that you have to be able to balance out, which allows for the developers to go back and spend time deliberately going through that debt, making sure that they've got deliberate debt cleanup time that's not just 20% of the time that's allocated to them to clean up debt. They need to have a sprint that's set aside for cleanup of debt. That's basically it. And you can get through an awful lot as long as you can figure out what bits of debt you're going to tackle in which sprint.

The other one I mentioned earlier as well is the documentation debt. And so you think about a scenario where you're trying to implement an authentication system and someone says, well, how does it work? And someone's like, well, go and talk to the developer that has implemented it. So the developer goes and asks Sarah, for the 47th time, how to implement security authentication inside of a microservice. She's quite sick of explaining it 47 times. She's actually just basically sitting there updating her resume. Or to make it actually worse, she gets so sick and tired of being asked how authentication works every time that she's actually just left. And now you're asked to reverse engineer a system completely on your own, with no understanding of how it works. It fails silently, slowly. And obviously, if you're dealing with a platform that is external facing, it can also be an expensive failure.

And so making sure that your critical paths, your critical processes are documented is important. And again, it's like test coverage, which we'll get onto. It doesn't have to be a hundred percent documentation coverage, but you need to make sure that when you're building this stuff out, you've got a decent amount of documentation coverage on the critical path, and that will set you up nicely for the future.

Then you've got process debt.

Now, process debt is definitely a thing, especially as groups start to expand and teams become more self-sufficient. So for example, you might end up with a process where team A deems code reviews optional. They are not optional, but like let's just say team A thinks code reviews are optional. Team B says you need three approvals.

Team C doesn't know code reviews even exist. And so all your code inside of your organization ends up being built to different standards and methodologies. And that's confusing, not only for the developers, but also for the testers or the people that have to be able to sign this stuff off. Like, has my code been reviewed? Has the code been reviewed? No one will actually know until you get those policies and procedures in place. And these process debts definitely become more acute as your organization grows.

They also exist for deployment schedules and incident responses, where different teams decide that they're going to deploy their code on different days of the week, times of the year, different cadences, all those types of things. And making sure that you've got a process in place that makes sense for your organization is going to be something that's important.
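The code review example above (one team treating reviews as optional, one requiring three approvals, one not reviewing at all) can be checked mechanically once each team's policy is written down. This is a minimal sketch, assuming an organization-wide baseline of one approval. The `ReviewPolicy` type, the `ORG_MINIMUM_APPROVALS` constant and the team data are hypothetical.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ReviewPolicy:
    team: str
    required_approvals: int  # 0 means reviews are optional or do not happen


# Assumed baseline for illustration: every change needs at least one approval.
ORG_MINIMUM_APPROVALS = 1


def policy_gaps(policies, minimum=ORG_MINIMUM_APPROVALS):
    """Return the names of teams whose review policy is below the baseline."""
    return [p.team for p in policies if p.required_approvals < minimum]


policies = [
    ReviewPolicy("team-a", 0),  # thinks reviews are optional
    ReviewPolicy("team-b", 3),  # requires three approvals
    ReviewPolicy("team-c", 0),  # does not know reviews exist
]
print(policy_gaps(policies))
```

The same shape works for deployment cadence or incident response: record what each team actually does, compare it to one agreed baseline, and the list of gaps is the process debt.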

Moving on, now we've got architecture debt. Now, I don't know how many people have seen stuff like this, but of course, if you end up joining a growing company, you're going to find times where you've got multiple different platforms, procedures, ways of doing stuff in place. And this is the architecture debt. You end up with a microservice that implements a REST interface, you've got a different microservice that implements GraphQL. One implements gRPC and one is using SOAP because the guy's apparently from the 80s and really enjoys using SOAP interfaces. And again, it comes down to things like databases. You end up with Postgres, MySQL, MongoDB, all just deployed throughout your architecture because the teams have grown up in silos and they've picked the stuff that felt more comfortable to that team, not because the architecture had been thought through properly.

You end up with critical data sat in Google Sheets as opposed to inside of a database. And you've got deployment chaos that involves Kubernetes, Heroku, bare EC2, because some guy just likes SSHing into a box to, you know, unzip a tarball in there. And you've got a mysterious Jenkins job that no one dares touch because no one actually knows how Jenkins works anymore. Now everyone uses GitHub Actions.

And so when you end up with this technical architecture debt, you end up in a scenario where you've got this sprawling architecture that always costs more, is always harder to maintain and is always more flaky because you're split across so many different services. And that also leads directly to the tribal knowledge aspect of it, where a single team will understand the capabilities of their team itself, but not the rest of the organization. And so you end up with these disjointed architectures that you have to then engineer yourself out of.
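A first step towards engineering your way out is simply counting the sprawl. The sketch below takes a service inventory and reports how many distinct API styles, databases and deployment targets are in use. The technology names come from the examples above, but the service names, the inventory layout and the `sprawl` function are hypothetical.

```python
from collections import defaultdict

# Hypothetical inventory; in practice this might come from a service catalog.
services = [
    {"name": "billing", "api": "REST", "database": "Postgres", "deploy": "Kubernetes"},
    {"name": "search", "api": "GraphQL", "database": "MongoDB", "deploy": "Heroku"},
    {"name": "ledger", "api": "gRPC", "database": "MySQL", "deploy": "EC2"},
    {"name": "legacy-orders", "api": "SOAP", "database": "Postgres", "deploy": "EC2"},
]


def sprawl(services, dimensions=("api", "database", "deploy")):
    """Map each dimension to the sorted list of distinct technologies in use."""
    seen = defaultdict(set)
    for service in services:
        for dim in dimensions:
            seen[dim].add(service[dim])
    return {dim: sorted(seen[dim]) for dim in dimensions}


report = sprawl(services)
for dim, choices in report.items():
    print(f"{dim}: {len(choices)} in use ({', '.join(choices)})")
```

A count on its own does not say which technology to keep, but it turns "we have a sprawling architecture" into a number that can be tracked as it comes down.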

Number five, we've got testing debt. I've seen this enough times. You end up with every deploy being utterly terrifying. Someone pushes the go button and then the next question is like, what broke? And so without the ability to run tests, either automated or manual, it becomes a really, really troubling time to be able to deploy that type of stuff.

And so testing is something that you don't necessarily use when you're a startup. But as you start to scale, testing becomes paramount because otherwise features take twice the time to build. And then you've got someone manually testing every time you deploy it. And that's slow going. And you also end up with engineers who are super afraid to actually touch systems because they're worried that they're going to change something and then just make the whole thing blow up.

And so taking that level of testing and then turning it into something, again, it becomes a process. Developers need to be able to write tests. They also need to be given the time to write the tests because that's often not factored in. But then you also need to have a test engineer function that validates that those tests are working, but also does the manual testing. Normally products will undergo both automated testing and manual testing to make sure that they're going to work both from a standalone unit test perspective and also from an integration perspective. You need to make sure all of that is accounted for.

Then you've got knowledge debt, which is missing context. And this goes back to the architectural decision records, for example. Why are we using this architecture? Was it designed that way? Did people put thought and effort into designing it, or was it just like that when you started? And if it was just like that when you started, is it still applicable to what you're trying to do today? Then, why have you built a feature? And this is why JIRA tickets and issue tracking, all that type of stuff, is important, because you don't understand the context, the business rationale, unless of course it's captured somewhere. So making sure that that documentation exists, making sure it's accessible and making sure that people know where to look for it is also important. As is putting some of this information into the code. So when you're looking at the code, you understand why, what ticket it was assigned to or what the sprint goal was or that type of thing.
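Putting the context into the code can be as light as a comment that names the ticket and the reason. The sketch below shows one such comment and a small helper that lists the ticket IDs a source file mentions. The `PAY-142` ticket, the `retry_limit` function and the `PROJECT-123` ID shape are assumptions made for illustration, and the pattern would also match other strings of that shape (for example `UTF-8`).

```python
import re

# Assumed ticket ID shape: uppercase project key, a hyphen, then digits.
TICKET_PATTERN = re.compile(r"\b[A-Z][A-Z0-9]+-\d+\b")


def ticket_references(source_text):
    """Return the distinct ticket IDs mentioned in source code, in first-seen order."""
    seen = []
    for match in TICKET_PATTERN.findall(source_text):
        if match not in seen:
            seen.append(match)
    return seen


# Hypothetical source file content with the "why" captured next to the code.
example_source = '''
def retry_limit():
    # PAY-142: capped at 3 because the payment provider rate-limits retries.
    return 3
'''
print(ticket_references(example_source))
```

A file that returns an empty list is not necessarily a problem, but for critical paths it is a hint that the business rationale lives only in someone's head.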

And also people just don't know what's safe to change and what's not. Like what's actually in use, what's not in use, what's used in which components, all those types of things. It's a compounding effect. All the debts make each other worse. If you've got missing docs, you can't onboard properly. You can't scale the team. The engineers have to answer all the questions because no one knows what's going on. That leads to burnout. People leave. You lose knowledge. You have more knowledge debt. It's a self-fulfilling prophecy and compounding in itself.

You've also got missing process leading to different builds and architecture debt, harder maintenance leading to code debt, harder testing leading to testing debt. You end up in this death spiral of operation and it's painful to see, it's painful to be part of and something that no engineer wants to be involved in. And so if you want to keep your tribe happy, making sure that you have the ability to be able to dig them out of that hole is paramount.

Okay, now we're going to move on to why you can't go completely the other way. So we're going to look at enterprise engineering playbooks and why this will also kill your company. So you can't use startup playbooks because your company's too big and you ended up with too much debt. You also can't use playbooks designed for 10,000 to 100,000 employees when you've got 50 to 100 engineers. It doesn't scale the other way either.

You know, there's times where I've worked with organizations that are small to medium sized, 300 people companies, for example, and they've been wanting to implement change management. So they've got an advisory board. To get stuff into the advisory board, you have to give 72 hours' advance notice. Inside of that documentation, you could have risk assessments, rollback plans, sign-offs, all that type of structure to make changes to anything that's getting close to production.

The result of course is you go from daily deploys and rapid application development to weekly deploys, to bi-weekly, to monthly, because no one wants to fill in the documentation and there's so much of it, it becomes suffocating. The overhead of approval becomes greater than the overhead of batching those changes. And so the velocity collapses, engineers get frustrated and leave, and now the company ends up in trouble because they can't get the changes into a production environment because no one wants to go to the change advisory board with all the changes that need to get in there. And you can't afford for 20% of your engineers to be doing process management. I mean, that's also not a sustainable path. And so you have to find a trade-off.

If we look in the Accelerate book, there's an awful lot of research that they've done in there about the fact that change approval boards, change management boards, don't actually increase velocity or product stability, because the change advisory board doesn't necessarily have enough insight into what's going on to actually determine whether or not the changes are any use. They are just there as gatekeepers to stop stuff going into production. So that becomes a problem.

Then, as you're scaling to an enterprise style method of operation, you end up with a specialization problem. As you scale, it's very tempting to go, okay, well, I need a database administrator because we've got databases and we need someone to be able to administrate that stuff for us. But they become a bottleneck for all the database changes because at that point, everybody is then just sending the database changes to the DBA rather than implementing them themselves. They have to get sign-off. They have to go through Liquibase or whatever. It becomes a bottleneck for everything in the business. And so that means that then feature generation and feature implementation slow down.

You've got a similar story when it comes to hiring a security engineer. So you've got a review process and a security engineer has to go through 10 new features every week. Each review takes a few days because they need to understand what's impacted, what the change is actually going to affect in terms of blast radius, what's exposed to the end user, all that type of thing. And so two-week sprints then become three-week sprints because you're all waiting for the review to come from the security engineer. And so, as you build these things out, it's very tempting to get specialists in. And for some things, clearly it is the right call. But if you can avoid it, do so. Get generalists, get full stack developers, get people that know enough to be able to implement the things that you need across a wider range of components in your system.
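The security review bottleneck above can be made concrete with a little arithmetic. The sketch below assumes a single reviewer, a five-day working week and two days per review. The episode only says "a few days", so the two-day figure and the function itself are illustrative assumptions, not measurements.

```python
def review_backlog(arrivals_per_week, days_per_review,
                   reviewers=1, working_days=5, weeks=4):
    """Return the number of features still waiting for review after each week."""
    # Reviews one team can finish per week, e.g. 1 reviewer * 5 days / 2 days each.
    capacity = reviewers * working_days / days_per_review
    backlog = 0.0
    history = []
    for _ in range(weeks):
        backlog = max(0.0, backlog + arrivals_per_week - capacity)
        history.append(backlog)
    return history


# 10 new features a week, one reviewer, an assumed 2 days per review.
print(review_backlog(arrivals_per_week=10, days_per_review=2))
# prints [7.5, 15.0, 22.5, 30.0]
```

Under these assumptions the reviewer clears 2.5 reviews a week against 10 arriving, so the queue grows by 7.5 features every week. Matching the arrival rate would take four such reviewers, which is the cost that spreading review knowledge across generalists is meant to avoid.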

So then we move on as well to Enterprise Playbook's decision-making 101. And so instead of something that would ordinarily take a day or should take a day, it ends up taking a month because you've got to schedule an architecture review board meeting. Turns out some guy who's supposed to be on it is on vacation. So you wait a couple of weeks for him to come back from vacation. Then when he's back from vacation, you do the review.

Some changes get requested, which is obviously completely normal. You make the changes, you try and schedule another review, two other people can't make it, you wait two more weeks. And so it takes like three months to get stuff into production. At which point the market's moved on or the competitors shipped a similar feature or the engineers have just gone, I'll tell you what, I'm out. I'm going to move somewhere that's less onerous from a bureaucracy perspective.

And so moving so slowly, because you've implemented an enterprise playbook for decision-making, that the company dies is obviously a reasonably catastrophic mistake to make and something you also have to work towards mitigating. It's not to say that you don't have a decision-making tree. It's just, you need to make it flexible enough so that you can still make decisions without having a massive impact as you continue to grow as an organization.

Then you've got a tooling problem as well. So as organizations grow to large scale adoption, they want to start using things like JIRA because that's apparently how you professionalize these things. And so you spend like six months configuring workflows, boards, custom fields. You've got a JIRA administrator whose job it is to administrate JIRA on a full-time basis. That's literally all he has to do. And it's a likely scenario because JIRA in itself is full of a crazy amount of screens and configuration, and people want to use it in different ways. And so you'll end up having someone that's reasonably dedicated to administrating JIRA. I don't know of an engineer who actually speaks highly of JIRA. I'm sure there are a few, but engineers generally hate it. It's slow, complicated and gets in the way. You don't need enterprise tooling. What you need is simple and effective tooling.

And so, if you're using GitHub and GitHub Issues works for you, or you want to wrap it with ZenHub or something along those lines, it keeps it simple, flexible and everything in the same place. Your engineers will thank you for it and the velocity will generally be quicker. And there's a million different comparable tools that users will end up having, tools that are forced upon users as opposed to stuff that they actually want to be able to use in a day job. And it's finding that balance between the tools that actually make the organization as a whole more effective and the tools that are effective for the development team so they can get their work done.

Then there's also, for example, the framework problem. As you start implementing enterprise playbooks, you end up with a smallish company that wants to implement something like full SAFe. And so you've got PI planning every quarter, release trains, solution trains, these new vocabularies that are alien to a lot of developers and PMs in smaller businesses. And so you end up with a smallish organization that spent a huge amount of time implementing a framework, so much time that they've stopped actually building the features that they were interested in building in the first place, for very little gain, other than the fact it makes the senior leadership feel more at ease. What they actually needed was clearer priorities, good communication, and someone with authority to make decisions. But they got sold an enterprise framework instead. And you see this in a number of different places and it's something that's easy to avoid.

Okay, so moving on to what you actually need. And this is an interesting discussion to have in itself because what you want to be able to do is build your own playbook. The uncomfortable truth is there really is no playbook for that medium-sized company. No one wrote one and they're often so bespoke that you need to write your own anyway.

The startup book ends at 50 employees, the enterprise book starts at a thousand people, you're somewhere in the middle. And so you need to create your own playbook. It doesn't mean that you're screwed, but stop looking for someone else's solution. You need to build your own.

So when you come to building out the framework, there's three real questions for every decision that you want to make inside of an organization.

Question number one, what problem am I actually solving? Not what problem is Spotify solving, not what problem is Google solving, what problem am I solving for my company at my scale? And if you can't answer that question, why are you implementing it?

Question number two, can I do less? Startups move fast by doing less. They do the stuff that's important for them as a startup. You can't move that fast, but you can do less. Every process, every tool, every framework comes with a cost. So what's the simplest thing that you can implement that solves the problem that you're trying to solve? Simple as that.

And the third question in the framework is what breaks if this fails? If you're in a small company, one feature breaks, you fix it. At your company, one feature breaks, you lose customers and you lose revenue. But also if you move too slow, your competitors win. And so you have to be able to find the balance inside of your company where you've got that risk or reward trade-off: if a component isn't fully fleshed out or if it falls over when it's deployed, what the knock-on impact of that is to you, to your business. And that balance is different for every single organization.

So you end up with a deployment process question as well for your playbook. Like, what is the approach for deployment? If you're a startup, you chuck stuff into production, whatever. Literally, you've got a CI/CD process. It just spins that stuff out into production. You let people test it. There's no approval needed, and they move fast, and they break things. But because they're small, agile, they've got a small user base, that deployment process works for a startup. In the enterprise approach, as we discussed earlier, you end up with a change advisory board. You've got 72 hour notice to get that documentation ready. You've got risk assessments. You've got layers of approvals and that takes time.

So what's the approach that you look for? Well, your approach is deploy to production daily or multiple times per week. But you also then have tests because you've looked at the test debt and gone and fixed some of it. So you've got required automated tests that must pass. You've got mandatory code reviews from one developer to another so that everyone has to have their code signed off. You've got monitoring and a rollback plan that's documented. That's important because one thing is implementing the rollback plan and another one is remembering how to run it at 3 a.m. on a Sunday night because your stuff's fallen over. And for high risk changes, get that extra review.

Like get the review, it's not a problem, but just don't have that as a mandatory step for literally everything you're trying to get through. And this is faster than enterprise, but it's safer than a startup. And that makes it right for a mid-sized company that's in that middle.
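The mid-sized deployment approach described above boils down to a short list of gates: automated tests pass, one other developer has reviewed the change, a rollback plan is written down, and only high-risk changes need an extra review. The following is a minimal sketch of that rule set. The `Change` type, its field names and the choice of two approvals for high-risk changes are assumptions for illustration.

```python
from dataclasses import dataclass


@dataclass
class Change:
    tests_passed: bool
    approvals: int                  # reviews from other developers
    rollback_plan_documented: bool
    high_risk: bool = False


def deploy_blockers(change):
    """Return the reasons a change cannot ship; an empty list means deploy."""
    blockers = []
    if not change.tests_passed:
        blockers.append("automated tests must pass")
    if change.approvals < 1:
        blockers.append("needs a code review from another developer")
    if not change.rollback_plan_documented:
        blockers.append("rollback plan is not documented")
    # Only high-risk changes pay for the extra review (assumed: two approvals).
    if change.high_risk and change.approvals < 2:
        blockers.append("high-risk change needs an extra review")
    return blockers


routine = Change(tests_passed=True, approvals=1, rollback_plan_documented=True)
risky = Change(tests_passed=True, approvals=1, rollback_plan_documented=True,
               high_risk=True)
print(deploy_blockers(routine))
print(deploy_blockers(risky))
```

The design point is that the extra review is conditional: a routine change passes with one approval, so daily deploys stay cheap, while the heavier check applies only where the blast radius justifies it.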

So stop asking, what does Netflix do? What does Spotify do? What's the industry standard? If you're on LinkedIn reading Spotify engineering blogs, great. Take that information and take what they do and understand what they're doing, why they're trying to do it. But also appreciate that Spotify's engineering is there to cover off the millions of users that they have around the globe from a latency perspective, from an uptime perspective, from an access perspective. They have to take into account a lot more than the average company. Same with Netflix, obviously. This is why the SREs get paid an absolute fortune, because their livelihoods are wrapped up in the always on-ness of Netflix. If it was down half the time, they would lose all their subscribers. And so they have to do one thing, but you do not have to run to that same model. And so instead, you start asking, what do we actually need?

What can we actually execute? And what's the minimal viable version of the thing that you're going to deploy? And that can be one component in a system, not the whole platform. You've got your running platform. How do you iteratively update, upgrade and deploy different bits of software on an ongoing basis that allows for your product to continue to move forward without having all the engineering overhead that Netflix or Spotify or Google or whoever do?

Because you're not building for where you want to be, you're building for where you actually are with a path to where you're going. So don't like block yourself off from moving forward, but build with an eye for the future, but for where you actually are now.

Okay, so that was episode number one. You have to remember, it's not your fault. You succeeded as an organization, you built something that people want to use, leverage, consume, and you grew. And that's amazing. But growth in itself just creates a world of new problems. Things that you haven't appreciated or understood or thought about suddenly rear their head and become mission critical issues with the way that your organization grows going forward.

And the advice that you get when you look at these things, either for startups or for scale enterprises, is just totally different as your organization scales. And the reality is startup advice assumes that you're still trying to find product market fit when you're not. Enterprise advice assumes that you have enterprise resources when you don't, and you won't for a long time until your organization becomes that huge company that requires a globally distributed team of engineers who are working in data centers around the globe. That's a long way off.

So what you need is something different, something that can be customized to where you actually are. And in some respects, it's harder because you can't just copy Netflix, Spotify, your favorite unicorn or FAANG. You have to be able to create your own bespoke playbook. And you must think critically about what works for your size and also assume that you're going to grow. Because that way you can start to forward think some of these playbook ideas and make sure they're going to work as your team gets larger over time.

But the good news is once you stop trying to be something that you're not, stop forcing playbooks that don't fit, start building systems that match your actual reality, things get better and they get better quickly. You can escape technical purgatory, but not by following someone else's map.

So I hope that you've had a good time. I hope that you found some of this podcast useful. I'll be back with episode two shortly. My name is Tom. This is Engineering Evolved, and I will speak to you all soon.

