You Don't Need Kubernetes: Right-Sizing Platform Engineering for Mid-Size Companies
S1 E11

You don't need Kubernetes.

I know that's heresy in 2025, but hear me out.

At NASA, we had brilliant engineers spinning up complex orchestration systems for teams of
five people.

When funding cycles shifted, those systems became maintenance nightmares that consumed
entire sprints just keeping the lights on.

Your mid-sized company is probably making the same mistake right now.

You're building enterprise infrastructure for startup-sized teams, and it's quietly cutting your velocity.

Today we're talking about platform engineering that actually fits companies in the missing
middle.

Not vendor pitches, not conference driven development, just what works when you've got 200
to a thousand employees and need to ship products, not maintain infrastructure.

I'm Tom Barber and this is Engineering Evolved.

So welcome to this episode of Engineering Evolved.

This is episode 11.

I'm Tom Barber.

And today, as you can see, I'm not in the office.

I'm in the UK on a boat.

So I do own a boat in the UK.

I enjoy sailing.

And I'm on my way to help out a customer in Poland and work with their team.

And so this is a useful stopping off point on the way over from the States.

This show is for engineering leaders at midsize companies who are tired of advice that only works for enterprises or startups.

Today's topic is platform engineering, but we're not doing the usual CNCF project bingo
card.

We're talking about what actually works when you have real constraints, limited headcount,
finite budgets, and a business that needs features shipped, not infrastructure admired.

We'll cover the build versus buy framework that helps you make platform decisions without regret, how to right-size your infrastructure so it serves your team instead of consuming it, and internal tooling that actually matters versus the stuff that just looks good in a tech blog.

By the end of this episode, you'll have a framework you can use Monday morning to evaluate
every platform decision your team is facing.

So let's get into it.

So before we dive into frameworks and decisions, I need to tell you a story from my time at NASA's Jet Propulsion Laboratory.

Not because it's a success story, but because it perfectly illustrates what goes wrong
when smart people over-engineer platform infrastructure.

We had a small research team.

It was three or four engineers working on a data processing pipeline for mission analysis.

Someone on the team had come in from a big tech company and was excited to bring modern practices to JPL, which in practice meant Kubernetes, service mesh, custom operators, and a lot of YAML.

But here's the problem. The system worked, and it was technically impressive.

We could scale containers, route traffic dynamically and handle complex deployment
scenarios.

The infrastructure itself was absolutely enterprise grade for about six months.

Then the funding model for that particular project shifted. NASA operates on mission cycles and congressional budgets.

So projects don't just gradually wind down, they hard stop.

And then half the team got reassigned to other missions.

The person who built the Kubernetes setup moved to a different lab.

And suddenly we had just one or two engineers trying to maintain this intricate
orchestration system that was designed for a problem we didn't actually have.

Deployments that should have taken five minutes took an afternoon because someone had to remember how the custom operator worked.

Debugging a service issue meant tracing through three layers of abstraction.

And so we ended up spending more time maintaining the platform than we spent building the
features for the actual science mission.

Eventually we ripped it out, replaced it with standard Docker containers running on VMs
with a simple bash script for deployment.

It was boring.

It was unsexy, but it took 20 minutes to onboard a new engineer instead of two weeks.

And that experience taught me something crucial.

The sophistication of your infrastructure should match the sophistication of your team's needs and capacity, not the sophistication of the tools available, and definitely, definitely not the sophistication you think you might need someday.

Mid-sized companies face this constantly: you've got engineering leaders who came from Google or Facebook trying to replicate what worked there.

You've got vendors selling you enterprise platforms.

You've got developers who read tech blogs and want to work with the latest tools.

But you're not Google.

You're not a startup that can rip and replace infrastructure every six months.

You're in the messy middle and you need infrastructure that serves your actual
constraints, not your aspirational ones.

So let's talk about how to make that happen.

But before that, a word from our sponsor.

Okay then, so first let's define what we mean by platform engineering in the context of
mid-sized companies.

Platform engineering isn't about adopting every project in the Cloud Native Computing Foundation landscape.

It's not about having the same tooling as companies with thousands of engineers.

Platform engineering for mid-sized companies is about building the minimum viable infrastructure that lets your product team ship faster, with less operational burden and fewer surprises.

That last part, fewer surprises, is critical.

When you're a mid-sized company, surprises are expensive.

You don't have the depth of expertise to handle every edge case.

You don't have on-call rotations that can absorb infrastructure incidents without
impacting feature delivery.

You can't afford to have your platform become the product.

Here's the mental model I want you to adopt.

Your platform engineering efforts should be invisible when they're working.

The best platform is the one your developers barely think about because it just works.

So this means making different trade-offs than enterprises make.

Enterprises optimize for scale, flexibility, and handling every possible edge case.

You need to optimize for simplicity, maintainability, and speed to competence.

When a new engineer joins your team, how fast can they deploy something to production?

If the answer is more than a day, your platform is too complex.

Mid-sized companies also face a unique challenge with internal tools.

You've outgrown the startup phase where everyone just uses the same three SaaS products, but you haven't reached the enterprise scale where you have dedicated teams building internal developer portals and service catalogs.

So you're stuck in that awkward middle where you need internal tooling, but you can't
afford to build everything custom.

And that's where the build versus buy framework becomes essential.

Before we get there though, let's talk about the biggest platform engineering mistake I
see mid-size companies make.

So let's address the elephant in the room, Kubernetes.

And I'll put my hand up and say, I use it in various places where the fit seems right.

So I'm not saying that Kubernetes is bad. But I am saying that it's almost certainly overkill for your mid-sized company, and the operational burden it creates will quietly destroy your team's velocity.

Here's what happens.

You hire a talented engineer who worked at a company running thousands of microservices on
Kubernetes.

They're horrified to find you're deploying to VMs with shell scripts.

They pitch the business on modernizing your infrastructure.

Everyone nods along because, well, Kubernetes is the industry standard, right?

Six months later, you've got a cluster running.

Maybe you've migrated one service, but now you need to understand pod networking, ingress controllers, persistent volumes, RBAC policies, and how to debug a CrashLoopBackOff.

Your infrastructure team of two people are spending half their time just keeping
Kubernetes healthy.

Meanwhile, your competitors are shipping features because they deployed to a VM with
Docker and moved on with their lives.

Here's the uncomfortable truth.

Kubernetes solves problems you probably don't have yet.

It's designed for organizations that need to run hundreds or thousands of services, scale workloads dynamically across large clusters, handle multi-tenant infrastructure, or operate in multiple regions or clouds simultaneously.

If you're a 500 person company running 20 microservices, you don't have those problems.

What you have is a small team that needs to ship features predictably without getting paged at 3am.

So what should you do instead?

Deploy things to VMs.

Use Docker for packaging because containers are genuinely useful for dependency management
and deployment consistency.

But you don't need orchestration.

You don't need service mesh.

You don't need operators managing operators.

Here's a pattern that works incredibly well for midsize companies.

You've got an application in a Docker container, which is great.

That container runs on a VM managed by your cloud provider, an EC2 instance, a compute
engine instance, whatever.

You use infrastructure as code.

So Terraform, Bicep, or something along those lines to define that VM and its
configuration.

No clicking around in consoles trying to remember what settings you've changed.

For deployments, you write scripts, bash is fine, Python if you prefer.

The script pulls the new container image, stops the old container, starts the new one,
runs health checks.

You version control these scripts in your repository right alongside your application code.
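
To make that concrete, here's a minimal sketch of what such a script might look like, assuming a single container per VM. The registry URL, container name, port, and health check path are all placeholders; adapt them to your own setup.

```bash
#!/usr/bin/env bash
# deploy.sh <image-tag> -- minimal, illustrative deployment script (names and ports are placeholders)
set -euo pipefail

TAG="${1:?usage: deploy.sh <image-tag>}"
IMAGE="registry.example.com/myapp:${TAG}"      # hypothetical registry and image
CONTAINER="myapp"
HEALTH_URL="http://localhost:8080/healthz"     # hypothetical health check endpoint

echo "Pulling ${IMAGE}"
docker pull "${IMAGE}"

echo "Replacing running container"
docker stop "${CONTAINER}" 2>/dev/null || true   # ignore errors if nothing is running yet
docker rm "${CONTAINER}" 2>/dev/null || true
docker run -d --name "${CONTAINER}" --restart unless-stopped -p 8080:8080 "${IMAGE}"

echo "Waiting for health check"
for _ in $(seq 1 10); do
  if curl -fsS "${HEALTH_URL}" > /dev/null; then
    echo "Deploy succeeded"
    exit 0
  fi
  sleep 3
done

echo "Health check never passed; investigate before retrying" >&2
exit 1
```

That's the whole thing, and a rollback is just the same script run with the previous tag.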

GitHub Actions or GitLab CI runs your tests, builds your container, pushes it to a
registry, and runs the deployment script.

That whole pipeline is visible in a single YAML file that junior engineers can understand.

Is this sophisticated?

No, not really.

Does it work reliably for companies shipping millions of dollars in revenue?

Absolutely.

The key insight here is that operational burden compounds over time. That Kubernetes cluster might seem manageable today, when you've got three engineers who understand it.

But what happens when those engineers leave?

What happens when you need to upgrade your cluster version?

What happens when a pod gets stuck in a weird state at 2 a.m. and nobody on call knows how to debug it?

Simple infrastructure has a hidden superpower.

It's maintainable by a team that you'll have next year, not just the team that you have
today.

Now, of course, there's a counter argument here.

Someone always says, but Tom, we need to learn these modern tools to attract talent.

Engineers want to work with Kubernetes.

So I have two responses to that.

First, engineers who only want to work somewhere because you have a specific tool are optimizing for resume building, not for solving your business problems.

You want engineers who care about impact and not infrastructure tourism.

Second, mid-sized companies have an advantage here.

You can offer engineers ownership and impact they'd never get at a large company.

At Google, you work on one microservice in a sea of thousands.

At your company, an engineer can own an entire product area and see their work directly affect revenue.

And that's a much better retention story than "we use Kubernetes".

So let's be clear: if you're at a point where you genuinely need Kubernetes, where you've got the team size, the operational maturity and the scale demands, then absolutely use it.

But if you're adopting it because it feels like that's what serious companies do, you're making a mistake that will cost you months of engineering time, and possibly some of the best people, who would rather ship products than maintain infrastructure.

Now let's talk about the framework that will save you from most platform engineering
mistakes.

When to build versus when to buy.

And the decision comes up constantly.

Do we build our own deployment pipeline or do we use a managed service?

Do we build internal admin tools or buy something off the shelf?

Do we build our own observability stack or use a vendor?

Now most companies make these decisions emotionally.

Someone on the team gets excited about building something or someone else is frustrated
with a vendor and wants to rip it out.

But emotional decisions lead to regret about six months later, when you realize that the thing you built needs maintenance or the thing you bought doesn't quite fit the workflow.

So here's a framework that I use and it's based on four factors.

Maintenance burden, time to value, strategic differentiation, and exit costs.

Factor one, maintenance burden.

Every piece of software you build creates an ongoing maintenance obligation.

Dependencies need updates, bugs need fixes, new employees need documentation and training.

If you build a deployment tool, you're now in the business of maintaining a deployment tool forever, or until you sunset it, which is its own project.

The question you need to ask is, can your team absorb this maintenance burden without
impacting product velocity?

Here's a heuristic.

If the thing that you're building will require more than an engineer-week per quarter to maintain, you should be very confident about why you're building it instead of buying it.

Let me give you a concrete example.

Internal admin tools seem like an easy build.

You need a way for your operations team to manage customer accounts, run reports, trigger
background jobs.

How hard can it be?

But here's what actually happens.

You build version one in a week and it works.

Then you need to add authentication.

Then you need to add audit logging because of compliance.

Then you need to add role-based access control.

Then the UI needs to be mobile friendly.

Then you need to add export functionality.

Then you realize your database queries are slow and you need to optimize them.

Eighteen months later, you've got a custom internal tool that has consumed six months of engineering time and still doesn't work as well as something like Retool would have given you on day one.

And that's not just hypothetical.

I've seen this pattern at multiple companies, including ones I've worked at.

Maintenance burden is often invisible when you're making the build decision.

You're only seeing the exciting part, building something new.

You don't see the boring part, which is maintaining it for years whilst your business
priorities shift.

Factor two is time to value.

How quickly do you need this capability in production, delivering value?

If the answer is now, you almost always buy.

If the answer is we can wait six months to get exactly what we want, you might build.

Mid-sized companies usually don't have the luxury of waiting.

Your competitors aren't waiting, your customers aren't waiting, your board isn't waiting.

This is where hosted services and SaaS platforms shine.

You can have authentication set up in hours with Auth0 instead of spending weeks building it yourself, along with all the additional security work that entails.

You can have observability running today with Datadog instead of three months with a
custom stack.

The calculus changes if you've got very specific requirements that no vendor solves or if
you're in a domain where the tool itself is strategic.

But for most platform infrastructure, CI/CD, monitoring, log aggregation, internal tools, someone has already solved your problem better than you will in your first iteration.

Factor three is strategic differentiation.

And this is the question that separates good build versus buy decisions from great ones.

Does the capability differentiate your business or is it table stakes?

Your deployment pipeline, table stakes.

Every software company deploys code.

How you deploy doesn't make customers choose you over your competitors.

Your internal fraud detection system, if you're a fintech company, that might be
strategic.

It might directly impact your unit economics and customer experience in ways that generic
tools can't match.

Your employee onboarding tool, probably not strategic.

HR workflows are important, but they're not why customers pay you.

So here's a test.

If you do this exceptionally well, do customers notice and care?

If the answer to that question is yes, maybe consider building.

If it's no, just go and buy something off the shelf.

Most infrastructure is not strategic.

It's necessary, but it's not differentiating.

That CI/CD pipeline you're thinking about building? GitHub Actions exists.

Your competitors are already using it.

You building a custom alternative doesn't make your product better.

Strategic differentiation also has a team size component.

If you're a 300-person company, you probably can't afford to have engineers building internal tools that don't directly support your core product.

Every engineer you dedicate to platform work is an engineer not building features
customers will pay for.

The math changes at 3000 people, but you're not there yet.

Factor four is exit costs.

The last factor is often overlooked.

How hard is it to change your mind later?

Some platform decisions lock you in.

If you build a custom deployment system that's deeply integrated into your monorepo and CI/CD workflows, migrating away from that later will be a six-month project.

If you adopt GitHub Actions, switching to GitLab CI or Buildkite later is a few weeks of work, maximum.

The configuration is similar.

The concepts are the same.

So the exit cost is low.

This factor should tilt you toward buying, not building, especially early in your platform
journey.

Buy the thing that has the lowest exit cost, even if it's not perfect.

You can always replace it later when you understand your needs better.

Custom builds have infinite exit costs because you can't just unsubscribe.

You have to actively build a replacement.

So how do you use these four factors?

Here's the decision matrix.

Build if the maintenance burden is low relative to team capacity, you have the time to build thoughtfully, the capability provides strategic differentiation, and the buying options have high lock-in or exit costs.

You should buy if the maintenance burden would strain your team, you need value quickly, the capability is table stakes rather than strategic, and multiple vendors exist with reasonable exit costs.

And you should defer if you're not sure yet whether you need this capability, your requirements are still evolving rapidly, or the market for this capability is changing quickly.

That last category, defer, is important.

Not every platform decision needs to be made immediately.

Sometimes the right answer is to live with a manual process for another quarter while you
figure out what you actually need.

Okay, let's talk about internal tools specifically because this is where mid-sized
companies waste an enormous amount of time.

You've probably got requests piling up for internal tools.

The ops team wants a better admin panel.

The support team wants custom dashboards.

The data team wants a better way to run queries.

Engineering wants a service catalog.

Most of these internal tools don't need to be built.

And that's just a hard fact.

They need to be bought or configured.

Platforms like Retool exist specifically to solve this problem.

You can build internal tools by connecting your database, APIs and services with drag and
drop interfaces.

What would take weeks to build custom takes hours in Retool.

Is it perfect?

No.

Will your engineers complain that it's not as elegant as something you'd build?

Probably.

But will it deliver 80% of the value for 5% of the effort?

Absolutely.

And that's what matters.

The key insight here is that internal tools don't need to be beautiful.

They need to be functional.

Your operations team doesn't care if the admin panel uses the latest React framework.

They care if they can reset a user's password without filling in an engineering ticket.

Your internal tools should have exactly one design principle.

Reduce the number of times people need to ask an engineer to do something manual.

That's it.

If your internal tool does that, it's good enough.

Now there are internal tools worth building and here's a list.

Developer productivity tools that are genuinely custom to your domain.

If you've got a complex deployment process that involves coordinating across multiple services, environments and teams, and that process is specific to your architecture, then yes, build tooling for that.

Integration glue between your specific systems.

Sometimes you need to connect three SaaS products in a way that's unique to your workflow.

A bit of custom code to shuttle data between Salesforce, your billing system and your data
warehouse.

Absolutely reasonable.

Performance critical paths where vendors are too slow or expensive.

So if you're processing millions of transactions and your observability vendor is charging you 50K a month, there may be a case for building something lightweight, but be realistic about whether you'll actually save money after accounting for engineering time.

Everything else, buy it, configure it, or use an internal tool platform.

Use Retool or similar platforms for anything user facing that needs a UI.

Use scripts and CLI tools for automation that runs in the background.

Use your cloud provider's native tools for infrastructure management.

Your internal tooling strategy should be boring, predictable, maintainable by whoever's on
call.

And now a quick word from our sponsor.

Okay, now let's talk about something that's not sexy, but will save you more time than any
platform tool.

Keeping your infrastructure organized, accessible and documented.

And this is something that can kill mid-size companies.

Knowledge hoarding.

One engineer knows how the deployment works.

Another engineer knows where the secrets are stored.

The third engineer knows how to debug production issues.

When those engineers leave, and they will leave, that's just how careers work, your platform knowledge leaves with them.

So you need two things, organization and documentation.

Every piece of your platform should have an obvious home.

Your infrastructure code lives in a specific repository with a specific structure.

Your deployment scripts live in a specific place.

Your run books live in a specific place.

If someone needs to deploy a service, they should know exactly where to look.

Not ask around, not check Slack history, not dig through three different repos.

They should look in the obvious place and find what they need.

But this requires discipline, of course.

When someone writes a new deployment script, it goes in the standard location.

When someone updates infrastructure, they update the infrastructure repository.

No one-off scripts living in someone's home directory, no critical documentation in
personal notes.
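
As a purely illustrative sketch, with hypothetical directory names, giving everything an obvious home can be as simple as a handful of standard locations in one repository:

```bash
# Illustrative only: one way to give every piece of the platform an obvious home.
mkdir -p infrastructure       # Terraform (or similar) defining VMs, networking and IAM
mkdir -p deploy               # deploy.sh, rollback.sh, versioned next to the application code
mkdir -p runbooks             # deploying.md, rotating-credentials.md, troubleshooting.md
mkdir -p .github/workflows    # ci.yml: test, build, push, deploy
```

The exact names don't matter; the fact that everyone knows them does.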

Documentation, of course, means writing things down. And I mean all the things.
Every deployment process should have a run book.

Every piece of infrastructure should have a README explaining what it does, why it exists, and how to change it.

Good documentation has three levels.

The quick start, how do I deploy this service?

How do I update this infrastructure?

Give me the commands I need to run.

The explanation, why is this configured this way?

What problems does this solve?

What did we try before that didn't work?

And the reference, where are all the pieces?

What are all the configuration options?

How do I troubleshoot common issues?

The thing is, of course, most teams only write the first level if they write anything at
all.

But the second and third levels are what prevents knowledge loss when people leave.
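
If it helps, a runbook skeleton that covers all three levels can be tiny. A hypothetical example, with the file name, commands, and angle-bracketed details as placeholders:

```bash
mkdir -p runbooks
cat > runbooks/deploying.md <<'EOF'
# Deploying the web service

## Quick start
Run ./deploy/deploy.sh <image-tag> from the repo root.

## Why it works this way
One container per VM, no orchestrator. We tried <previous approach> and moved off it because <reason>.

## Reference and troubleshooting
- Images live in <registry>, logs live in <log location>.
- "Health check never passed": check the container logs with docker logs myapp.
EOF
```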

So here's a rule: when someone on your team has to figure out something about your infrastructure that wasn't documented, they have to document it before moving on to the next task.

There are no exceptions.

Someone spent two hours figuring out why deployment failed in staging.

They write a troubleshooting guide.

Someone had to learn how to rotate database credentials.

They document the process.

And of course this compounds over time.

Six months later, you've got a knowledge base that actually reflects how your
infrastructure works, not how you wish it worked or how it worked two years ago.

Now, a quick word about access control and secret management. This is part of organization, but it's important enough to call out specifically.

Your platform needs clear access control that's documented and auditable.
Who can deploy to production?

Who can access production databases and who can modify infrastructure?

The answers to these questions should be written down and enforced by tooling, not just social convention.

Use your cloud provider's IAM properly. Use secret management tools like AWS Secrets Manager or HashiCorp Vault. Make it impossible to accidentally do dangerous things, and make it auditable (I can't say that word this evening) when someone does do authorized things.

This isn't about trust, it's about reducing blast radius when something goes wrong.
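
For instance, rather than keeping credentials in the deploy script or on the VM, the script might fetch them at deploy time. A rough sketch, assuming AWS Secrets Manager and a hypothetical secret name:

```bash
#!/usr/bin/env bash
# Illustrative only: fetch a credential at deploy time instead of storing it in the repo or on disk.
set -euo pipefail

# "myapp/prod/db-password" is a hypothetical secret name; whoever runs this needs IAM permission to read it.
DB_PASSWORD="$(aws secretsmanager get-secret-value \
  --secret-id myapp/prod/db-password \
  --query SecretString \
  --output text)"

# Hand it to the container as an environment variable; it never appears in version control.
docker run -d --name myapp -e DB_PASSWORD="${DB_PASSWORD}" registry.example.com/myapp:latest
```

Access to that secret is then governed by IAM and shows up in your cloud provider's audit logs, which is exactly the auditability point.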

So finally, let's talk about automation and AI in the context of platform work.

The principle here is simple.

Automate anything that you do more than twice.

Deployments, automated.

Infrastructure changes, automated.

Database migrations, automated.

Security scans, automated.

Every manual process is an opportunity for human error and an opportunity for knowledge to
become siloed in one person's brain.

Of course, the good news is that automation tooling has never been better.

GitHub Actions, GitLab CI, CircleCI maybe, Jenkins if you're a sadist (I say that in jest; of course Jenkins is still very good).

All of these make it straightforward to set up pipelines that handle most of your
automation needs.

The key is to start simple.

Don't build a complex deployment system with 50 configuration options on day one.

Build the thing that automates your most common case and then iterate.

Your deployment scripts should probably do this.

Run tests, build the container, push to a registry, SSH to the production VM, pull the new container, stop the old container, start the new container, run a health check, send a notification.

And that's probably 50 lines of bash or a hundred lines of Python.

It doesn't need to be more sophisticated than that until you have evidence that you need more sophistication.

Now, of course, AI and LLMs are changing automation in interesting ways, because you can use an AI to help write these scripts instead of spending an afternoon fighting with AWS CLI syntax.

You describe what you want and let an LLM generate the first draft.

You review it, you test it, you refine it.

And that's a totally legitimate use of AI that saves time without introducing complexity.

You can use AI to help write documentation: take your deployment script, feed it to an LLM, and ask it to generate a runbook.

You need to edit and verify, but you've got a starting point that's better than a blank
page.

You can use AI to help debug issues.

When something goes wrong in production, you can paste error logs into an LLM and get
suggestions for what might be happening.

It won't replace understanding your systems, but it can save time in the diagnostic
process.

The key with AI is to use it as a tool to reduce the menial parts of platform work, script writing, documentation, debugging suggestions, without letting it make critical decisions about your architecture.

And whilst it's getting better, AI is terrible at architecture decisions because it doesn't understand your specific constraints, your team's capabilities, or your business context.

But it's great at generating boilerplate, writing documentation, and suggesting solutions
to common problems.

So use it strategically.

Let it handle the grunt work so your engineers can focus on decisions that actually
matter.

Of course, a word of caution about over-automation, because you can have too much.

I've seen companies build elaborate automation systems that handle edge cases that almost never happen.

The automation becomes more complex than the original manual process.

The test for whether to automate something: will this save time over the next year, accounting for the time to build and maintain the automation?

If you do something once a month and it takes 10 minutes manually, that's two hours a
year.

Building automation that takes four hours to create and an hour per quarter to maintain is
a net loss.

Just do it manually.

But if you do something daily and it takes 30 minutes, that's 180 hours a year.

So spending a day to automate it is a massive win.
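
If you want to make that test explicit, the arithmetic fits in a throwaway sketch like this; the numbers are just the ones from the example and are meant to be swapped for your own:

```bash
#!/usr/bin/env bash
# Illustrative break-even check: is automating this task worth it over a year?
runs_per_year=365           # the task happens daily
manual_minutes=30           # time per manual run
build_hours=8               # roughly a day to build the automation
maintain_hours_per_year=4   # occasional upkeep

manual_hours=$(( runs_per_year * manual_minutes / 60 ))
automation_hours=$(( build_hours + maintain_hours_per_year ))

echo "Manual: ${manual_hours} hours/year, automation: ${automation_hours} hours/year"
if [ "${manual_hours}" -gt "${automation_hours}" ]; then
  echo "Automate it."
else
  echo "Keep doing it manually."
fi
```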

So be honest about the frequency and the pain of your manual processes before automating them.

Let's talk about team sizing for platform work because this is where mid-size companies
often get the math wrong.

And here's the mistake: you size the platform team based on how many engineers you have, not on what that team is delivering.

Someone says, we need a platform team of five people because we have 50 engineers.

But what are these five people doing?

Are they delivering capabilities that make those 50 engineers dramatically more productive
or are they maintaining infrastructure that exists because you chose complex tools?

The right way to think about platform team sizing is in terms of maintenance burden and new capability delivery.

For every piece of platform infrastructure you operate, someone needs to maintain it.

That maintenance has a cost: security patches, version upgrades, incident response, onboarding, and answering questions from new engineers.

So if you're running Kubernetes, you need someone who can handle cluster upgrades,
networking issues, and all that operational complexity that comes with orchestration.

And that's probably half a person's time, minimum.

If you're running VMs with Docker, you need someone who can handle VM management and
container deployments.

That's maybe 10% of a person's time.

The difference matters.

It's the difference between a two-person platform team and a five-person platform team.

It's the difference between spending 20% of your platform budget on maintenance and spending 50%.

Now layer in new capability delivery.

If your platform team is spending half of their time on maintenance, you can only spend
the other half building new capabilities.

You need to be ruthless about this calculation.

So look at every piece of infrastructure you operate and ask, what is the ongoing
maintenance cost?

What is the cost to add new features or capabilities in the future?

This applies to internal products as much as external facing products.

That internal admin tool you built, it needs maintenance.

It needs updates when your data model changes.

It needs new features when the operations team workflow changes.

If you're not accounting for those costs, you're accumulating technical debt in your
internal infrastructure.

For every engineer you dedicate to building new platform capabilities, you need at least
0.25 engineers dedicated to maintaining existing capabilities.

If you've got four people building, you'll need one person maintaining.

If your ratio is worse than that, if maintenance is consuming more than a quarter of your capacity, you've accumulated too much infrastructure complexity and it's time to simplify.

All right, we've covered a lot of ground, so let's bring it together.

Platform engineering for mid-sized companies is not about adopting the same tools as
companies with 10 times your headcount.

It's about building infrastructure that serves your actual needs with the actual team you
have.

This means being skeptical of complexity.

It means choosing boring, reliable tools over exciting, sophisticated ones.

It means buying rather than building, unless of course you have a really good reason.

And it means being honest about the ongoing costs of everything you operate.

I've put together a comprehensive build versus buy decision framework that walks you through every factor we've discussed today.

It includes a decision matrix for evaluating build versus buy for any platform capability, team sizing calculators that help you understand maintenance burden, a platform complexity audit that helps you identify where you've over-engineered, and right-sizing guidance for common platform components.

You can download it at engineeringevolve.com

It's free, it's immediately useful, and it will save you months of regret when you're
making platform decisions.

If you found this episode useful, share it with other engineering leaders who are fighting
the same battles.

We're trying to create a community of people who are willing to say, maybe we don't need
Kubernetes and actually mean it.

Join me in the next episode where we dig into more engineering insights that will be practical, opinionated, and full of stories about what actually works versus what looks good in architecture diagrams.

Until then, keep your infrastructure boring, keep your documentation current, and keep
shipping.

I'm Tom Barber.

This has been Engineering Evolved.

Thanks for listening.

