Cloud Migration Without the Chaos - A Product Manager's Approach
S1 E7


Tom Barber (00:00)
Let me tell you about two cloud migrations. First, TSB Bank. They picked a weekend in April 2018, had 1,400 staff on hand, months of planning, a Sunday cutover, and by Monday morning, 5.2 million customers couldn't access their accounts. The final cost? Over £1 billion. The CEO resigned. Regulators fined them £48.6 million,

and the CIO got personally fined £81,000. Second, Netflix. Seven years to migrate everything to AWS. That's seven full years. And you know what happened during that migration? They grew from 12 million to 93 million subscribers. They expanded to 130 countries and built the world's largest streaming platform. Same goal, move to the cloud, completely opposite results.

The difference wasn't the technology, it wasn't the budget. It was treating migration like a product launch instead of an IT project.

Welcome back to Engineering Evolved. I'm your host, Tom Barber, and this is episode seven. Today we're talking about something that keeps product managers and technical leaders up at night, and that is cloud migration. Here's the thing that nobody wants to say out loud: half of all cloud transformations are abject failures. Not disappointing, not worse than expected. Abject failures. And that's from a 2024 study by HFS Research and EY.

And honestly, I think that number is generous because it only counts the ones that fail so badly they couldn't hide it. Here's what's interesting though. The companies that succeed aren't the ones with the biggest budgets or the fanciest tech. They're the ones who treat infrastructure migration like launching a new product with phases, with metrics, with rollback plans and with actual product managers owning the outcome. So today we're going hands on.

No fluffy "the cloud is the future" talk. We're building you a migration framework you can actually use. What we'll cover: why slow beats fast every single time, the deployment strategies that actually work when things go sideways, what to monitor when you're running two systems at once, and what happens when AWS, Google or Azure just falls over. Sounds good? Let's get into it.

So here's the first thing we need to own up to about the big bang migration: it's dead. I want you to stop even considering it. I know your CTO, if you're not running stuff in the cloud yet, is standing in a conference room with a beautiful Gantt chart showing a three day maintenance window. I know your AWS rep is nodding enthusiastically about how streamlined the process is. I know finance is thrilled about getting those data center costs off the books by next quarter.

And this may not even be about moving into the cloud for the first time. It may be that you're moving from one cloud provider to another or doing some other type of logistical switch. They all take planning and execution. But if you're trying to do it in one big bang, just don't. Because you know who loves big bang migrations? Consultants billing by the crisis hour. So let me ask you something as well.

Have you ever shipped a major product feature without beta testing, without a phased rollout, without the ability to turn it off if something breaks? For most of you, of course not. That would be insane. So why would you do that with your entire infrastructure?

So here's what does work: the Strangler Fig pattern. And yes, I know it sounds like something from a horror movie, but stay with me here. The strangler fig is a tree that grows around another tree, gradually taking over until eventually the host tree is gone. And that's exactly how you migrate systems. You build a facade layer, think of it as a smart proxy that sits in front of everything. Initially, 100% of your traffic goes to the legacy system.

Then you build one new service in the cloud and the facade routes that traffic to the new service. Everything else stays on legacy, or in its old place, continuing to operate completely normally. The beauty of this, of course, is that at every single point you have a working system. Not "it'll work once we're done", it all works right now. Pinterest, for example,

migrated over a thousand microservices this way. They started with the easy stuff, the services with clear boundaries and minimal dependencies. They got those working, they learned the patterns, they built the confidence, and then they tackled the hard stuff.
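To make that facade idea concrete, here's a minimal sketch of a strangler-fig routing layer, in Python. The URLs and path prefixes are placeholders, not anything Pinterest actually ran; the point is just that the allow-list of migrated services grows one entry at a time while everything else defaults to legacy.

```python
# Minimal strangler-fig facade sketch: route an allow-list of paths to the
# new cloud service, everything else to the legacy system. Names and URLs
# are illustrative placeholders, not a real deployment.

LEGACY_BASE = "https://legacy.internal.example.com"
CLOUD_BASE = "https://api.cloud.example.com"

# Grow this set one service at a time as each new cloud service goes live.
MIGRATED_PREFIXES = {
    "/recommendations",   # first service: clear boundary, few dependencies
    "/search",            # added after the first one proved itself
}

def route(path: str) -> str:
    """Return the base URL that should serve this request."""
    if any(path.startswith(prefix) for prefix in MIGRATED_PREFIXES):
        return CLOUD_BASE
    return LEGACY_BASE  # default: untouched traffic keeps working as before

if __name__ == "__main__":
    for p in ["/recommendations/today", "/checkout/cart", "/search?q=cloud"]:
        print(p, "->", route(p))
```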

Netflix, like I said, took seven years using this approach. And before you say "I don't have seven years", they were of course serving customers the whole time, growing, expanding internationally and launching features. So unless you don't expect your company to be around in seven years, which, okay, fair enough, you do have seven years to deal with the migration.

Compare that to TSB's weekend migration. They had all the staff, they had a plan, they had the cutover date. What they didn't have was a way to go back when things went wrong. And so by Sunday night, the entire executive team was in the office watching their customers' money disappear into a black hole.

So here's my question to you. Would you rather spend 18 months with a working system throughout, or six months of aggressive timeline followed by 12 months of emergency fixes? Because that's the actual choice. The data is pretty brutal when you look at this. A huge number of cloud migrations fail, and the ones that fail are almost always the ones that try to go fast. So now let's...

Let me give you the tactical framework. We need to categorize every application using what AWS calls the six Rs. First, rehost, also called lift and shift. You're literally just moving the VM to the cloud. It's fast, but you're paying cloud prices for data center architecture. Then replatform, where you make minimal changes to optimize for cloud. Maybe you swap

the database for RDS. Maybe you use managed load balancers instead of your own. Pinterest did this brilliantly: they containerized everything but didn't rewrite it. Next, refactor. This is the Netflix approach, where you rebuild everything cloud native, microservices, auto scaling, the whole nine yards. It takes forever, but the payoff can be huge. Then repurchase. You could, of course, go and buy a SaaS product. Your custom-built email server from 2010?

Gmail exists, just go and use it. Retire, and this one is my favorite, just turn it off. You know that reporting system that people use twice a year? Kill it. Export the data to S3 and move on. I saw someone on LinkedIn recently who said they were amazed that when they migrated a BI platform, they didn't migrate an awful lot of the dashboards and reports in there, and no one said a word.

I think that's indicative of a lot of systems inside of enterprises: people ask for these things, they get built, the expense goes into maintaining them, but then the usage is almost zero.

The last one of course in the six Rs is retain, so keep it on premise. Sometimes you have regulatory requirements, sometimes the math just doesn't work, and that's also okay. But the mistake most companies make is that they choose refactor when replatform would work. They over engineer.

So my take is that most companies should replatform 70% of their apps, rehost 20%, and retire or repurchase the rest. Refactor should be reserved for your actual competitive differentiators. If it's not making you money or saving you significant money, don't refactor it.
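If you want to make that portfolio split explicit, something like the sketch below works as a starting point. The app names and strategies are made up; the useful part is forcing every application into exactly one of the six Rs and sanity-checking the mix.

```python
# Sketch of a 6Rs portfolio inventory. App names are made up; the point is
# forcing every application into exactly one bucket and checking the mix.
from collections import Counter

SIX_RS = {"rehost", "replatform", "refactor", "repurchase", "retire", "retain"}

portfolio = {
    "checkout-api": "replatform",         # swap self-managed DB for RDS
    "reporting-2010": "retire",           # used twice a year: export and kill
    "email-server": "repurchase",         # a SaaS product already exists
    "recommendation-engine": "refactor",  # actual competitive differentiator
    "hr-portal": "rehost",                # lift and shift, revisit later
    "mainframe-ledger": "retain",         # regulatory requirement, stays put
}

assert all(r in SIX_RS for r in portfolio.values()), "unknown migration strategy"

counts = Counter(portfolio.values())
total = len(portfolio)
for strategy, n in counts.most_common():
    print(f"{strategy:12s} {n:2d} apps  ({n / total:.0%})")

# Rough sanity check against the 70/20/10 rule of thumb from the episode.
if counts["refactor"] / total > 0.2:
    print("Warning: a lot of refactors. Are they all differentiators?")
```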

Okay, so this is where it actually gets interesting. This is the part where we stop being infrastructure people and start being product people. Spotify figured this out in 2020 and they wrote the playbook. Now, of course, as we have already said in other podcasts, the Spotify model does not work for everyone and it doesn't work for a lot of medium sized companies. But at the same time, there's a lot we can learn from the stuff that they've done and then apply it at smaller scale. So they had...

300-plus engineering squads all drowning in competing migration priorities. Data center to cloud, old CI/CD to new, legacy monitoring to modern observability, and it was chaos. So they did something radical: they assigned actual product managers to migrations. Not project managers, product managers, people whose job it was to make the migration successful with all the same rigor as launching Spotify Premium.

So let's see what that actually means. First, they were ruthless at prioritizing. They built a single company-wide migrations map. One source of truth, every migration had to justify its existence with cost estimates and business impact. Just think about that. How many migrations are happening in your company right now that nobody can explain why they're important? Second,

they treated it like a product launch. That means alpha, beta and production phases. Alpha is internal only: your SRE team, your most technical engineers, people who can troubleshoot when things break. Beta then opens it up to wider but friendly users, teams that are excited about the migration, teams that will give you good feedback instead of just complaining. Then production is for everyone, but,

and this bit is critical, you don't force everyone to cut over. You open it up and let teams migrate when they're ready. Provide incentives and gamify it. Spotify built a visualization showing each tribe's progress: red bubbles for data center, green bubbles for cloud. Suddenly every tribe could see their progress compared to others and healthy competition kicked in. Of course, you may not have the scale to do stuff like that, but there is obviously a compelling reason to monitor and understand how far through the process you are.

And so gamifying it may be one way of doing it.

So here's a concrete example. PlanGrid migrated their feature flag system from their homegrown Flipper to LaunchDarkly, which, if you've used feature flags before, you probably know about. Phase one, dual write to both systems: every flag goes to Flipper and to LaunchDarkly. Phase two, dual read: LaunchDarkly is primary, Flipper is the fallback. If LaunchDarkly is down, you gracefully fall back. Phase three, the beta playground:

the new UI is available, but low stakes, so people can try it. And then phase four, cut over completely, make the old Slack channel read-only, and you're done. That's it. Average response time dropped by 50%. Latency at the 95th and 99th percentiles improved dramatically, and users barely noticed the infrastructure changed. That's product thinking applied to infrastructure.
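For flavour, here's roughly what those dual-write and dual-read phases look like in application code. This is a hedged sketch with stand-in flag clients, not PlanGrid's implementation or the real LaunchDarkly SDK.

```python
# Sketch of the dual-write / dual-read phases for a feature-flag migration.
# `OldFlags` and `NewFlags` stand in for the legacy and new providers; the
# real clients would be whatever SDKs you actually use.

class OldFlags:                      # legacy homegrown system
    def __init__(self): self.flags = {}
    def set(self, name, value): self.flags[name] = value
    def get(self, name): return self.flags.get(name, False)

class NewFlags(OldFlags):            # new provider, same interface here
    pass

old, new = OldFlags(), NewFlags()

def set_flag(name: str, value: bool) -> None:
    """Phase 1, dual write: every change goes to both systems."""
    new.set(name, value)
    old.set(name, value)

def get_flag(name: str) -> bool:
    """Phase 2, dual read: new system is primary, legacy is the fallback."""
    try:
        return new.get(name)
    except Exception:
        return old.get(name)         # graceful fallback if the new one is down

set_flag("new-checkout", True)
print(get_flag("new-checkout"))      # True, served by the new system
```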

So now ask yourself, who owns your cloud migration? Like, actually owns it. Is it someone who's measured on completion? Someone who has to live with the results? Someone who will be there six months after launch dealing with the long tail of problems? Or is it a project manager with 47 other projects who's going to move on the day after you cut over?

So here's what you actually need. A dedicated PM assigned to the migration full-time. Weekly stakeholder meetings with an open agenda, a progress dashboard, and pending decisions visible to everyone. Success metrics defined before you start, not "it's done", but real metrics: performance, cost, developer velocity, customer impact. You need a migration roadmap that looks like a product roadmap, with releases, with feature flags, with the ability to roll back any release.

Communication built into the process, not bolted on at the end. And most importantly, the PM's job isn't done at 95% complete, it's done at 100%, because that last 5%, the long tail of reluctant teams, is of course where migrations die. So Netflix took seven years. They treated it like building a product: continuous delivery of value, phased rollout, metrics at every stage.

TSB took a weekend because they treated it like an IT project. Big bang, all or nothing, no rollback. Which one would you rather be?

Let's talk about the part everyone skips in the planning meeting. What happens when it breaks? Not if, but when. And so here's the reality. Something will break. The question is whether you can recover in five minutes or five days. So there's this concept called the fall-forward architecture

that Netflix used for their billing database migration. And it's brilliant. You run three systems simultaneously: the source database, call it A, the target system, call it B, and a replica. So you have your old database up and running, your new cloud database, and a replica of A that starts as a perfect copy. Your applications write to B, the cloud database, but behind the scenes you're continuously replicating every change back

to the A replica using something like Oracle GoldenGate or AWS DMS. And the good part is, if you need to roll back, you don't go back to A, you go back to A's replica, because the replica has all the original data plus all the changes that happened during the migration window. So you've got zero data loss, like zero data loss. And this is what separates the adults from the children in cloud migration. Children pray nothing breaks.

Adults architect for when it breaks. You've just got to assume that it's all going to go wrong at some point. That's why migrations are hard.
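Here's a rough sketch of that fall-forward data flow. In reality the replication is done by CDC tooling like GoldenGate or DMS rather than application code, and the dictionaries below are just stand-ins for databases; it only illustrates why the replica, not the frozen source, is the rollback target.

```python
# Fall-forward sketch: the application writes to the new target (B), and each
# change is also applied to a replica that keeps the old system's shape (A').
# Rolling back means repointing at A', which has the original data plus
# everything written during the migration window. Real systems do this
# replication with CDC tooling (GoldenGate, DMS), not in app code.

source_a = {"acct-1": 100}           # original database (frozen at cutover)
target_b = dict(source_a)            # new cloud database, takes live writes
replica_a = dict(source_a)           # A': old shape, kept in sync with B

def write(key: str, value: int) -> None:
    target_b[key] = value            # normal path: write to the new system
    replica_a[key] = value           # continuous replication back to A'

write("acct-1", 250)                 # changes during the migration window
write("acct-2", 75)

# Rollback target is A', not A: zero data loss.
print("A  :", source_a)              # missing the new writes
print("A' :", replica_a)             # original data + migration-window writes
```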

So let me give you three rollback strategies based on how critical your system is. Strategy number one is the 10 minute recovery, for most production apps. You've got an automated CI/CD pipeline, automated detection when something's wrong, automated rollback to a previous version. This is your baseline. If you can't do this, you're just not ready to migrate.

Strategy number two, the three minute recovery for business critical systems. Everything from strategy one, plus the expand and contract database pattern: schema changes deploy separately from code, both old and new code work with both schemas, and you can roll code back and forth without touching the database.
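A tiny sketch of the "both schemas work with both code versions" part of expand and contract, using a hypothetical column rename. It's illustrative only, but it shows why code can roll back without a schema change.

```python
# Expand-and-contract sketch for a hypothetical column rename
# (email -> contact_email). During the expand phase both columns exist, so
# old and new code versions can read either shape, and code can be rolled
# back without touching the database.

def read_email(user_row: dict) -> str:
    # New code: prefer the new column, fall back to the old one.
    return user_row.get("contact_email") or user_row.get("email", "")

# Works against the old schema...
print(read_email({"email": "a@example.com"}))
# ...and against the expanded schema, before the old column is contracted away.
print(read_email({"email": "a@example.com", "contact_email": "a@example.com"}))
```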

And then strategy number three, instant recovery. This is for the "building is on fire" systems. You've got a blue-green deployment in production, a complete environment duplication, a load balancer that flips traffic in seconds. And yes, it costs twice as much for the duration, but that's the price of instant recovery. If your boss wants 99.999% uptime and you've got some crazy distributed system, that's what you have to pay for.

So here's a real example. First SMS went from a two hour recovery time to under five minutes, which of course is a 96-ish percent improvement. And you know what happened? Their executives stopped treating outages like existential crises, because they weren't anymore. When recovery is fast, you can move fast. When recovery is slow, you move like you're walking through a minefield, which affects everything from feature velocity to innovation and morale. Now,

I know what you're thinking, that sounds expensive. And of course you're right: running blue-green doubles your infrastructure for hours or days, and fall-forward architecture means running three systems. But let me put this into perspective. Delta Air Lines cancelled 5,000 flights during the CrowdStrike incident, $500 million in losses from one bad update. TSB's weekend migration cost over £1 billion in total.

The average cost of downtime, according to Gartner, is around $5,600 per minute. So yes, running duplicate infrastructure for 48 hours during a critical migration is expensive, but it's also the cheapest insurance policy you'll ever buy.

Now let's talk about canary deployments for infrastructure, because they work differently than code deployments. You're not deploying code versions, you're migrating workloads. So your canary pattern looks like this: 5% of workloads, 10%, 25, 50, 75, 100%. At each stage you bake, you monitor and you validate. AWS SageMaker does 10 minute bakes with CloudWatch alarms. If error rates or latency cross

a threshold, automatic rollback. For high risk systems, you might bake for hours, run your full test suite in production, manually verify critical paths. If you can't roll back in under 10 minutes, you're not ready to migrate that system. I don't care how good your testing is. I don't care how confident your architects are. If you can't get back to a working state quickly, something will bite you in production. So this isn't really negotiable.
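The canary loop itself is simple enough to sketch. This is an illustrative skeleton rather than SageMaker's implementation: the metric source, thresholds and stage sizes are all assumptions you'd replace with your own.

```python
# Canary-style workload migration sketch: step traffic up through fixed
# stages, bake at each one, and roll back automatically if error rate or
# latency crosses a threshold. `get_metrics` is a placeholder for whatever
# your monitoring platform exposes (CloudWatch, Datadog, ...).
import time

STAGES = [5, 10, 25, 50, 75, 100]        # percent of workloads on the new side
ERROR_RATE_MAX = 0.01                    # 1% errors
P99_LATENCY_MAX_MS = 400
BAKE_SECONDS = 600                       # 10-minute bake (shorten to dry-run)

def get_metrics(traffic_pct: int) -> dict:
    """Placeholder: fetch error rate and p99 latency for the canary slice."""
    return {"error_rate": 0.002, "p99_ms": 180}

def set_traffic_split(traffic_pct: int) -> None:
    print(f"routing {traffic_pct}% of workloads to the new environment")

def rollback() -> None:
    print("thresholds crossed: rolling all traffic back to legacy")
    set_traffic_split(0)

def run_canary() -> bool:
    for pct in STAGES:
        set_traffic_split(pct)
        time.sleep(BAKE_SECONDS)         # bake: let real traffic soak
        m = get_metrics(pct)
        if m["error_rate"] > ERROR_RATE_MAX or m["p99_ms"] > P99_LATENCY_MAX_MS:
            rollback()
            return False
    return True                          # all stages passed: migration done

if __name__ == "__main__":
    run_canary()
```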

So ask yourself this right now: for your most critical system, how long would it take you to roll back a failed migration? If you don't know the answer, that's your homework before the next stakeholder meeting.

Okay, let's move on to the most stressful phase of any migration: running two systems at once. You've got your legacy environment, you've got your cloud environment, and you need to monitor both on the same dashboard in real time. And that's where a lot of migrations die, not from the migration itself, but from flying blind during the transition. So imagine you're an SRE at 2 AM and you get paged with "checkout latency is spiking".

Is it the legacy database? Is it the new cloud database? Is it the network between them? The dual write logic? The replication lag? If you're switching between four different monitoring tools to figure it out, you've already lost; your mean time to resolution just went from minutes to hours. Here's something I found quite surprising when I was doing some research for this. Despite companies spending $17 billion annually on observability tools,

mean time to resolution is actually getting worse. In 2022, 47% of companies took over an hour to resolve production issues. By 2024, it was 73%. Why? Because teams are drowning in data, tool sprawl, alert fatigue, and no unified view. So try unified platforms: Datadog, New Relic, Splunk. Pick one and get really good at it.

Neto ran dual MySQL and Aurora databases for six months during cloud migration. Single Datadog dashboard for both. When issues came up, they could correlate metrics across environments, troubleshoot before customers even noticed. 99.95 % uptime throughout the entire migration, zero customer-facing incidents. Pantheon migrated 200,000 websites in two weeks.

New Relic APM monitored both legacy and GCP. They could trace errors down to specific lines of code in migrated workloads. So the monitoring strategy isn't about having more data, it's about having the right data in one place at the right time. During migration, you need four types of visibility. First, resource metrics: CPU, memory, network, disk, the basics.

Second, application performance: things like response times, throughput, error rates, the stuff users actually feel. Third, business metrics: transaction success rates, revenue impact, customer satisfaction, the stuff the execs care about. And then you also need a fourth type, migration-specific metrics: replication lag, dual write consistency, traffic split percentages, the stuff that's unique to the phase you are currently in.

And here's how it plays out. You're running 30% of traffic on cloud, 70% on legacy. Your dashboard should show side-by-side response time comparisons, error rate delta, cost per transaction and replication lag. And critically, alerts that understand context. Don't alert on cloud error rates until you're sending meaningful traffic there. Don't alert on high costs during the dual run phase when finance told you to expect that.
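One way to encode "alerts that understand context" is to make the traffic split and the migration phase inputs to the alert rule, not just the metric value. A minimal sketch, with made-up numbers and thresholds:

```python
# Sketch of context-aware alerting during a dual-run phase. The traffic split
# and the migration phase are inputs to the rules, so you don't page anyone
# about cloud error rates while the cloud side is only taking trickle traffic,
# or about doubled costs that finance already signed off on. Numbers are
# illustrative.

state = {
    "phase": "dual_run",
    "cloud_traffic_pct": 30,
    "legacy_error_rate": 0.004,
    "cloud_error_rate": 0.012,
    "replication_lag_s": 3.5,
    "cost_vs_baseline": 1.9,          # 1.9x spend while both systems run
}

def alerts(s: dict) -> list[str]:
    fired = []
    # Only compare error rates once the cloud side carries meaningful traffic.
    if s["cloud_traffic_pct"] >= 10 and s["cloud_error_rate"] > 2 * s["legacy_error_rate"]:
        fired.append("cloud error rate is more than 2x legacy")
    if s["replication_lag_s"] > 30:
        fired.append("replication lag above 30s")
    # Elevated cost is expected during dual run; only alert outside that phase.
    if s["phase"] != "dual_run" and s["cost_vs_baseline"] > 1.2:
        fired.append("cost more than 20% over baseline")
    return fired

print(alerts(state))
```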

One of the biggest problems I see: teams instrument everything, alert on everything, and then ignore everything because it's too noisy. You need alert tiers. You've got P0, where you need to wake someone up: customer impact, money loss, data corruption. P1, which is investigate during business hours: performance degradation, elevated errors, or resource exhaustion approaching. And you've got P2, which is look at it when you get to it:

interesting anomalies, potential optimisations, nice-to-know stuff. Because most migrations set everything to P0 and then wonder why everyone ignores the alerts.

So it's better to under instrument with good alerts than to over instrument with noise. I'd rather have 10 reliable alerts that mean something than a thousand alerts that might mean something. So focus on SLIs that map to user experience, business KPIs that map to revenue, and migration health metrics that predict problems. Everything else is optional. And speaking of things going wrong, let's talk about what happens when your migration fails.

What happens when AWS, Google or Azure, as happened in the last couple of weeks, just stops working?

So as I mentioned, on October the 20th, 2025, AWS us-east-1 went down for 15 hours. Snapchat, Roblox, Fortnite, Signal, Coinbase, Robinhood, Venmo, the McDonald's mobile app, UK government services, they all dropped offline. And the cause? DNS problems, it's always DNS problems. But this time it really was DNS problems, with a DynamoDB endpoint.

It was one subsystem, in one single region, and cascading failures across 100-plus AWS services.

This isn't theoretical. This literally happened last month. Google Cloud had a 14 hour IAM failure in June 2025. 100 % unavailability for new authentication. Gmail, Drive, Calendar, Meet, Search, all down. Even their status dashboard went offline. That CrowdStrike incident in July 2024. 8.5 million Windows systems. $5.4 billion in Fortune 500 losses.

and a bunch of airlines grounded, people stuck in the wrong place yet again. So the question that everyone asks is, should we go multi-cloud? And here's the answer to that question: hell no. Okay, probably not, but like almost definitely not. I know multi-cloud sounds great, no vendor lock-in, resilience, best of breed, but

here's the reality: 89% of organizations used multi-cloud in 2024. And you know what the research shows? The complexity often outweighs the benefits. Netflix runs entirely on AWS. Minutes of downtime per year, and they serve hundreds of millions of users across 190 countries. One cloud provider. And how do they do that? Chaos engineering. These guys built the Simian Army.

This is open source, so people can go and test this as well. Chaos Monkey, if you haven't found it, go and find it, because it randomly terminates production instances during business hours, which forces teams to build redundancy and automation. Chaos Kong simulates entire region failures and tests catastrophic recovery. And they align their entire engineering culture around one question, which is: what if it fails?
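The core of a Chaos Monkey-style experiment is only a few lines, which is part of why it's such a good habit. This is a toy sketch, not Netflix's tool: terminate() just prints, and in real use you'd scope it with opt-in groups and kill switches.

```python
# Toy Chaos Monkey-style sketch: during business hours, pick one random
# instance from an opted-in group and "terminate" it, so teams have to build
# redundancy and automated recovery. Not Netflix's implementation; terminate()
# here just prints instead of calling a cloud API.
import random
from datetime import datetime

OPTED_IN_INSTANCES = ["web-1", "web-2", "web-3", "worker-1", "worker-2"]

def business_hours(now: datetime) -> bool:
    return now.weekday() < 5 and 9 <= now.hour < 17

def terminate(instance: str) -> None:
    print(f"terminating {instance} (teams should see automated recovery kick in)")

def maybe_unleash_chaos(now: datetime | None = None) -> None:
    now = now or datetime.now()
    if not business_hours(now):
        return                       # only fail when people are there to learn
    victim = random.choice(OPTED_IN_INSTANCES)
    terminate(victim)

if __name__ == "__main__":
    maybe_unleash_chaos()
```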

And because they run everything in one cloud, they can go really deep. They understand AWS inside and out. They know the failure modes. They've built custom tooling. They've optimized for that very specific platform. Then of course, compare that to multi-cloud, where you've got separate IAM models for each cloud, different network security models, different monitoring platforms, different billing systems, different compliance frameworks, teams needing expertise in multiple clouds, and no volume discounts because you've split the workloads.

The operational overhead is like three to five times typical cloud operations expenses. Now, of course, there are reasons for multi-cloud. If you are part of a merger and acquisition and you inherit different clouds, I mean, that's probably fair enough until you can unpick it. Regulatory requirements where you need data in specific regions only one provider has, that may also be a valid reason. Best of breed services, where you want GCP's

machine learning, AWS's mature compute, Azure's Microsoft integration. That's maybe a reason, but you also have to understand the costs. And negotiating leverage, because you wanna keep vendors honest on pricing. I get it. But for most organizations, multi-AZ within a single cloud gets you 80% of the resilience benefits at 20% of the complexity.

So before you go multi-cloud, try this. Multi-AZ first: protect against data center failures, which are the most common outage type. Then multi-region within one cloud, which protects against regional failures like us-east-1 going down. Then chaos engineering, which sounds fun: actively test your failure scenarios. And then last, and only last,

multi-cloud.

The secret of course with cloud outages is that most aren't infrastructure failures, they're configuration failures. Azure's 2024 outage: a configuration update plus an incomplete VM allow list and a DDoS attack, a perfect storm. Cloudflare: a routine update exposing hidden flaws in complex systems. The lesson isn't that clouds are unreliable, it's that complex systems fail in complex ways. Multi-cloud doesn't fix that, it multiplies it. So if you want resilience,

invest in graceful degradation: can you run without some services? Circuit breakers: can you isolate failures? Bulkheads: are your systems isolated from each other? Observability: can you see problems before users do? And chaos engineering: do you test failure scenarios? Because these work regardless of how many clouds you use.
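Circuit breakers are the one on that list most teams can add this week. A minimal sketch of the idea: after enough consecutive failures, calls are short-circuited to a fallback for a cool-down window, so a failing dependency can't drag everything else down with it. Thresholds here are illustrative.

```python
# Minimal circuit-breaker sketch: after N consecutive failures, stop calling
# the dependency for a cool-down window and serve a fallback instead, so one
# failing service can't cascade. Thresholds and timings are illustrative.
import time

class CircuitBreaker:
    def __init__(self, max_failures: int = 3, cooldown_s: float = 30.0):
        self.max_failures = max_failures
        self.cooldown_s = cooldown_s
        self.failures = 0
        self.opened_at: float | None = None

    def call(self, fn, fallback):
        if self.opened_at is not None:
            if time.time() - self.opened_at < self.cooldown_s:
                return fallback()            # circuit open: degrade gracefully
            self.opened_at = None            # cool-down over: try again
            self.failures = 0
        try:
            result = fn()
            self.failures = 0
            return result
        except Exception:
            self.failures += 1
            if self.failures >= self.max_failures:
                self.opened_at = time.time() # trip the breaker
            return fallback()

breaker = CircuitBreaker()
print(breaker.call(lambda: 1 / 0, fallback=lambda: "cached response"))
```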

Alright, nearly there. Let's talk about the thing that kills more migrations than bad architecture does, which is of course communication failures.

TSB Bank had 1,400 people on their migration. You know what they didn't have? Proper communications channels. Teams didn't have access to the source systems. Documentation was locked away. Subject matter experts were unreachable. One billion pounds later, they'd learned that lesson.

So here's a communication playbook. Weekly stakeholder meetings with an open agenda, anyone can add topics. Progress dashboard, which is visual and not spreadsheets. Pending decisions on what's blocked and why. And clear DRIs on who owns what.

Then you've got automated transparency: real-time migration dashboards, self-service status, and visual progress.

Then you've got communication tailored by audience. Engineers want data and metrics, executives want business impact, and users want "what's in it for me". And here's the bit that nobody tells you: you need to over communicate during these migrations. Not send weekly emails, over communicate. Take stakeholders through the vision, design sprints, demos, success metrics, and the rollout strategy.

Because migrations have no external users giving feedback. You have to create that feedback loop internally.

Three communication gates that you must have. The commitment gate: formal approval before starting, in writing, with success criteria. The cutover gate: formal approval before production rollout, with rollback authority defined. And the completion gate: formal sign off after validation. Not "it's done", but "it works and we've proven it". These aren't bureaucracy, they're circuit breakers that prevent disasters.

All right, let's finish this one off. Cloud migration without chaos comes down to three principles. Productify it: assign a product manager, use alpha, beta and prod phases, gamify the progress, measure success. Phase it: strangler fig over big bang, dual write, dual read, gradual rollout, feature flags for infrastructure. Protect it: instant rollback capability, unified monitoring,

chaos engineering over multi-cloud complexity. The companies that succeed, Netflix, Spotify, Pinterest, they didn't have better technology. They had better process. They treated infrastructure migration like launching a product: with phases, with metrics, with the discipline to go slow so they could go fast.

So I have some homework for you. Who owns your next migration? For real, is it someone who lives with the results? Can you roll back your most critical system in under 10 minutes?

And what's your strangler fig strategy? What gets migrated first and why? Answer those three questions before your next planning meeting. Remember, TSB tried to do it in a weekend and they're still paying for it. Netflix took seven years and they built the largest streaming platform in the world during that migration. So speed is a trap, discipline is what wins.

Thanks for listening to Engineering Evolved. If this was helpful, make sure you check out conceptocloud.com, where we tackle a number of these common issues. And as for this podcast itself, share it with someone who is about to go and make a very expensive mistake. Until next time, go slow to go fast.

