Tom Barber (00:01)
Welcome back. So today we're going to look at the most dangerous phrase in modern software development, I think, which is "it works on my machine." For those of you who don't know me, I worked at NASA JPL for the best part of a decade, and we saw all manner of different deployments and people trying to run software in different ways.
And "it works on my machine" was by far the most common issue we ever heard when it came to getting the thing deployed into a different environment that was more production ready. So in this episode, I'm going to take you through how I went from being a naive developer to properly understanding production deployments and systems, the lessons that shaped everything that came after, and of course why
"it works on my machine" is the most dangerous phrase in software. Welcome to Engineering Evolved. Let's get into it.
Okay, welcome back to episode two. This one, like I said, is about why running things on your own machine is definitely not the way to validate whether something is ready to be deployed into production, and also some ways that may actually help you go about doing this better in the future. My name is Tom Barber and I am your host. I'm happy to have all of you along again.
So I'm going to give you some background on where I came from and how I got to this place in my career, and hopefully provide you with some horror stories that will stand you in good stead going forward. I didn't think I was going to be a modernization director when I started my career. I honestly thought I was going to run Linux systems, because I enjoyed running
Linux servers, so I figured I'd be a systems administrator and that would be about it. I was applying for a lot of jobs, as everyone does when they start out, and I ended up being an Excel jockey, for want of a better term. My job was to print off the month-end reports every month,
stick all these reams of A4 paper into envelopes and send the reports to different parts of the UK, so that the people who were supposed to read them could consume and understand a massive pile of numbers that meant very little to most people. I was just a very junior developer. All of this was written, by the way, in Visual Basic
inside of Excel and Microsoft Access. That's how business intelligence was done back then. You work in a small team and believe that code is the only thing that really matters, and that actually deploying this stuff into a production environment is someone else's problem. But when your month-end reports only run on one computer, that is obviously a bit of an issue.
When this stuff got shipped to a more production-ready environment, the deployment was magic that happened at some point after I handed the code off. You're naive to the way these things work, as are a lot of people, and they don't all have to be junior developers to be naive about it. What I thought was important was the ability to write good code following design patterns that made sense,
trying to put some unit tests in there, and optimizing it so that we could get these things printed as swiftly as possible. What I really ignored in all of this was questions like: where does it run?
How does it get there? What happens when it fails, and who maintains it? And this is why these types of setups work when you've got a very small team where everybody knows everything, the dev and production systems are basically the same system, and
the person who breaks things fixes them as soon as they're broken. There are no handoffs, no silos, no process. And everyone who goes on vacation definitely has to take their laptop with them, just in case. That's the small-team setup, and it's reasonably prevalent, I suspect, across a lot of smaller businesses. But it doesn't scale.
You learn that the hard way, and it gives you false confidence. I thought at the time I was a good developer: my code ran, the tests passed, the reports got printed, it looked good on my screen. So what else matters? Until you try to deploy that stuff and everything goes boom. And
especially for things like month-end reports, they're generally expected to come out in a reasonably timely manner, and on the days when that doesn't happen, you have a lot of very irritated people calling you up.
It also reminds me of when I moved on from that organization. I moved down to London and was working in a startup, and I can't count the number of times we would be in the pub next door after work, the systems administrator would get a ping from a piece of monitoring software that something had gone wrong, and he'd literally have to SSH into the infrastructure from his phone, from the pub, to reboot
something, because it ran fine on someone else's computer but didn't run as well in a cloud environment. I'm going to go out on a limb here and suspect that we weren't the only startup to do things like that. So if you've worked in a startup and you've administered infrastructure from the pub, you're not alone.
Then a few years later, though, I got a call that changed the direction of my career: a phone call asking me to join NASA JPL. For anyone that doesn't know, JPL is the Jet Propulsion Laboratory, based out in Pasadena, California, and it primarily deals with
deep space spacecraft, satellites in orbit, robots, automated stuff. Sadly, and I'm a big fan of manned space flight, we weren't working on manned space flight, but I did get to work on the Mars rovers, which was cool. Anyway, that's an aside. I was working on one project, though, where we needed to go through
modernizing, or productionizing, some code that was used for sniffing carbon monoxide in the atmosphere. Basically, planes with special snouts, boxes effectively stuck on the front, would fly around, sniff the air, and try to find pockets of carbon monoxide. Now, this was a public access project, and so they wanted to be able to create a
Google Maps-style visualization of the carbon monoxide over the state of California. I got roped into this because people had heard that I was alright at deploying stuff into a production environment, and I was given a bunch of MATLAB code which was used for extracting weather patterns, I believe. It was good at doing that, but
A, I'm not a MATLAB user, which is not great. And B, running that type of stuff in production requires licenses that people would have to pay for, and being a public-facing app that they wanted to open up to scientists and interested parties around the globe, they were less inclined to do that. So what did I do? Well, I was asked to figure out how to get this thing into production. So I figured, well,
There's enough data science stuff in Python, so we should be able to port the MATLAB to Python. Of course, LLMs did not exist back then, so that took a little bit of figuring out.
Then we would wrap the Python in a web interface and we would deploy it somewhere. How hard could it be?
The actual requirements that we got were a little bit different. We were asked to run it on government servers. For those of you who don't know AWS, there is the public cloud and there is GovCloud, and GovCloud is a little bit different in terms of who can access it. I could not, because I'm foreign, which threw a wrench into things. It must be able to handle a decent number of concurrent users. It must be maintainable by JPL staff after I had left.
It must not require MATLAB licenses, but it must also be able to process the real data at scale and be monitored for reliability.
The first way we tried to do this: port the MATLAB code to Python. Got that done. It worked on my Mac, the visualizations were great, and the processing seemed pretty decent. So we took that onto a JPL test server. Of course, what happened at that point?
We had different Python versions, missing dependencies, different file system paths, memory constraints. Everything was different. I'd gone from running it on a Mac to running it inside an EC2 VM. The code didn't just fail; it failed in ways I couldn't reproduce. Some of the test suites ran fine in different environments, but it was never super reliable. And so this became quite an interesting,
troubling debugging problem, where my bosses would say, it's not working. I would ask what the error was, and I'd be told it just stops. But it works. It works on my machine. And my boss would say, I don't have your machine. This went around quite often. I couldn't access the JPL servers; as I mentioned, GovCloud is a restricted environment, and foreign nationals were not allowed into it.
He couldn't give me the exact error messages, and I didn't know where to tell him to look. I couldn't reproduce the environment locally. Every fix I sent took days to test, and every fix revealed another environment difference. It was a bit of an ongoing problem.
Then one day I was sat around, reasonably frustrated, and I realized that my development environment was a lie, and production is really the only environment that matters. "Works on my machine" ends up being quite a meaningless phrase. Environment differences are not edge cases; they're the main case. I'd effectively been writing code in a fantasy world, and reality was about to teach me some hard lessons.
OK, so retrofitting quite often, if not always, fails. In the failed approach, we were trying to fix the MATLAB port. We added environment detection, a lot more hooks and checks, and installation scripts, bash script hell. We documented every edge case, and each new fix created more problems. It's a compounding effect. So we needed to pivot and start over.
What I learned I needed was to think about production from day one. Don't treat production as an afterthought. Think about what the target is going to be. If you're selling this to a customer, what operating systems are they going to be running? Does it run as a SaaS product? Does it run on a server somewhere? Does it run on users' hardware? All these types of things are important.
You've got to match the development environment to production as closely as you possibly can, or at least have local-ish test environments that you can test this stuff on. You need explicit dependency management, logging, monitoring, error handling. These are not afterthoughts; they all become core architecture decisions. And this is important as you grow, both from a career perspective and from a company perspective.
Now, this is going back quite a few years, so it wasn't super obvious back then, but we rebuilt this thing starting with Docker containers. Docker was definitely in its infancy, and it was interesting looking around the group at the wide eyes as people tried to figure out what we were going to do with Docker, considering no one had done anything with it before. We had development, staging, and production all
running the same container. We could develop stuff on a laptop, mount it into a development container, check it, make sure it was valid, make sure everything was working. We had explicit requirements. We had environment variables to handle configuration. And then we put more thought and effort into handling errors, monitoring, and that type of setup. Because this was running in the cloud, we had access to a number of
cloud monitoring services, so having better hooks out of the gate, rather than taking code that worked on someone's laptop, sticking it in the cloud and hoping the monitoring worked, made a big difference. And then we got to a reasonably good breakthrough moment. The second time we tried to spin this thing up properly, we pushed the container into a container registry and deployed it into AWS.
My American compadres, at least, started the container and it worked. Of course it worked first time, without debugging, and with no "works on my machine" excuses, because it did work on my machine, but it also worked inside the same container that was then deployed into production. Why did it work? Because the development environment was the production environment. There were no mystery surprises, no mystery failures, and
it's reproducible. If there are problems, you can still run that stuff locally, then build it, deploy it, send it on its way. And assuming your deployment environment is relatively sane, outside of configuration issues, you should end up getting like-for-like execution of your software.
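To make that concrete, here's a minimal sketch of the kind of configuration handling I'm talking about. This is not the actual JPL code; the variable names, paths, and defaults are hypothetical. The idea is simply that the container reads everything environment-specific from environment variables, so the exact same image runs on a laptop, in staging, and in production.

```python
import os
from dataclasses import dataclass


@dataclass
class Settings:
    """Environment-specific values; nothing here is baked into the image."""
    data_dir: str        # where the input files are mounted
    output_bucket: str   # where processed results get written
    log_level: str       # verbosity usually differs between dev and prod


def load_settings() -> Settings:
    # Every value comes from the environment, with safe local defaults
    # so the same container still starts on a developer laptop.
    return Settings(
        data_dir=os.environ.get("APP_DATA_DIR", "/data/input"),
        output_bucket=os.environ.get("APP_OUTPUT_BUCKET", "local-test-bucket"),
        log_level=os.environ.get("APP_LOG_LEVEL", "DEBUG"),
    )


if __name__ == "__main__":
    settings = load_settings()
    print(f"data_dir={settings.data_dir}, "
          f"output={settings.output_bucket}, log_level={settings.log_level}")
```

The point is that "my machine" and "production" only differ in the values injected at runtime, never in the code or the image itself.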
So what lessons did we learn from this? Production-first thinking is an important one. Start with production constraints. In our case, we've got a bit of a black box with GovCloud. We've got different operating system versions, hardened stuff, those types of things. And build a development environment to match that environment, not the other way around.
The environment is part of the code. You see this more today, but of course we were talking eight, nine years ago. Dependencies matter as much as the code, and configuration is part of the system. You can't separate code from where it runs. So try to think of the entire environment as an ecosystem, and how you're going to best enable the deployment of all these things into whatever the target is.
Handoffs require different thinking: code I maintain versus code that someone else maintains. What does that look like? You end up getting away from the "it runs on my machine" mantra, because you need to think about it as a platform that, whilst you may write it, you do not own. People have different requirements. Documentation is not optional; you have to be able to document this process. Monitoring,
especially for a product that's going to scale, isn't optional. And maintainability is basically a primary concern. Not now, but in six months, twelve months, two years, is this software still going to be maintainable? If I leave and someone else comes in, is there enough in place for them to understand what the deal is and how they're going to be able to upgrade it?
And then we've got the research code problem. The MATLAB was unmaintainable for a number of different reasons. It was written for one researcher on one machine, with assumptions baked in everywhere, hardcoded paths, all that type of stuff. There was no separation of concerns; code and data were mixed together. There was no error handling, no logging. The code wasn't bad, and I've seen this many times: the code wasn't bad, it was just research code.
Research code optimizes for flexibility and experimentation. Production code optimizes for reliability and maintainability. And you cannot retrofit production thinking onto research code. You must effectively start again, but with production in mind.
So this story started with me as a junior developer and then moving on to NASA, and some learnings I took away from both of those places of work. But this story isn't just about NASA JPL. This is basically about every mid-sized company I've worked with since. The pattern is very similar. You start small and the code works. You grow, and the code still works on some machines.
You grow more and then you start getting mysterious failures. They may be environment related. They may be load related. There's a number of different reasons. Then you end up with different machines behaving differently. Then you get, it works in dev, but doesn't work in staging. Then you get, it works in staging, but not in production. And engineers are debugging environments instead of building new features. And it slows down the development. It slows down your ability to get new features shipped. And so...
You need to work on this common problem. Now, why do mid-sized companies hit this so hard? In startups, everyone's on the same machines, using the same operating system, and they deploy to a known environment. When it breaks, the person who wrote it fixes it. They're going to be sat there, watch the CI/CD pipeline roll out, realize it's broken, go and fix the thing, push a fix, the build happens again,
and it's back up and running.
If you're in an enterprise environment, you've got standardized environments. You've got configuration management teams, not just a single full-stack developer editing some YAML: full configuration management teams, a whole bunch of infrastructure as code, and a dedicated DevOps team. This is why it hits the mid-sized companies harder than it hits the startups and the enterprises, because those organizations are designed for these types of things. You,
as an organization, likely have a mix of Mac and Windows developers. I like using a Mac; my old cloud engineer liked to use Windows. So again, you've got different ways of building, deploying, and testing these things. Then you've got a mix of contractors and full-time developers, some people using remote desktops, some people using their own hardware. You've got a mix of new and legacy systems.
So you're lacking standardization, you're lacking a dedicated infrastructure team, but you still need to be able to build and deploy to production in a reliable manner.
So here's the solution framework that I'm going to suggest at this point. Treat development like it's a production environment. Use containers or VMs. I'm a massive proponent of using DevPod or similar to build this stuff out in a way that ensures you're using the same base images as you would in a production environment.
Every developer then has to use DevPod, or whatever tool you're using, so that they all run the same environment. The dev container spec is there for a reason: especially as you've got people coming and going, or running on different hardware, it means "works on my machine" actually means "works on everyone's machine."
Make sure you've got explicit dependencies: a lock file for every dependency, version everything. Don't use "latest" tags, because things will get rug-pulled or upgraded underneath you and you won't know about it. And try not to rely on system dependencies, because when you're switching things out inside an image, you can't necessarily test for everything. If you upgrade a C library and that Python code calls it once in a month of Sundays, you may not notice until it falls over.
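One cheap habit that would have saved me days on that JPL project, offered here as a rough sketch rather than a prescription: have the application log its own environment at startup, so when someone tells you "it just stops," you at least know which Python and which library versions it stopped on. The package names below are just examples.

```python
import logging
import platform
import sys
from importlib.metadata import PackageNotFoundError, version

log = logging.getLogger(__name__)

# Example packages to report on; swap in whatever your service actually depends on.
PACKAGES_TO_REPORT = ["numpy", "flask"]


def log_runtime_environment() -> None:
    """Record interpreter and dependency versions so a failure in another
    environment can be compared against the machine where it 'worked'."""
    log.info("Python %s on %s", platform.python_version(), platform.platform())
    log.info("Executable: %s", sys.executable)
    for name in PACKAGES_TO_REPORT:
        try:
            log.info("%s==%s", name, version(name))
        except PackageNotFoundError:
            log.warning("%s is not installed", name)


if __name__ == "__main__":
    logging.basicConfig(level=logging.INFO)
    log_runtime_environment()
```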
Keep your configuration flexible. Where you can, use environment variables and configuration files. Do not ever hardcode stuff into your code base, and definitely, definitely never commit secrets. Make sure you have scanners set up on your Git repos so that these things do get picked up. If you want another tip, use something like pre-commit,
which will stop the commit from ever being made. So if you're liable to commit secrets, a pre-commit hook across the team will sort those situations out. And then think about production from day one. Make sure you've got logging built in, monitoring built in, health checks built in, error handling built in, and don't try to retrofit it later, because all of that stuff is so hard to put in
once you've got the platform up and running. And there's a large cost to getting this wrong. I've seen companies spend months debugging environment issues, lose good engineers to frustration because of the woes of operating in that environment, miss deadlines because deploys are scary, have outages from configuration drift, and build features that work in dev but not in production.
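As a small illustration of "built in from day one," here's a hedged sketch of what a minimal health check plus logging setup might look like in a Python web service. It assumes Flask purely for the example; the endpoint name and version string are made up.

```python
import logging

from flask import Flask, jsonify

# Configure logging before anything else so every module logs consistently.
logging.basicConfig(
    level=logging.INFO,
    format="%(asctime)s %(levelname)s %(name)s %(message)s",
)
log = logging.getLogger(__name__)

app = Flask(__name__)


@app.route("/health")
def health():
    # A deliberately boring endpoint: load balancers and monitoring
    # systems hit this to decide whether the instance is alive.
    return jsonify(status="ok", version="0.1.0"), 200


@app.errorhandler(Exception)
def handle_unexpected_error(exc):
    # Log the full traceback instead of letting it vanish, which is
    # exactly the "it just stops" problem from earlier in the episode.
    log.exception("Unhandled error: %s", exc)
    return jsonify(status="error"), 500


if __name__ == "__main__":
    app.run(host="0.0.0.0", port=8080)
```

None of this is clever, and that's the point: it's far easier to wire in now than to bolt on after the first outage.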
They're all preventable with production-first thinking from the start. I remember one open source product I used where the test suite would pass if you compiled the software and ran the tests in Pacific time, but not if you ran them in GMT. These are the nuances people hit. "It works on my machine" is a real thing. It never used to compile on mine, but if I set the timezone to Pacific time,
it would run, even though nothing in the tests was obviously checking for a particular time zone. It was super weird, but those types of edge cases are very real.
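For anyone who hasn't been bitten by one of these yet, here's a contrived little sketch, not the actual project I was describing, of how a time-zone-dependent failure tends to creep in: code that builds a date from the machine's local clock will disagree with another machine, or a test, that assumes UTC. The function names are made up.

```python
from datetime import datetime, timezone


def todays_partition_naive() -> str:
    # Uses the host's local clock, so a machine in GMT and a machine in
    # Pacific time disagree about what "today" is for eight hours of every day.
    return datetime.now().strftime("%Y-%m-%d")


def todays_partition_utc() -> str:
    # Pin the clock to UTC so every environment computes the same answer.
    return datetime.now(timezone.utc).strftime("%Y-%m-%d")


if __name__ == "__main__":
    print("local:", todays_partition_naive())
    print("utc:  ", todays_partition_utc())
```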
So there we go, the lesson that shaped everything. The NASA JPL project failed spectacularly at first because we tried to run research code in a productionized environment. We then moved on and succeeded with the second, containerized approach, even though containers at the time were in their infancy. Same code, different approach. The difference? The first time, we thought the code was everything. The second time,
we realized that the environment, of course, is part of the system you're trying to integrate into. Why am I telling you this? Every technical leader I have ever worked with has had a moment like this. If you go down to the pub and swap war stories with other technical leaders, you'll realize they're all very similar stories.
The moment you realize that your code alone isn't enough. The moment you realize that production is different. The moment you realize that scale changes everything. Mine was a phone call saying it didn't work. When will yours be? Maybe it's already happened. If not, I'm sure at some point in the future you will have a similar war story to tell. But there is a better question: can you learn from my mistakes instead of making your own?
For your company, are your developers thinking about production or are they thinking about it working on their machine? Are you retrofitting production thinking or building it in from day one?
The time to fix this is now. Not after the outage, not after the customer loss, not after the engineer quits in frustration. You need to do it now.
I hope you enjoyed this episode, episode two of Engineering Evolved. My name is Tom. I hope you come back for the next iteration of this podcast. And until then, drop me a message, let me know what you think, and also share your horror stories with me. I would love to hear some more of them. And goodbye for now.