Like every software engineer, I spend a lot of time looking online for solutions to problems (usually via Google, but more recently Kagi and Claude). I am eternally grateful to other engineers who have taken the time to write about their solutions. Especially those who cover the why as well as the how.
In that spirit, I'm going to break down the process I've been using for the past few years to ship code into production for the SaaS products I'm involved with. They're all Python and hosted in AWS, but that's not required for this approach.
Local Development
I spent many years dealing with a development server and database in the cloud, where the whole team used svn (with a fantastic homegrown web front-end that preceded Github by many years) to create branches and do their work. It generally worked fine, until somebody broke the database or the development server collapsed.
I'm now a huge fan of Docker-based local development. Whatever language and database you're working with, if you can get everyone running your application locally through Docker, it's going to be a positive experience. A simple compose file to get everything launched, VS Code for editing, and no central resource for everyone to worry about.
I'm increasingly a fan of making our software 100% local, back to that "famous" 5-minute Wordpress install I talked about yesterday. That means a developer should be able to get the whole thing running locally via Docker, and the whole application should work without any AWS access. That may sound simple, but it means abstracting certain features and sometimes not using AWS services. For example, if you upload to S3 in production you'll need to abstract that in your code so that somebody running locally can still upload and retrieve files from local disk.
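To make that concrete, here's a minimal sketch of the kind of storage abstraction I mean. The class names and environment variables are illustrative, not from our codebase; the point is that the rest of the application only ever talks to `save`/`load` and never mentions S3 directly.

```python
# A minimal sketch of abstracting file storage so local development
# needs no AWS access. Names here are hypothetical.
import os
import boto3


class LocalStorage:
    """Local development: files go to a directory on disk."""

    def __init__(self, root: str):
        self.root = root

    def save(self, key: str, data: bytes) -> None:
        path = os.path.join(self.root, key)
        os.makedirs(os.path.dirname(path), exist_ok=True)
        with open(path, "wb") as f:
            f.write(data)

    def load(self, key: str) -> bytes:
        with open(os.path.join(self.root, key), "rb") as f:
            return f.read()


class S3Storage:
    """Production: the same interface, backed by S3."""

    def __init__(self, bucket: str):
        self.bucket = bucket
        self.client = boto3.client("s3")

    def save(self, key: str, data: bytes) -> None:
        self.client.put_object(Bucket=self.bucket, Key=key, Body=data)

    def load(self, key: str) -> bytes:
        return self.client.get_object(Bucket=self.bucket, Key=key)["Body"].read()


def make_storage():
    # Pick the backend once at startup; STORAGE_BACKEND and
    # STORAGE_BUCKET are made-up variable names for this sketch.
    if os.environ.get("STORAGE_BACKEND", "local") == "s3":
        return S3Storage(bucket=os.environ["STORAGE_BUCKET"])
    return LocalStorage(root="./uploads")
```

Production containers set the environment variable; the local compose file doesn't, so developers get the disk-backed version automatically.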
They should also be able to run it without an internet connection. If your team want to work in the middle of nowhere, they should be able to. That means vendoring all your JS and CSS libraries, rather than loading them from public CDNs.
You'll be surprised how you naturally get better practices from following that approach.
Git Process
I want the interaction with Git to be as minimal as possible (one of the reasons for not using GitFlow) and am a happy user of the Github Desktop application. I want to minimise merge conflicts and I want to minimise the cognitive load for the team.
With that in mind, our process is simple. The development team branch off `main`, do their work in a feature branch, and then merge back to `main` again.
We always consider the `main` branch to be deployable. Now what does deployable mean?
- We don't merge half-complete implementations. And if we do, they're hidden behind feature flags/configuration values (see the sketch after this list).
- We have automated tests (both unit tests and integration tests via Playwright) to make sure that nothing is broken.
- Every job goes through a peer-review process before being merged (via Github).
- There's no cherry-picking. Once it's in `main`, that's it: it's going to be deployed. We may occasionally roll something back if we're not quite ready for it, but you can't decide to only move some of `main` forward into production.
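On the feature-flag point above, it doesn't need to be anything fancy. Here's a minimal sketch with Flask, using a made-up flag name and route:

```python
# A minimal sketch of hiding a half-finished feature behind a
# configuration value. Flag name and route are hypothetical.
from flask import Flask, abort

app = Flask(__name__)
app.config["ENABLE_NEW_REPORTS"] = False  # flip to True when the feature is ready


@app.route("/reports/new")
def new_reports():
    if not app.config["ENABLE_NEW_REPORTS"]:
        abort(404)  # the code is in main, but invisible until enabled
    return "The shiny new reports page"
```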
Once a job is merged to `main`, we run a build process that creates a Docker image. This happens via a combination of CodePipeline and CodeBuild, and the resulting image is uploaded to ECR. But you could easily do this with Github Actions and any private Docker registry.
The image gets tagged with the commit id from Github, and the newest image is always tagged with `latest`.
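The build step itself is nothing exotic. Here's a hedged sketch of the tagging logic as Python, assuming Docker is available in the build container. `CODEBUILD_RESOLVED_SOURCE_VERSION` is the commit SHA that CodeBuild really does provide; `ECR_REGISTRY` is a made-up variable name standing in for the full repository URI.

```python
# A sketch of the build-and-tag step. ECR_REGISTRY is hypothetical;
# CODEBUILD_RESOLVED_SOURCE_VERSION is set by CodeBuild itself.
import os
import subprocess

registry = os.environ["ECR_REGISTRY"]
commit_id = os.environ["CODEBUILD_RESOLVED_SOURCE_VERSION"][:12]

# Build once, then tag the same image with both the commit id and latest.
subprocess.run(["docker", "build", "-t", f"{registry}:{commit_id}", "."], check=True)
subprocess.run(["docker", "tag", f"{registry}:{commit_id}", f"{registry}:latest"], check=True)
subprocess.run(["docker", "push", f"{registry}:{commit_id}"], check=True)
subprocess.run(["docker", "push", f"{registry}:latest"], check=True)
```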
We have separate development and production AWS accounts (and you should too), but the important thing is that the resulting image from the build process is shared to both environments. ECR lets you create those cross-account sharing rules.
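For reference, the sharing rule is just a repository policy on the development account's repository. A sketch with boto3, with a placeholder account id and repository name:

```python
# A sketch of granting the production account pull access to an ECR
# repository in the development account. PRODUCTION_ACCOUNT_ID and
# the repository name are placeholders.
import json
import boto3

ecr = boto3.client("ecr")
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "AllowProductionPull",
            "Effect": "Allow",
            "Principal": {"AWS": "arn:aws:iam::PRODUCTION_ACCOUNT_ID:root"},
            "Action": [
                "ecr:GetDownloadUrlForLayer",
                "ecr:BatchGetImage",
                "ecr:BatchCheckLayerAvailability",
            ],
        }
    ],
}
ecr.set_repository_policy(repositoryName="myapp", policyText=json.dumps(policy))
```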
However, the image is only deployed automatically to the development environment because we have stakeholders.
Stakeholders
What/Who are stakeholders?
These are the people that care whether or not your software is any good. They may be project managers. They may be product owners. They may be a QA department. They may be customers, depending on how you sell your software.
Despite all your tests, peer reviews and developers' best efforts, stakeholders want to see your hard work before it's pushed out to the whole world. Just because the code passes your tests doesn't mean it's good.
They're also not technical, so they can't pull the code down themselves and build their own Docker containers locally. Or maybe they could, but I'm assuming the thought of having to support that gives you the fear.
So we have a development environment they can access to see the latest version of the application. They can clearly see what's going to be deployed next.
And because the Docker image is shared between the development and production environments, it's exactly the same code that's going to be deployed to both. There's no chance of them seeing something in dev, and then something else getting deployed to production. Your QA team will like that.
Deployment
We use CodeDeploy with Fargate on ECS. It's blue/green, so we deploy a complete set of new containers and make sure they're OK before updating the load balancer to point at the new IP addresses and destroying the old ones. CodeDeploy manages all of this for us, which is fantastic.
However... this would work just as well with Github Actions SSHing into a box in Digital Ocean. It's just a Docker image you need to put on your server.
Production Deployment
Assuming everyone is happy and it's time to do a deploy, we add a `release` tag to the image within ECR in production. This automatically starts another CodePipeline/CodeDeploy combo that deploys that image with the same blue/green process.
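Adding that tag doesn't require pulling or rebuilding anything; ECR lets you re-point a tag at an existing image manifest. A sketch with boto3, where the repository name and commit id are placeholders:

```python
# A sketch of promoting an image by re-tagging it as "release",
# without pulling or rebuilding. Repository name and commit id
# are placeholders.
import boto3

ecr = boto3.client("ecr")

# Fetch the manifest of the image we want to ship, by its commit-id tag...
image = ecr.batch_get_image(
    repositoryName="myapp",
    imageIds=[{"imageTag": "abc123456789"}],
)["images"][0]

# ...and push the same manifest back under the "release" tag.
ecr.put_image(
    repositoryName="myapp",
    imageManifest=image["imageManifest"],
    imageTag="release",
)
```

Rollback, covered below, is exactly the same call with an older commit id.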
We try to deploy once or twice a week, but that does rely a lot on the stakeholders.
Rollback
Easy: just tag a previous image with the `release` tag to deploy that one instead.
Database Migrations
We use Alembic (with Flask), and we run the Alembic upgrade command within CodeBuild in the CodePipeline, before CodeDeploy. If this step fails, we stop the pipeline and don't deploy the image; a DBA then needs to take a look at the database to see what happened. Since this is all within CodeBuild, it's easy to look at the logs in Cloudwatch and see why the command errored.
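The step itself can be as small as running the CLI and letting the exit code speak. A minimal sketch, assuming Alembic is installed in the build image:

```python
# Run the migrations; a non-zero exit code fails the CodeBuild step,
# which stops the pipeline before CodeDeploy ever runs.
import subprocess
import sys

result = subprocess.run(["alembic", "upgrade", "head"])
sys.exit(result.returncode)
```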
This also means that we consider our database migrations to be a pre-deploy step, not a post-deploy step. The important thing to know about that is that database migrations have to be backwards compatible, as your database is going to change before the code does. If the migrations work, but the deploy fails, you still want the application to continue running even though you added additional columns.
The alternative would be a post-deploy step, but then your new code goes live and it needs to be able to cope with the columns it relies on not being there yet. And that's harder to work with (tried that!).
Pre-deploy also means that you should do migrations that drop columns separately. Consider migrations to be additive, and then worry about cleanup migrations to remove deprecated fields later.
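Here's what the additive pattern looks like as an Alembic migration (table, column, and revision ids are made up). The new column arrives nullable, so the code currently in production carries on untouched; the migration that eventually drops a deprecated column ships separately, later, once nothing deployed still reads it.

```python
"""Add shipping_notes to orders (additive, backwards compatible)."""
from alembic import op
import sqlalchemy as sa

# Hypothetical revision identifiers.
revision = "a1b2c3d4e5f6"
down_revision = "f6e5d4c3b2a1"


def upgrade():
    # Nullable, with no default required: old code can keep inserting
    # rows without knowing the column exists.
    op.add_column("orders", sa.Column("shipping_notes", sa.Text(), nullable=True))


def downgrade():
    op.drop_column("orders", "shipping_notes")
```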
Pre-deploy also makes it easier for developers to test locally, as they can run the migrations and then switch back to the `main` branch and run tests to confirm everything is still fine.
And backwards-compatible migrations are what make rollbacks easier.
Half-completed migrations can be a pain to resolve, which is a good reason to use PostgreSQL instead of MySQL: PostgreSQL supports transactional DDL, so a failed migration rolls back cleanly instead of leaving the schema half-changed.
Hotfixes
Sometimes you need to fix a bug in production but you don't want to do a full deploy, because a full deploy will also drag along everything else that's been merged to `main` since your last production deploy, and your stakeholders aren't ready for that yet.
To resolve this we have a CodeBuild configuration that builds an image whenever a branch with "hotfix" in its name has some code merged into it. It goes something like this:
- Follow the usual process for creating a branch off `main` and fix your bug.
- Merge it back into `main`.
- Create a new branch off `main` called `hotfix-something`.
- Create a branch off `hotfix-something` called `my-urgent-fix`.
- Cherry-pick the commit from `main` containing your fix and put it into your `my-urgent-fix` branch.
- Do a PR to pull from `my-urgent-fix` into `hotfix-something`.
When that PR is merged, Codebuild sees the name of the branch and creates a new image using the usual process, tagging it as normal with the commit id, but doesn't deploy it to development (because the development environment already contains the fix, as the fix is already in `main`). And then we just tag that image as `release`, deploy it to production, and delete the hotfix branch.
Alternative Flows
There are plenty of other approaches, so to round this out I'll give a few reasons why we don't use them.
- Github Flow. If you don't have stakeholders, this is as simple as it gets. But if you have any team at all, I think it's useful for people to see something in a preview environment before it goes live. I'm also not ready to trust that tests are 100% reliable.
- GitFlow. It's so complicated! There's so much merging going on between steps that the whole thing creates a spaghetti mess of commit history, followed by the very high likelihood that you'll screw something up somewhere. Worst of all, what gets deployed to production is not the same as what was deployed to development, because there's a merging step between branches along the way.
- OneFlow. This one is pretty good for the stakeholder lifestyle, because it has release branches that you could deploy for them to see before moving them forward. It is a bit more complicated than what we're doing, and while release previews might sound straightforward, creating a database for each one is incredibly difficult.
Conclusion
It works for us! It may not work for you, but it's been a really positive experience for our team.
Pull previews are the next step, to automatically spin up a preview environment (using the local docker compose file that developers use) on EC2 when a Github PR is created. I'll write more about that when the time comes.