Production Oriented Development
Throughout my career, I've developed some opinions. Some have worn particularly deep ruts,
reinforced by years of experience. When I tried to figure out what these
opinions had in common, it came down to one idea: code in production is the
only code that matters. Staging doesn't matter, code on your laptop doesn't
matter, QA doesn't matter - only production matters. Everything else is debt.
This perspective probably comes from years spent sitting between
operations and product development. I strongly believe that teams should
optimize both for getting code to production as quickly as possible and for
responding quickly to incidents in production.
This idea, and a lot of the practices it implies, can be
counter-intuitive or controversial, so I want to dive into them a little
further. What follows is a set of practices and principles I believe to be
true, all flowing from that underlying conviction: code working in production
is the only code that matters.
1. Engineers Should Operate Their Code
Engineers are the subject matter experts for the code they write and should be
responsible for operating it in production. In this context, “operating” means
deploying, instrumenting, and monitoring code as well as helping to resolve
incidents related to or impacting that code. The responsibility of operating
code aligns incentives - it encourages engineers to write code that is
observable and easy to debug, and connects them to what customers really care
about. It encourages them to be curious about how their code is performing in
production. Importantly, engineers should be on-call for their code - being
on-call creates a positive feedback loop and makes it easier to know if their
efforts in writing production-ready code are paying off. I’ve heard people
complain about the prospect of being on-call, so I’ll just ask this: if you’re
not on-call for your code, who is?
If you’re not currently on-call for your code but want to be, and can help
influence this decision, there are some things you can do. Set up PagerDuty (or
similar) schedules for each group of engineers responsible for specific
services or parts of your code. A good schedule has 6–8 engineers. There are
plenty of variations, but a typical template is to have one-week rotations,
where you're on-call as the secondary for a week and then as the primary for a
week. Configuring alerts is a separate topic that probably deserves its own
blog post, but focus on things that impact your customers (see:
Symptom-based alerting) and remember that you’re ultimately responsible for how
you respond to alerts, which means you can change them.
There are two talks I'd recommend watching that touch on the topic of
configuring alerts: Liz Fong-Jones covers SLOs in Cultivating Production Excellence, and Aditya Mukerjee covers techniques for
managing alerts in Warning: This Talk Contains Content Known to the State of California to Reduce Alert Fatigue.
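To make that rotation template concrete, here's a minimal sketch (Python, with a made-up roster) of what a primary/secondary weekly rotation looks like - each engineer shadows as secondary for a week before taking over as primary. In practice you'd configure this in PagerDuty or a similar tool rather than in code; this is just to show the shape of the schedule.

```python
from datetime import date, timedelta

# Hypothetical roster - a good schedule has 6-8 engineers.
ENGINEERS = ["ana", "bo", "chen", "dee", "eli", "fay", "gus"]

def weekly_rotation(roster, start, weeks):
    """Yield (week_start, primary, secondary) tuples.

    Each engineer spends a week as secondary, then takes over as
    primary the following week, so whoever is primary has just
    shadowed the rotation.
    """
    for week in range(weeks):
        primary = roster[week % len(roster)]
        secondary = roster[(week + 1) % len(roster)]
        yield start + timedelta(weeks=week), primary, secondary

if __name__ == "__main__":
    for week_start, primary, secondary in weekly_rotation(ENGINEERS, date(2020, 1, 6), 8):
        print(f"{week_start}: primary={primary}, secondary={secondary}")
```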
2. Buy Almost Always Beats Build
If you can avoid building something, you should. Code is the most expensive
way to solve a problem that isn’t addressing a core area of your business. For
most small to mid-sized companies, there are open source or, better yet, hosted
solutions that solve a wide range of common problems. I mean things like git
repository hosting (GitHub, GitLab, Bitbucket, etc.), observability tooling
(Honeycomb, Lightstep, etc.), managed databases (Amazon RDS, Confluent Kafka,
etc.), alerting (PagerDuty, OpsGenie, etc.), and a whole host of other commodity
technologies. This even applies to your infrastructure - if you can help
it, don't roll your own Kubernetes clusters (side note: do you even need
Kubernetes?), and don't roll your own load balancers when Amazon ELBs or ALBs
will do.
Unfortunately, NIH syndrome is very real and some companies get burned badly by
this. I’ve seen teams light time and money on fire reinventing components when
better, more battle-tested alternatives exist in the market. Those same teams
almost always end up spending years contending with the resulting technical
debt. If you're on such a team and have the will and ability to effect change,
start rolling back these decisions one by one. Migrate your databases to a
managed provider, migrate your feature flagging system to a SaaS tool (e.g.
LaunchDarkly). Keep going until the only software you maintain yourselves is
the software that delivers value to your customers. You’ll be much, much better
off for it.
3. Make Deploys Easy
Deploying should be a frequent and unexciting activity. Engineers should be
able to deploy with minimal manual steps, it should be easy to see whether a
deploy succeeded (this requires instrumenting your code for observability,
which - tada - is covered above), and it should be easy to roll back if
something doesn't go well. Deploying frequently implies that
deploys are smaller, and smaller deploys are generally easier, faster, and
safer.
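To illustrate what "minimal manual steps" plus easy rollback can look like, here's a rough sketch of a deploy wrapper that deploys, checks a health signal, and rolls back automatically if the check fails. The deploy.sh script and the /healthz endpoint are placeholders - substitute whatever your pipeline and observability stack actually provide.

```python
import subprocess
import sys
import time
import urllib.request

# Placeholders - substitute whatever your pipeline actually uses
# (kubectl, a CI job, an internal CLI, etc.).
DEPLOY_CMD = ["./deploy.sh", "--version", "{version}"]
ROLLBACK_CMD = ["./deploy.sh", "--rollback"]
HEALTH_URL = "https://example.com/healthz"  # hypothetical health endpoint

def run(cmd):
    print("+", " ".join(cmd))
    subprocess.run(cmd, check=True)

def healthy(url, attempts=10, delay=5):
    """Poll a health endpoint; in real life this is where you'd check
    your observability signals (error rate, latency, saturation)."""
    for _ in range(attempts):
        try:
            with urllib.request.urlopen(url, timeout=3) as resp:
                if resp.status == 200:
                    return True
        except OSError:
            pass
        time.sleep(delay)
    return False

def main(version):
    run([arg.format(version=version) for arg in DEPLOY_CMD])
    if not healthy(HEALTH_URL):
        print("deploy looks unhealthy, rolling back")
        run(ROLLBACK_CMD)
        sys.exit(1)
    print("deploy healthy")

if __name__ == "__main__":
    main(sys.argv[1])
```

The specifics don't matter much; what matters is that deploying, verifying, and rolling back are each a single, boring step.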
Many teams implement periods when deploys are forbidden - code freezes, or
policies like "Don't deploy on Fridays". Such blackout periods lead to a
pile-up of changes, which increases the overall risk of something going very
wrong.
If you’re on a team that fears deploys, dedicate a percentage of your
engineering time to improvements in your deployment pipeline until the fear is
gone. On a recent team I worked with, we were able to improve deploy times from
3 hours to 30 minutes, which drastically improved the team's confidence in the
deploy process. A natural side effect of this was that engineers started
deploying much more frequently instead of waiting for changes to pile up enough
to warrant a “release” (which was synonymous with a deploy).
The book Accelerate has been getting a lot of attention. If you haven't read
it, I'd recommend it.
The team behind it also publishes the State of DevOps
reports, which are full of well-researched information about what various
companies in the industry are doing. It's not a coincidence that two of the
four key metrics the book focuses on (Deploy Frequency and Change Lead Time)
are directly related to making deploys easy and frequent. Shipping is your
company's heartbeat.
4. Trust the People Closest to the Knives
The people who work with a system are the ones who understand it best. This
applies to any part of the socio-technical systems within which we all work. In
the case of software systems, the engineers who deploy every day and are
on-call for critical services understand the level of risk they operate under.
A sad trend is that managers tend to overestimate their teams' progress on
certain transitions - e.g. cloud-native or DevOps adoption. The higher up the
management chain, the larger this overestimation tends to be. Engineers who
deploy and get paged when things break know where the bodies are buried and
they know what needs the most work. They should, therefore, be the primary
stakeholders responsible for prioritizing technical work.
Another manifestation of this principle applies to platform or services teams.
If you're responsible for building some shared component that's used within
your organization (e.g. a messaging system, CI/CD infrastructure, shared
libraries or services), there's an uncomfortable truth lurking for you: in many
cases, the people who use your work know more about it than you do. They
understand implicitly how it serves customers, and they know what hoops they
have to jump through to get it to work. Listen to them for
clues on how to improve the UX of your services and tools.
5. QA Gates Make Quality Worse
Many teams have a manual QA step that gets performed before deploys. The idea,
I guess, is to have someone run automated or manual tests to verify that a set
of changes is ready to be released. This sounds like a comforting
idea - having a human being (or team of human beings) “verify” a release before
it goes out - but it falls victim to several false assumptions and creates
some misalignments that do more harm than good.
First of all, if there’s manual work that needs to be done before a deploy can
go out, that creates a bottleneck - if you're making deploys easy and deploying
small changes frequently, no QA team will be able to keep up with testing every
deploy, and the QA step will inevitably block teams from deploying. That's no
good. If you have manual tests that deliver value, automate them and build them
into your CI pipeline.
Secondly, the teams doing QA often lack context and are under time pressure.
They may end up testing “effects” instead of “intents”. For example, I’ve seen
QA teams burn time testing that when something happens in a UI, something
related happens in a database. What happens when an engineer refactors that UI
component and changes the underlying data model? The functionality works, but
the test breaks. Because two teams are involved, this takes coordination and
time to fix. Similarly, I’ve seen QA teams block deploys because of failing
tests when caching was introduced at the CDN layer - a TTL of 5 seconds on an
activity feed may never be noticed by a user, but it might break QA tests,
causing unnecessary conflicts between product and QA engineers.
Luckily, solving this one is easy. Instead of having a dedicated QA team work
on creating manual and automated test cases that run in a fictitious QA
environment, reassign that team to work on continuous testing in production.
Instead of being a gate for deploys, a QA team could continuously verify that
production is working as expected. QA teams are also well situated to lead
Chaos Engineering initiatives, where faults are intentionally injected in
production. QA engineers could also work on making the CI/CD pipeline more
reliable, so that deploys are no longer a nightmare.
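As a sketch of what continuously verifying production might look like, here's a minimal synthetic check that exercises a user-visible flow and asserts on symptoms (the feed loads, has content, and is fast) rather than implementation details. The endpoint and threshold are made up.

```python
import time
import urllib.request

# Hypothetical user-facing endpoint and an SLO-ish latency threshold.
FEED_URL = "https://example.com/api/v1/feed"
MAX_LATENCY_SECONDS = 1.0

def check_feed():
    """Assert on the symptoms customers care about: the feed loads,
    has content, and is fast - not on how it's cached or stored."""
    start = time.monotonic()
    with urllib.request.urlopen(FEED_URL, timeout=5) as resp:
        status = resp.status
        body = resp.read()
    elapsed = time.monotonic() - start
    assert status == 200, f"unexpected status {status}"
    assert body, "feed response was empty"
    assert elapsed < MAX_LATENCY_SECONDS, f"feed took {elapsed:.2f}s"

if __name__ == "__main__":
    # Run on a schedule (cron, a checker service, CI) and alert on
    # sustained failures rather than gating individual deploys.
    check_feed()
    print("feed check passed")
```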
6. Boring Technology Is Great
With thanks to Dan McKinley,
always strive for boring tech when possible. Systems are inherently
unpredictable, and you want a broad base of expertise to fall back on when shit
goes sideways. There are also routine operations that you'll have to do
(deploys, database migrations, etc.), and it's Very Nice to have widely used and
well-tested tooling for this stuff. I think of databases most often when I think
about this belief. MySQL is a database with many, many quirks, but it is so
widely used that you should still just use it most of the time.
Very few organizations have the bandwidth to debug unique problems. You
don’t want unique problems, especially when performing routine
operations - e.g. storing bytes on disk, choosing a new leader in a cluster,
garbage collecting objects, querying time-series data, etc. Having unique
problems will kill a small-to-medium-sized team. It will sap you of your
creative energy, which is better used creating value for customers who want to
pay you monies for your software. Use your innovation tokens wisely!
Using boring technology means you can lean on a large community of users. Shit
on it all you want, but there are very few PHP issues that someone else hasn’t
already encountered. Nowadays, the same is probably true for sufficiently
widely used versions of Ruby on Rails. I often say that I like to be in the 3rd
cohort of technology adoption. The 1st cohort is the bleeding-edge
organizations. The 2nd cohort is the people who feel like they can take some
risks. Let those two groups go before you, run into all the big problems, and
then you can go, benefiting from all of their hard-won experience.
7. Simple Always Wins
I don’t have much to say about this, but we’re all writing YAML and JSON
instead of XML and we’re all using HTTP instead of CORBA, RMI, DCOM, XPCOM,
etc. Right? In that same spirit, I’d rather debug problems in a LAMP stack than
a Microservices architecture any day.
Quick sidebar on Microservices: as with so many trends in tech, they are often
sold as a panacea. Let me be clear: Microservices, designed well, solve some
specific problems and, as with most solutions to complex problems, involve
several trade-offs. If you are going in this direction, I do have opinions on
how you should do it, but I also think you should hold off for as long as you
can.
8. Non-Production Environments Have Diminishing Returns
A more direct heading for this section would be “Non-Production Environments
are Bullshit”. Environments like staging or pre-prod are a fucking lie. When
you’re starting, they make a little sense, but as you grow, changes happen
more frequently and you experience drift. Also, by definition, your non-prod
environments aren't getting real traffic, which makes them fundamentally
different from production.
The amount of effort required to maintain non-prod environments grows very
quickly. You’ll never prioritize work on non-prod like you will on prod,
because customers don't directly touch non-prod. Eventually, you'll be
scrambling to keep this popsicle-sticks-and-duct-tape environment up and
running so you can test changes in it, lying to yourself and pretending it
bears any resemblance to production.
9. Things Will Always Break
It's impossible, and arguably even undesirable, to try to avoid failure
entirely. Lean into the fact that failure is inevitable, and focus on how you
respond to it. This means
investing in a continuously improving incident response process. There’s no
one-size-fits-all for every company and team, but you should have a good idea
of what to do when things go wrong, and you should have mechanisms in place to
learn from those situations and improve your processes. Invest in Incident Analysis. It’s a huge field with lots of valuable tools and resources for
maximizing the return on investment when incidents occur (or don’t!).
This is an area where Chaos Engineering can be helpful. Injecting failures into production can improve confidence in how to respond when a system
starts behaving in unexpected ways. Game Days can be a particularly effective
way to allow a team of engineers to practice various outage scenarios.
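If you want to start small, a fault doesn't need a dedicated chaos platform - it can be as simple as a wrapper that occasionally injects latency or errors into a dependency call, switched on deliberately during a Game Day. A sketch, with made-up names and rates:

```python
import functools
import os
import random
import time

# Only inject faults when deliberately enabled, e.g. during a Game Day.
CHAOS_ENABLED = os.environ.get("CHAOS_ENABLED") == "1"

def inject_faults(latency_s=0.5, latency_rate=0.10, error_rate=0.05):
    """Wrap a dependency call and occasionally slow it down or fail it."""
    def decorator(fn):
        @functools.wraps(fn)
        def wrapper(*args, **kwargs):
            if CHAOS_ENABLED:
                if random.random() < latency_rate:
                    time.sleep(latency_s)  # simulate a slow dependency
                if random.random() < error_rate:
                    raise TimeoutError("injected fault: dependency timed out")
            return fn(*args, **kwargs)
        return wrapper
    return decorator

@inject_faults()
def fetch_recommendations(user_id):
    # Stand-in for a real downstream call.
    return ["item-1", "item-2"]

if __name__ == "__main__":
    print(fetch_recommendations("user-42"))
```

The interesting part isn't the code, it's the exercise: watching how your dashboards, alerts, and on-call engineers respond when the fault is switched on.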
Conclusion
A lot of the beliefs outlined in this post are at least counter-intuitive, if
not somewhat controversial, but I’m nevertheless convinced that they’re true.
That doesn't mean my mind can't be changed, just that it's unlikely. If you
strongly agree or disagree, I’m on the internets. I’d be very curious to hear
about your experiences.