In Pursuit of Production Minimalism — Brandur Leach
While working at Lockheed during the cold war, Kelly
Johnson was reported to have coined KISS (“keep it
simple, stupid”); a principle that suggests glibly that
systems should be designed to be as simple as possible.
While complexity is never a conscious design goal of any
project, it arises inherently as new features are pursued
or new components are introduced. KISS encourages designers
to actively counteract this force by making simplicity an
objective in itself, and thus produce products that are
more maintainable, more reliable, and more flexible. In the
case of jet fighters, that might mean a plane that can be
repaired in the field with few tools and under the
stressful conditions of combat.
During his tenure, Lockheed’s Skunk Works would produce
planes like the U-2 and SR-71; so notable for their
engineering excellence that they’ve left a legacy that we
reflect on even today.
Minimalism in technology
Many of us pursue work in the engineering field because
we’re intellectually curious. Technology is cool, and new
technology is even better. We want to be using what
everyone’s talking about.
Our news sources, meetups, conferences, and even
conversations bias towards shiny new tech that’s either
under active development or being energetically promoted.
Older components that sit quietly and do their job well
disappear into the background.
Over time, technologies are added, but are rarely removed.
Left unchecked, production stacks that have been around
long enough become sprawling patchworks combining
everything under the sun. This effect is dangerous:
-
More parts means more cognitive complexity. If a system
becomes too difficult to understand then the risk of bugs
or operational mishaps increases as developers make
changes without understanding all the intertwined
concerns. -
Nothing operates flawlessly once it hits production.
Every component in the stack is a candidate for failure,
and with sufficient scale, something will be failing all
the time. -
With more technologies engineers will tend to become
jacks of all trades, but masters of none. If a
particularly nefarious problem comes along, it may be
harder to diagnose and repair because there are few
specialists around who are able to dig deeply.
Even knowing this, the instinct to expand our tools is hard
to suppress. Oftentimes persuasion is a core competency of
our jobs, and we can use that same power to convince
ourselves and our peers that it’s critical to get new
technologies into our stack right now. That Go-based HA
key/value store will take our uptime and fault resilience
to new highs. That real-time event stream will enable an
immutable ledger that will become foundational keystone for
the entire platform. That sexy new container orchestration
system that will take ease of deployment and scaling to new
levels. In many cases, a step back and a moment of
dispassionate thought would reveal that their use could be
withheld until a time when they’re known to be well vetted,
and it’s well understood how they’ll fit into the current
architecture (and what they’ll replace).
Through ephemeralization
In his book Nine Chains to the Moon (published 1938),
inventor R. Buckminster Fuller described the idea of
ephemeralization:
Do more and more with less and less until eventually you
can do everything with nothing.
It suggests improving increasing productive output by
continually improving the efficiency of a system even while
keeping input the same. I project this onto technology to
mean building a stack that scales to more users and more
activity while the people and infrastructure supporting it
stay fixed. This is accomplished by building systems that
are more robust, more automatic, and less prone to problems
because the tendency to grow in complexity that’s inherent
to them has been understood, harnessed, and reversed.
For a long time we had a very big and very aspirational
goal of ephemeralization at Heroku. The normal app platform
that we all know was referred to as “user space” while the
internal infrastructure that supported it was called
“kernel space”. We want to break up the kernel in the
kernel and move it piece by piece to run inside the user
space that it supported, in effect rebuilding Heroku so
that it itself ran on Heroku. In the ultimate
manifestation of ephemeralization, the kernel would
diminish in size until it vanished completely. The
specialized components that it contained would be retired,
and we’d be left a single perfectly uniform stack.
Realistic? Probably not. Useful? Yes. Even falling short of
an incredibly ambitious goal tends to leave you somewhere
good.
In practice
Here are a few examples of minimalism and ephemeralization
in practice from Heroku’s history:
-
The core database that tracked all apps, users, releases,
configuration, etc. used to be its own special snowflake
hosted on a custom-built AWS instance. It was eventually
folded into Heroku Postgres, and became just one more
node to be managed along with every other customer DB. -
Entire products were retired where possible. For example,
thessl:ip
add-on (providing SSL/TLS terminate for an
app), which used to be provisioned and run on its own
dedicated servers, was end-of-lifed completely when a
better (and cheaper) option for terminating SSL was
available through Amazon. With SNI support now
widespread,ssl:endpoint
will eventually follow suit. -
All non-ephemeral data was moved out of Redis so that the
only data store handling persistent data for internal
apps was Postgres. This had the added advantage of stacks
being able to tolerate a downed Redis and stay online. -
After a misguided foray into production polyglotism, the
last component written in Scala was retired. Fewer
programming languages in use meant that the entire system
became easier to operate, and by more engineers. -
The component that handled Heroku orgs was originally run
as its own microservice. It eventually became obvious
that there had been a time when our microservice
expansion had been a little overzealous, so to simplify
operation, we folded a few services back into the hub.
To recognize the effort that went into tearing down or
replacing old technology, we created a ritual where we
symbolically fed dead components to a flame called a burn
party. The time and energy spent
on some of these projects would in some cases be as great,
or even greater, as it would for shipping a new product.
In ideas
Practicing minimalism in production is mostly about
recognizing that the problem exists. After achieving that,
mitigations are straightforward:
-
Retire old technology. Is something new being
introduced? Look for opportunities to retire older
technology that’s roughly equivalent. If you’re about to
put Kafka in, maybe you can get away with retiring Rabbit
or NSQ. -
Build common service conventions. Standardize on
one database, one language/runtime, one job queue, one
web server, one reverse proxy, etc. If not one, then
standardize on as few as possible. -
Favor simplicity and reduce moving parts. Try to
keep the total number of things in a system small so that
it stays easy to understand and easy to operate. In some
cases this will be a compromise because a technology
that’s slightly less suited to a job may have to be
re-used even if there’s a new one that would technically
be a better fit. -
Don’t use new technology the day, or even the year,
that it’s initially released. Save yourself time and
energy by letting others vet it, find bugs, and do the
work to stabilize it. Avoid it permanently if it doesn’t
pick up a significant community that will help support it
well into the future. -
Avoid custom technology. Software that you write is
software that you have to maintain. Forever. Don’t
succumb to NIH when there’s a well supported public
solution that fits just as well (or even almost as well). -
Use services. Software that you install is software
that you have to operate. From the moment it’s activated,
someone will be taking regular time out of their schedule
to perform maintenance, troubleshoot problems, and
install upgrades. Don’t succumb to NHH (not hosted here)
when there’s a public service available that will do the
job better.
It’s not that new technology should never be introduced,
but it should be done with rational defensiveness, and with
a critical eye in how it’ll fit into an evolving (and
hopefully ever-improving) architecture.
Nothing left to add, nothing left to take away
Antoine de Saint Exupéry, a French poet and pioneering
aviator, had this to say on the subject:
It seems that perfection is reached not when there is
nothing left to add, but when there is nothing left to
take away.
Most of us can benefit from architecture that’s a little
simpler, a little more conservative, and a little more
directed. Only by concertedly building a minimal stack
that’s stable and nearly perfectly operable can we maximize
our ability to push forward with new products and ideas.
In Pursuit of Production Minimalism was published on May 10, 2017 from San Francisco.
Find me on Twitter at @brandur.
Please post comments and discussion to Hacker News.
Did I make a mistake? Please consider sending a pull request.
Photographs by Ben Harrington (SR-71), Robyn Jay (embers of a burning fire), and Md. Al Amin (boat and sky). Licensed under Creative Commons BY-NC-ND 2.0, BY-SA 2.0, and CC BY 2.0 respectively.