Systems design explains the world: volume 1
“Systems design” is a branch of study that tries to find universal
architectural patterns that are valid across disciplines.
You might think that’s not a possibility. Back in university,
students used to tease the Systems Design Engineers, calling it “boxes and
arrows” engineering. Not real engineering, you see, since it didn’t touch
anything tangible, like buildings, motors, hydrochloric acid, or, uh,
electrons.
I don’t think the Systems Design people took this criticism too seriously
since everyone also knew that programme had the toughest admittance criteria
in the whole university.
(A mechanical engineer told me they saw
electrical/computer engineers the same way: waveforms on a
screen instead of real physical things that you could touch, change, and fix.)
I don’t think any of us really understood what boxes-and-arrows engineering
really was back then, but luckily for you, now I’m old. Let me tell you
some stories.
What is systems design?
I started thinking more clearly about systems design when I was at a big
tech company and helped people refine their self-promotion employee
review packets. Most of it was straightforward, helping them map
their accomplishments to the next step up the engineering ladder:
- As a Novice going for Junior, you had to prove you could fix bugs
without too much supervision;
- Going for Senior, you had to prove you could implement a whole design
with little supervision;
- Going for Staff, you had to show you could produce designs based on
business problems with basically no management;
- Going for Senior Staff, you had to solve bigger and bigger business
problems; and so on.
After helping a few dozen people with their assessments,
I noticed a trend. Most developers mapped well onto the ladder, but some
didn’t fit, even though they seemed like great engineers to me.
There were two groups of misfits:
- People who maxed out as a senior engineer (building things) but didn’t
seem to want to, or be able to, make it to staff engineer (translating
business problems).
- People who were ranked at junior levels, but were better at
translating business problems than at fixing bugs.
Group #1 was formally accounted for: the official word was most
employees should never expect to get past Senior Engineer. That’s why they
called it Senior. It wasn’t much consolation to
people who wanted to earn more money or to keep improving for the next 20-30
years of a career, but it was something we could talk about.
(The book Radical
Candor by
Kim Scott has some discussion about how to handle great engineers who just
want to build things. She suggests a separate progression for “rock solid”
engineers, who want to become world-class experts at things they’re
great at, and “steep trajectory” engineers, who might have less attention to
detail but who want to manage ever-bigger goals and jump around a lot.)
People in group #2 weren’t supposed to exist. They were doing some hard jobs
– translating business problems into designs – with great expertise, but
these accomplishments weren’t interesting to the junior-level promotion
committees, who had been trained to look for “exactly one level up”
attributes like deep technical knowledge in one or two specific areas, a
history of rapid and numerous bug fixes, small independent launches, and so
on. Meanwhile, their peers who couldn’t (yet) architect their way out of a
paper bag rose more quickly through the early ranks, because they wrote
reams of code fast.
Tanya Reilly has an excellent talk (and transcribed slides) called Being
Glue that perfectly captures this effect. In her
words: “Glue work is expected when you’re senior… and risky when you’re
not.”
What she calls glue work, I’m going to call systems design. They’re two
sides of the same issue. Humans are the most unruly systems of all, and yet,
amazingly, they follow many of the same patterns as other systems.
People who are naturally excellent at glue work often stall out early in the
prescribed engineering pipeline, even when they’d be great in later stages
(staff engineers, directors, and executives) that traditional
engineers struggle at. In fact, it’s well documented that an
executive in a tech company requires almost a totally different skill set
than a programmer, and rising through the ranks doesn’t prepare
you for that job at all. Many big tech companies hire executives from
outside the company, and sometimes even from outside their own industry, for
that reason.
…but I guess I still haven’t answered the question. What is systems
design? It’s the thing that will eventually kill your project if you do it
wrong, but probably not right away. It’s macroeconomics instead of
microeconomics. It’s fixing which promotion ladders your company even has,
rather than trying to climb the ladders. It’s knowing when a distributed
system is or isn’t appropriate, not just knowing how to build one. It’s
repairing the incentives in a political system, not just getting elected and
passing your favourite laws.
Most of all, systems design is invisible to people who don’t know how to
look for it. At least with code, you can measure output by the line or the
bug, and you can hire more programmers to get more code. With systems
design, the key insight might be a one-sentence explanation given at the
right time to the right person, that affects the next 5 years of work, or is
the difference between hypergrowth and steady growth.
Sorry, I don’t know how to explain it better than that. What I can do
instead is talk about some systems design problems and archetypes that
repeat, over and over, across multiple fields. If you can recognize these
archetypes, and handle them before they kill your project,
you’re on your way to being a systems designer.
Systems of control: hierarchies and decentralization
Let’s start with an obvious one: the problem of centralized vs distributed
control structures. If I ask you which is the
better org structure, a command-and-control hierarchy or a flat
organization, most people have been indoctrinated to say the
latter. Similarly, if I ask whether you should have an old crusty centralized
database or a fancy distributed database, everyone wants to build the latter. If
you’re an SRE and we start talking about pets and cattle, you always vote
for cattle. You’d laugh at me if I suggested using anything but a distributed
software version control system (i.e. git). The future of money, I’ve heard, is
decentralized cryptocurrency. If you want to defeat censorship,
you need a distributed social network. The trend is clear. What’s to debate?
Well, real structures are more complicated than that. The best
introductory article I know on this topic is Jo
Freeman’s The Tyranny of
Structurelessness, which
includes the famous quote: “This apparent lack of structure too often
disguised an informal, unacknowledged and unaccountable leadership that was
all the more pernicious because its very existence was denied.”
“Informal, unacknowledged, and unaccountable” control is just as common in
distributed computing systems as it is in human social systems.
The truth is, nearly every attempt to design a hierarchy-free, “flat”
control system just moves the central control around until you can’t see it
anymore. Human structures all have leaders, whether implicit or explicit,
and the explicit ones tend to be more diverse.
The web depends on centrally
controlled DNS and centrally approved TLS certificate issuers;
the global Internet depends on
a small cabal who sorts out routing
problems.
Every blockchain depends on whoever decides if your preferred chain
will fork this week, and whoever runs the popular exchanges, and
whoever decides whether to arrest those people. Distributed radio networks
depend on centralized government spectrum licenses. Democracy depends on
someone enforcing your right to vote. Capitalism depends on someone
enforcing the rules of a “free” marketplace.
At my first startup, we tried to run the development team as a flat
organization, where everyone’s opinions were listened to and everyone could
debate the best way to do something. The overall consensus was that we
mostly succeeded. But I was shocked when one of my co-workers
said to me afterward: “Our team felt flat and egalitarian. But you can’t
ever forget that it was only that way because you forced it to be that
way.”
Truly distributed systems do exist. Earth’s ecosystem is perhaps one
(although it’s becoming increasingly fragile and dependent on humans not to
break it). Truly distributed databases using Raft
consensus or similar
algorithms certainly exist and work.
Distributed version control (like git) really is distributed, although we
ironically end up re-centralizing our usage of it through something like
GitHub.
CAP theorem is perhaps the
best-known statement of the tradeoffs in distributed systems, between
consistency, availability, and “partition tolerance.” Normally we think of
the CAP theorem as applying to databases, but it applies to all
distributed systems. Centralized databases do well at consistency and
availability, but suck at partition tolerance; so do authoritarian
government structures.
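To make that tradeoff concrete, here’s a toy sketch in Python (all names invented for illustration; nothing like a real database) of the choice each replica faces when the network partitions: refuse writes and stay consistent, or accept writes and risk divergence.

```python
# Toy CAP illustration: during a partition, a replica must choose
# between consistency (refuse writes) and availability (accept them
# and diverge from its peer). Class and field names are made up.

class Replica:
    def __init__(self, name, mode):
        self.name = name
        self.mode = mode          # "CP" or "AP"
        self.value = None
        self.peer = None
        self.partitioned = False

    def write(self, value):
        if self.partitioned:
            if self.mode == "CP":
                # Consistent but unavailable: refuse rather than diverge.
                raise RuntimeError(f"{self.name}: unavailable during partition")
            self.value = value    # available, but now inconsistent
            return
        self.value = value
        self.peer.value = value   # replicate while the network is healthy

a, b = Replica("a", "AP"), Replica("b", "AP")
a.peer, b.peer = b, a
a.write("v1")                     # replicates fine
a.partitioned = b.partitioned = True
a.write("v2")                     # accepted locally: a and b now disagree
assert a.value != b.value         # the price of choosing availability
```

A “CP” replica here is the authoritarian option: it stays consistent by refusing to operate at all when it’s cut off.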
In systems design, there is rarely a single right answer that applies
everywhere. But with centralized vs distributed systems, my rule of thumb is
to do exactly what Jo Freeman suggested: at least make sure the control structure
is explicit. When it’s explicit, you can debug it.
Chicken-egg problems
Another archetypal systems design question is the “chicken-egg problem,”
which is short for: which came first, the chicken or the egg?
In case that’s not a common question where you come from, the idea is
eggs produce chickens, and chickens produce eggs. That’s all fine once it’s
going, but what happened, back in ancient history? Was the very first
step in the first iteration an egg, or a chicken?
The question sounds silly and faux-philosophical at first, but there’s a
real answer and that answer applies to real problems in
the business world.
The answer to the riddle is “neither”; unless you’re a
Bible literalist, you can’t trace back to the Original Chicken that
laid the Original Egg. Instead there was probably a chicken-like bird that
laid a mostly egg-ish egg, and before that, there were millions of years of
evolution, going all the way back to single-celled organisms and whatever
phenomenon first spawned those. What came “first”? All that other stuff.
Chicken-egg problems appear all the time when building software or launching
products. Which came first, HTML5 web browsers or HTML5 web content?
Neither, of course. They evolved in loose synchronization, tracing back to
the first HTML experiments and way
before HTML itself, growing slowly and then quickly in popularity along the
way.
I refer to chicken-egg problems a lot because designers are so often
oblivious to them. Here are some famous chicken-egg problems:
- Electrical distribution networks
- Phone and fax technologies
- The Internet
- IPv6
- Every social network (who will use it if nobody is using it?)
- CDs, DVDs, and Blu-Ray vs HD DVD
- HDTV (1080p etc), 4k TV, 8k TV, 3D TV
- Interstate highways
- Company towns (usually built around a single industry)
- Ivy League universities (could you start a new one?)
- Every new video game console
- Every desktop OS, phone OS, and app store
The defining characteristic of a chicken-egg technology or product is that
it’s not useful to you unless other people use it. Since adopting new
technology isn’t free (in dollars, or time, or both), people aren’t likely
to adopt it unless they can see some value, but until they do, the value
isn’t there, so they don’t. A conundrum.
It’s remarkable to me how many dreamers think they can simply outwait
the problem (“it’ll catch on eventually!”) or outspend the problem (“my new
mobile OS will be great, we’ll just subsidize a few million phones”). And how
many people think getting past a chicken-egg problem, or not, is just
luck.
But no! Just like with real chickens and real eggs, there’s a way to do it
by bootstrapping from something smaller. The main
techniques are to lower the cost of adoption, and to deliver more value even
when there are fewer users.
Video game console makers (Nintendo, Sony, Microsoft) have become skilled
at this; they’re the only ones I know who do it on purpose every few
years. Some tricks they use are:
- Subsidizing the cost of early console sales.
- Backward compatibility, so people who buy can use older games even
before there’s much native content.
- Games that are “mostly the same” but “look better” on the new console.
- Compatible gamepads between generations, so developers can port old
games more easily.
- “Exclusive launch titles”: co-marketing that ensures there’s value up
front for consumers (new games!) and for content producers (subsidies,
free advertising, higher prices).
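To see why those levers work, here’s a toy adoption model (my own invention, purely for illustration): each potential user adopts when standalone value plus network value from existing adopters exceeds the adoption cost. Subsidies lower the cost; backward compatibility raises the standalone value; either one shrinks the critical mass needed for takeoff.

```python
# Toy network-effect model: users adopt when standalone value plus
# network value (proportional to current adopters) exceeds the cost
# of adoption. All numbers are invented for illustration.

def simulate(population, cost, standalone, network_coeff, seed_users, steps=50):
    adopters = seed_users
    for _ in range(steps):
        value = standalone + network_coeff * adopters
        if value >= cost and adopters < population:
            # Worth it now: adoption snowballs (doubling per step keeps
            # the toy simple; real dynamics are messier).
            adopters = min(population, adopters * 2)
        # else: value still below cost, so adoption stalls here
    return adopters

# Pure chicken-egg: no standalone value, high cost -> stuck at the seed.
print(simulate(population=1000, cost=10, standalone=0,
               network_coeff=0.5, seed_users=5))   # -> 5

# Subsidize the cost (or add standalone value, like backward
# compatibility) and the very same seed takes off.
print(simulate(population=1000, cost=2, standalone=0,
               network_coeff=0.5, seed_users=5))   # -> 1000
```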
In contrast, the designs that baffle me the most are ones that absolutely
ignore the chicken-egg problem. Firefox and Ubuntu phones, distributed open
source social networks, alternative app stores, Linux on the desktop,
Netflix competitors.
Followers of this diary have already seen me rant about IPv6: it provides
nearly no value to anyone until it is 100% deployed (so we can finally shut
down IPv4!), but costs immediately in added complexity and maintenance
(building and running a whole parallel Internet). Could IPv6 have been
rolled out faster, if the designers had prioritized unwinding the chicken-egg
problem? Absolutely yes. But they didn’t acknowledge it as the absolute core
of their design problem, the way Android, Xbox, Blu-Ray, and Facebook did.
If your product or company has a chicken-egg problem, and you can’t clearly
spell out your concrete plan for solving it, then investors definitely
should not invest in your company. Solving the chicken-egg problem should be
the first thing on your list, not some afterthought.
By the way, while we’re here, there are even more advanced versions of the
chicken-egg problem. Facebook or faxes are the basic form:
the more people who use Facebook or have a fax machine, the
more value all those users get from each other.
The next level up is a two-sided market, such as Uber or eBay. Nobody can
get a ride from Uber unless there are drivers; but drivers don’t want to
work for Uber unless they can get work. Uber has to attract both kinds of
users (and worse: in the same geographic region! at the same time of day!)
before either kind gets anything from the deal. This is hard. They
decided to spend their way to success, although even Uber was careful to
do so only in a few markets at a time, especially at first.
The most difficult level I know is a three-sided market. For example,
UberEats connects consumers, drivers, and restaurants. Getting a three-sided
market rolling is insanely complicated, expensive, and failure-prone. I
would never attempt it myself, so I’m impressed at the people who try.
UberEats had a head start since Uber had consumers and drivers in
their network already, and only needed to add “one more side” to their
market. Most of their competitors had to attract all three sides just to
start. Whoa.
If you’re building a one-sided, two-sided, or three-sided market, you’d
better understand systems design, chickens, and eggs.
Second-system effect
Taking a detour from business, let’s move to an issue
that engineers experience more directly: second-system effect, a term that
comes from the excellent book, The Mythical
Man-Month,
by Fred Brooks.
Second-system effect arises through the following steps:
- An initial product starts small and is built incrementally, starting
with a low budget and a few users.
- Over time, the product gains popularity and becomes profitable.
- The system evolves, getting more and more hacks on top, and early
design tradeoffs start to be a bottleneck.
- The engineers figure out a new design that would fix all the mistakes
we know about, plus more! (And they’re probably right.)
- Since the product is already popular, it’s easy to justify spending
the time to “do it right this time” and “build a strong platform for the
next 10 years.” So a project is launched to rewrite everything from
scratch. It’s expected to take several months, maybe a couple of years,
and a big engineering team.
Sound familiar? People were trying this back in 1975 when the book was
written, and they’re still trying it now. It rarely goes well; even when it
does work, it’s incredibly painful.
25 years after the book, Joel Spolsky wrote Things You Should Never Do, Part I
about the company-destroying effect of Netscape/Mozilla trying this.
“They did it by making the single worst strategic mistake that any
software company can make: they decided to rewrite the code from scratch.”
[Update 2020-12-28: I mention Joel’s now-20-year-old article not because
Mozilla was such a landmark example, but because it’s such a great
article.]
Some other examples of the second-system effect are IPv6, Python 3,
Perl 6, the Plan 9 OS, and the United States system of government.
The results are remarkably consistent:
- The project takes longer than expected to reach feature parity.
- The new design often does solve the architectural problems in the
original; however, it unexpectedly creates new architectural problems
that weren’t in the original.
- Development time is split (or different developers are assigned)
between maintaining the old system and launching the new system.
- As the project gets increasingly overdue, project managers are
increasingly likely to shut down the old system to force users to
switch to the new one, even though users still prefer the old one.
Second systems can be merely expensive, or they can bankrupt your company,
or destroy your user community. The attention paid to Perl 6 severely weakened
the progress of Perl; the work on Python 3 fractured the Python community
for more than a decade (and still does); IPv6 is obstinately still trying to
deprecate IPv4, 25 years later, even though the problems it was created to
solve are largely obsolete.
As for solutions, there isn’t much to say about the second-system effect
except that you should do your utmost to prevent it; it’s entirely self-inflicted.
Refactor your code instead. Even if it seems like incrementalism will be
more work… it’s worth it. Maintaining two systems in parallel is a lot
more expensive than you think.
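One common shape for that incremental refactoring (sometimes called the “strangler fig” pattern; the sketch below is mine, not Brooks’s) is to put the old and new implementations behind a single interface and migrate one piece at a time, so you never have to maintain two whole systems in parallel:

```python
# Sketch of incremental replacement: route each operation to the old
# or new implementation behind one interface, migrating one piece at
# a time. All function names here are invented for the example.

def legacy_price(item):
    return {"book": 10, "pen": 2}[item]   # the old, crufty code path

def new_price(item):
    return {"book": 10, "pen": 2}[item]   # the rewritten piece, same behavior

MIGRATED = {"pen"}   # flip one item at a time, verifying as you go

def price(item):
    if item in MIGRATED:
        result = new_price(item)
        # Optional safety net while trust is low: shadow-check the old path.
        assert result == legacy_price(item), f"divergence on {item!r}"
        return result
    return legacy_price(item)

print(price("pen"), price("book"))   # new path for pen, old path for book
```

The old code keeps serving users the whole time, and each migrated piece can be verified (and rolled back) on its own.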
In his book, Fred Brooks called it the “second” system on
purpose, because it was his opinion that after experiencing it
once, any designer will build their third and later systems more
incrementally so they never have to go through that again. If you’re
lucky enough to learn from historical wisdom, perhaps even your second
system won’t suffer from this strategic error.
A more embarrassing related problem is when large companies
try to build a replacement for their own first system, but the developers of
the first system have left or have already learned their Second System Lesson
and are not willing to play that game. Thus, a new team is
assembled to build the replacement, without the experience of having built
the first one, but with all the confidence of a group of users who are
intimately experienced with its surface flaws. I don’t even know what this
phenomenon should be called; the vicarious second-system effect? Anyway, my
condolences if you find yourself building or using such a product. You can
expect years of pain.
[Update 2020-12-28: someone reminded me that
CADT (“cascade of attention-deficit
teenagers”) is probably related to this last phenomenon.]
Innovator’s dilemmas
Let’s finally talk about a systems design issue that’s good
news for your startup, albeit bad news for big companies. The
Innovator’s Dilemma
is a great book by Clayton Christensen that discusses a fascinating
phenomenon.
Innovator’s dilemmas are so elegant and beautiful you can hardly believe
they exist as such a repeatable abstraction. Here’s the latest one I’ve
heard about, via an AnandTech article about Apple
Silicon:
[Chart from the article: Apple’s CPU performance improving year over year until it overtakes Intel’s.]
A summary of the Innovator’s Dilemma is as follows:
- You (Intel in this case) make an awesome product in a highly
profitable industry.
- Some crappy startup appears (ARM in this case) and makes a crappy
competing product with crappy specs. The only thing they seem to have
going for them is they can make some low-end garbage for cheap.
- As a big successful company, your whole business is optimized for
improving profits and margins. Your hard-working employees realize that
if they cede the ultra-low-end garbage portion of the market to this
competitor, they’ll have more time to spend on high-valued customers. As
a bonus, your average margin goes up! Genius.
- The next year, your competitor’s product gets just a little bit
better, and you give up the new bottom of your market, and your margins
and profits further improve. This cycle repeats, year after year. (We
call this “retreating upmarket.”)
- The crappy competitor has some kind of structural technical advantage
that allows their performance (however you define performance; something
relevant to your market) to improve, year over year, at a higher
percentage rate than your product can. And/or their product can do
something yours can’t do at all (in ARM’s case: power efficiency).
- Eventually, one year, the crappy competitor’s product finally exceeds
the performance metrics of your own product, and promptly blows your
entire fucking company instantly to smithereens.
Hey now, we’ve started swearing, was that really called for? Yes, I think
so. If I were an Intel executive looking at this chart and Apple’s new
laptops, I would be scared out of my mind right now. There is no more
upmarket to retreat to. The competitor’s product is better, and getting
better faster than mine. The game is already over, and I didn’t even realize
I was playing.
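The scary part is just compound interest. Here’s a back-of-the-envelope sketch (all starting values and improvement rates invented for illustration) of how quickly a 10x performance deficit disappears when the entrant improves at a higher percentage rate:

```python
# Back-of-the-envelope crossover date for two compounding improvement
# curves. The starting values and rates below are made up.
import math

incumbent, incumbent_rate = 100.0, 0.10   # big lead, slower improvement
entrant, entrant_rate = 10.0, 0.40        # 10x behind, improves faster

# Solve entrant * (1 + 0.40)^t = incumbent * (1 + 0.10)^t for t:
t = math.log(incumbent / entrant) / math.log((1 + entrant_rate) / (1 + incumbent_rate))
print(f"crossover in about {t:.1f} years")  # ~9.5 years

# Or step through it year by year:
year = 0
while entrant < incumbent:
    incumbent *= 1 + incumbent_rate
    entrant *= 1 + entrant_rate
    year += 1
print(year)   # 10: the first full year the entrant is ahead
```

For most of those ten years the incumbent’s product is still clearly better, which is exactly why retreating upmarket keeps looking like the smart move right up until it isn’t.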
What makes the Innovator’s Dilemma so beautiful, from a systems design point
of view, is the “dilemma” part. The dilemma comes from the fact that all
large companies are heavily optimized to discard ideas that aren’t as
profitable as their existing core business. Any company that doesn’t
optimize like this fails; by definition their profitability would go down.
So thousands of worker bees propose thousands of low-margin and high-margin
projects, and the company discards the former and invests heavily in the
latter (this is called “sustaining innovation” in the book), and they keep
making more and more money, and all is well.
But this optimization creates a corporate political environment (aha,
you see we’re still talking about systems design?) where, for example,
Intel could never create a product like ARM. A successful low-priced chip
would take time, energy, and profitability away from the high-priced chips,
and literally would have made Intel less successful for years of its history.
Even once ARM appeared and its
trendline of improvements was established, ARM still had lower
margins, so competing with it would still cannibalize Intel’s own high-margin
products; and worse, by then ARM had a head start.
In case you’re a big company reading this: the book has a few suggestions
for what you can do to avoid this trap. But if you’re Intel, you should have
read the book a few years ago, not now.
Innovator’s dilemma plots are the prettiest when discussing hardware and
manufacturing, but the concept applies to software too, especially when
software is held back by a hardware limitation. For example, distributed
version control systems (where you download the entire repository history to
every client) were amusing toys until suddenly disks were big enough and
networks were fast enough, and then
DVCSes wiped out everything else (except in projects with huge media files).
Fancy expensive databases were the only way to get high transaction
throughput,
until SSDs came along and made any dumb database fast enough for most jobs.
Complicated database indexes and schemas were great until AWS came along and
let everyone just brute force mapreduce everything using short-term rental
VMs.
JITs were mostly untenable until memory was so much slower than CPU
that compiling was not the expensive part.
Software-based network packet processing
on a CPU was slower than custom silicon until generic CPUs got
fast enough relative to RAM. And so on.
The Innovator’s Dilemma is the book that first coined the term
“disruptive innovation.” Nowadays, startups talk about disrupting this and
disrupting that. “Disruption” is an exciting word, everybody wants to do it!
The word disruption has lost most of its meaning at this point; it’s
a joke as often as a serious claim.
But in the book, it had a meaning. There are two kinds of innovations:
sustaining and disruptive. Sustaining is the kind that big companies are
great at. If you want to make the fastest x86 processor, nobody does it
better than Intel (with AMD occasionally nipping at their heels). Intel has
every incentive to keep making their x86 processors better. They also charge
the highest margins, which means the greatest profits, which means the most
money available to pour into more sustaining innovation. There is no dilemma; they
dump money and engineers and time into that, and they mostly deliver, and it pays off.
A “disruptive” innovation was meant to refer specifically to the kind you
see in that plot up above: the kind where an entirely new thing sucks for a
very long time, and then suddenly and instantly blows you away. This is the kind
that creates the dilemma.
If you’re a startup and you think you have a truly disruptive innovation,
then that’s great news for you. It’s a perfect answer to that awkward
investor question, “What if [big company] decides to do this too?” because
the honest truth is “their own politics will tear that initiative apart from
the inside.”
The trick is to determine whether you actually have one of
these exact “disruption” things. They’re rare. And as an early startup, you
don’t yet have a historical plot like the one above that makes it clear; you
have to convince yourself that you’ll realistically be able to improve your thing faster
than the incumbent can improve theirs, over a long period of time.
Or, if your innovation only depends on an existing trend – like in the
software-based packet processing example above – then you can try to
time it so that your software product is ready to mature at the same time as
the hardware trend crosses over.
In conclusion: watch out for systems design. It’s the sort of thing that can make
you massively succeed or completely fail, independent of how well you write
code or run your company, and that’s scary. Sometimes you need some boxes
and arrows.