The magic of Working Backwards: a real-world case study

Published in ITNEXT, Aug 7, 2022

During my 11 years at Amazon, I saw the company grow from 3,000 engineers to 60,000 engineers, and the stock grow from $40 to $3,000 (pre-split). That’s astronomical growth by any measure. It was a wild ride.

Sometimes people ask me: What’s the secret sauce?

I actually don’t think there’s a “secret” sauce. Everything is out there for you to see in plain sight. Jeff Bezos’ ruthless business acumen. The intense work environment, over-indexed on delivery and unapologetic about trimming fat. The way Amazon’s leadership principles shape hiring, firing, business decisions and even daily conversations. But above all, Amazon works backwards from the customer. One of my earlier blog posts touched on that a bit (“The importance of Story Telling”).

My career growth during my time at Amazon mirrored Amazon’s growth to some degree. I came in as an SDE-II, very much unsure of myself; I left after having been a Principal Engineer for six years and having shaped a lot of the ways in which Amazon tests its software. A lot of that career growth was also due to Working Backwards.

What is this magic “Working Backwards” thing, you might ask?

It’s very simple. In fact, deceptively simple.

Most software engineers work in a linear fashion, where they have an idea, they write a massive design document that they review with their peers, then once it’s approved they go write the code. Finally, when the thing is feature-complete, they write a customer-facing announcement, and they launch their product, sending that announcement out to the world.

Amazon thinks differently. You write the customer announcement before you write your code, before you write your design, in fact, ideally before you truly consider all the technical roadblocks you may encounter. You review it with your peers, stakeholders, customers, to align. Then, your technical design works backwards from it.

That simple?

Yeah.

But I assert that flipping the order in which things happen changes everything. And it’s not more work: you have to write that customer announcement anyway!

A good customer announcement tells me the WHAT, the WHY and the HOW, and it does not talk about implementation details at all. It’s generally in the format “We’re happy to release X. Before X, our customers struggled because a, b, c. Now with X, our customers are happier because d, e, f. To start using X, do this: …”

Maybe you absolutely love reading technical design docs and you find it easy to digest a 34-page doc that jumps right into the weeds. But for me, reading technical design docs takes a huge amount of cognitive energy. I am old-fashioned and actually print the document, find a comfy place in the office, grab a red pen and a highlighter, and painstakingly build my own mental map of the proposed architecture. Our brains have a finite amount of cognitive capacity, and mine’s depleted by page 11 out of 34. Then, I begrudgingly grab a cup of coffee and force myself to power through the rest.

What happens when I read a customer-facing announcement is that the What and the Why excite me. They energize me. Now I’m intrigued: how is this engineer going to pull this off on the technical side? I’m hooked. I’m a captive audience. And I have seemingly never-ending amounts of cognitive energy to power through the 34 pages of technical junk, because, dang it, I’m genuinely curious. There’s already a mental map built in my head where pieces of the design fit more naturally.

There’s something else that happens though.

The actual shape of a product built with the standard “linear” process is different from the shape of a product built working backwards.

My real world example

Back in 2012, I was dealing with a little problem. Amazon traffic was doubling every year, and to make matters worse, we had specific days (Black Friday, Cyber Monday) where traffic was 3x that of a regular day. There were always architectural bottlenecks lurking in the background, waiting to pop up at the worst moment. We needed to be doing load and performance testing more broadly across the entire company. And I had built the technology to help engineers do just that. But now I needed to figure out how to vend it.

The product shape, had I worked linearly

Some software companies (like Google) have a monorepo, a single, massive code repository, where everybody’s code lives, and everybody works on the head version. This makes it easy to just start using somebody else’s library. Amazon, on the other hand, has thousands of small service-scoped closures, called version sets. You cannot compile code outside your version set: it is a way to isolate you from the rest of the company. You control exactly what libraries live in your version set (and their version). This is great for some things (controlling blast radius, simpler release process) and not-so-great for others (sharing code).

If I want to vend a library to you, it lives in my version set and you need to bring it into your version set. This has a bunch of immediate problems. [1] When you bring my library from my version set into your version set, it brings all my dependencies (including transitive ones) into your version set, so now yours is more bloated. [2] If my library uses dependency-3.2, and your code uses dependency-3.1, you’re now in dependency resolution hell for the next two hours. [3] You’ve brought in a specific version of my library, so unless you import newer versions on a regular cadence you’ll never get new features and bug fixes. [4] Oh, and by the way, every time you refresh my library to address [3], you have problems [1] and [2] again.
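To make problems [1] and [2] concrete, here is a minimal sketch. It assumes a version set can be modeled as a plain map from dependency name to version; that model, the class name and the method names are all hypothetical and are not Amazon’s actual tooling.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical model: a "version set" is just a map of dependency name -> version.
class VersionSetConflictSketch {

    // Problem [2]: dependencies present in both sets, but at different versions.
    static List<String> findConflicts(Map<String, String> yours, Map<String, String> imported) {
        List<String> conflicts = new ArrayList<>();
        // TreeMap gives a stable, alphabetical order for the report.
        for (Map.Entry<String, String> entry : new TreeMap<>(imported).entrySet()) {
            String existing = yours.get(entry.getKey());
            if (existing != null && !existing.equals(entry.getValue())) {
                conflicts.add(entry.getKey() + ": " + existing + " vs " + entry.getValue());
            }
        }
        return conflicts;
    }

    // Problem [1]: dependencies the import drags in that you did not have before.
    static List<String> findNewDependencies(Map<String, String> yours, Map<String, String> imported) {
        List<String> added = new ArrayList<>();
        for (String name : new TreeMap<>(imported).keySet()) {
            if (!yours.containsKey(name)) {
                added.add(name);
            }
        }
        return added;
    }

    public static void main(String[] args) {
        // Made-up contents, for illustration only.
        Map<String, String> yours = Map.of("dependency", "3.1", "logging", "1.0");
        Map<String, String> imported = Map.of("dependency", "3.2", "metrics", "2.0");
        System.out.println("Conflicts: " + findConflicts(yours, imported));
        System.out.println("New dependencies: " + findNewDependencies(yours, imported));
    }
}
```

Every refresh of the library (problem [4]) amounts to running this comparison again against a moving target.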

This was the status quo of vending libraries at Amazon in 2012, so had I not been working backwards, I would have followed that model. It was the natural technical solution to the problem at hand.

But what would the customer experience have been?

As an Amazonian wanting to adopt the library, I would have brought it into my version set first. That was a manual process that involved going to an internal website, finding my version set, clicking on the add button, finding the library, waiting for a little while, then syncing my local copy of the source code, building, and realizing that the stupid library had brought a billion new dependencies into my version set, 7 of them with different versions from the ones I already had. I would have spent the rest of the day reconciling versions in all these libraries.

Then the next day, I would have had to find a way to actually run this thing. I would have to write some sort of main() that calls the library, mess with my build file so that it exports a command-line runnable, and do some ad hoc testing locally.

But ultimately I wanted to run this thing in the Amazon production network, not on my machine. So then I would have had to create an environment to deploy that main(), which again involved a bunch of manual steps interleaved with some time waiting for things to happen. That would consume days 3 and 4.

By day 5, I would probably have an executable deployed to production that could run a load test. But that was just the skeleton, the infrastructure. I would then have to go write some code to actually use the library in a way that was meaningful to my product. That would consume the second week.

If you’re curious, I’ve written a more technical dive deep into all this nightmare here.

The product shape, working backwards

I knew engineers were motivated to load test their product, but I also knew they were incredibly busy. They didn’t have weeks and weeks to just do a basic load test. Spending days on basic infrastructure toil to just onboard a tool is demotivating, and so onboarding is where I could lose most of my customers.

The onboarding customer experience needed to take minutes, not weeks. You would go to a UI, say “give me load environment infrastructure,” maybe provide a little bit of information, and have it instantly.

Working backwards from that, I envisioned an experience where I, as a library vendor, provided an official, production-deployable, ready-to-go parent environment via Amazon’s deployment system. The parent already had an executable with a main() that did all the necessary boilerplate to bootstrap your product-specific test code. All you needed to do was click a button to create a child of that. The child environment would run your command line, on your hardware. And because I was leveraging the official deployment system, I got some freebies: for example, all child environments would refresh automatically when the parent environment was updated, so I could guarantee that all my customers got timely bug fixes and new features.
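The idea of a pre-built main() that bootstraps customer code can be sketched roughly as follows. This is only an illustration under stated assumptions: the class names are hypothetical, the customer code is assumed to implement java.util.concurrent.Callable, and everything lives on one classpath, which sidesteps the cross-version-set problem the article turns to next.

```java
import java.util.concurrent.Callable;

// Hypothetical vendor-owned entry point: customers never write their own main().
class BootstrapMainSketch {

    // Stand-in for product-specific test code a customer would supply.
    static class SampleCustomerTest implements Callable<String> {
        @Override
        public String call() {
            return "load test finished";
        }
    }

    // Looks the customer class up by name at runtime, so the vendor code does not
    // need a compile-time reference to it.
    static Object bootstrap(String className) {
        try {
            Class<?> customerClass = Class.forName(className);
            Object instance = customerClass.getDeclaredConstructor().newInstance();
            if (!(instance instanceof Callable)) {
                throw new IllegalArgumentException(className + " does not implement Callable");
            }
            return ((Callable<?>) instance).call();
        } catch (IllegalArgumentException e) {
            throw e;
        } catch (Exception e) {
            throw new IllegalStateException("could not bootstrap " + className, e);
        }
    }

    public static void main(String[] args) {
        // The class name would come from the child environment's command line.
        String className = args.length > 0 ? args[0] : SampleCustomerTest.class.getName();
        System.out.println(bootstrap(className));
    }
}
```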

I had no idea how to solve this from a technical perspective, but once I articulated it, I knew without a doubt that this was the experience I needed to work backwards from, and I was obsessed with making it happen. At first it seemed impossible because in Java it’s hard to call a piece of code that has a different runtime closure than your runtime closure.

The “linear” approach was easy: the customer would bring my library into their version set, and compile all of it together. Things got neatly linked at compile time, like God intended. The “working backwards” approach meant that now my code, compiled from my version set, needed to bootstrap customer code, compiled in a different version set, at runtime, not compile time.

I started reading up on how Java loads classes, and learned about a fairly obscure, little-known feature of the JVM: you can have multiple classloaders in the same process to provide isolation. So I could load my library in one classloader, and the customer code in another classloader. But that also meant that if the same class was loaded by two classloaders, you couldn’t cast one to the other. I provided an interface class for my customers to implement to interact with the library, but Java refused to cast the version in my classloader to the version in their classloader, even if it was the exact same class.

How, then, could I have my code talk to the customer code across classloaders? I kept going down the rabbit hole and learned about a couple of other fairly obscure features of the JVM to solve that problem: reflection and annotations. I wrote a more technical description of this solution here so if this piques your interest please read it.
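The casting problem and the reflective workaround can be reproduced in a few lines. The following is a minimal sketch, not the author’s actual implementation: the CustomerTest class, its run() method and all helper names are made up for illustration, and the example compiles the “customer” class at runtime with javax.tools purely so that it is self-contained (which assumes it runs on a JDK rather than a bare JRE).

```java
import java.lang.reflect.Method;
import java.net.URL;
import java.net.URLClassLoader;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;
import javax.tools.JavaCompiler;
import javax.tools.ToolProvider;

class ClassLoaderIsolationSketch {

    // Hypothetical stand-in for customer code built in a different version set.
    static final String CUSTOMER_SOURCE =
        "public class CustomerTest { public String run() { return \"ok\"; } }";

    // Compiles the customer class into a fresh temp directory.
    static Path compileCustomerCode() throws Exception {
        Path dir = Files.createTempDirectory("customer-code");
        Path source = dir.resolve("CustomerTest.java");
        Files.write(source, CUSTOMER_SOURCE.getBytes(StandardCharsets.UTF_8));
        JavaCompiler compiler = ToolProvider.getSystemJavaCompiler();
        if (compiler == null) {
            throw new IllegalStateException("no system compiler available; run on a JDK");
        }
        int status = compiler.run(null, null, null, "-d", dir.toString(), source.toString());
        if (status != 0) {
            throw new IllegalStateException("compilation failed");
        }
        return dir;
    }

    // Each call creates a new classloader whose only parent is the bootstrap loader.
    // (The loader is left open for the life of the process to keep the sketch short.)
    static Class<?> loadIsolated(Path dir) throws Exception {
        URLClassLoader loader = new URLClassLoader(new URL[] {dir.toUri().toURL()}, null);
        return loader.loadClass("CustomerTest");
    }

    // True when two loaders yield same-named classes that are not interchangeable.
    static boolean sameNameButIncompatible() {
        try {
            Path dir = compileCustomerCode();
            Class<?> first = loadIsolated(dir);
            Class<?> second = loadIsolated(dir);
            Object instance = first.getDeclaredConstructor().newInstance();
            return first.getName().equals(second.getName())
                && first != second
                && !second.isInstance(instance); // a cast to `second` would fail
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    // Reflection needs no compile-time type, so it works across the loader boundary.
    static Object callAcrossLoaders() {
        try {
            Class<?> customerClass = loadIsolated(compileCustomerCode());
            Object instance = customerClass.getDeclaredConstructor().newInstance();
            Method run = customerClass.getMethod("run");
            return run.invoke(instance);
        } catch (Exception e) {
            throw new IllegalStateException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("same name but incompatible: " + sameNameButIncompatible());
        System.out.println("reflective call returned: " + callAcrossLoaders());
    }
}
```

The point of the sketch is the trade-off: isolation removes the dependency conflicts, but it also takes away ordinary typed calls, so the vendor code has to look methods up by name.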

Java classloaders, reflection, annotations, and the low level details of how these obscure features of the JVM actually work — these are all things that I probably would have never run into in “normal life.” I found them because I was obsessed with delivering a specific customer experience, and trying to find a technical solution to make that happen.
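As an illustration of how annotations and reflection can work together, here is a small sketch. It is an assumption about the general technique, not a description of the actual library: the @LoadTestStep annotation, the CheckoutScenario class and the runner are all hypothetical. The runner matches the annotation by name and reads its value reflectively, which is the kind of lookup that still works when the annotation type was loaded by a different classloader and cannot be referenced directly.

```java
import java.lang.annotation.Annotation;
import java.lang.annotation.ElementType;
import java.lang.annotation.Retention;
import java.lang.annotation.RetentionPolicy;
import java.lang.annotation.Target;
import java.lang.reflect.Method;
import java.util.Map;
import java.util.TreeMap;

class AnnotationDiscoverySketch {

    // Hypothetical marker a customer would put on methods the runner should call.
    @Retention(RetentionPolicy.RUNTIME) // must be retained to be visible via reflection
    @Target(ElementType.METHOD)
    public @interface LoadTestStep {
        String value();
    }

    // Hypothetical customer code.
    public static class CheckoutScenario {
        @LoadTestStep("add-to-cart")
        public String addToCart() { return "added"; }

        @LoadTestStep("checkout")
        public String checkout() { return "paid"; }

        public String notAStep() { return "ignored"; }
    }

    // Invokes every public no-arg method carrying an annotation with the given
    // type name, and returns step name -> result, sorted by step name.
    static Map<String, Object> runAnnotatedSteps(Object target, String annotationName) {
        Map<String, Object> results = new TreeMap<>();
        try {
            for (Method method : target.getClass().getMethods()) {
                for (Annotation annotation : method.getAnnotations()) {
                    Class<? extends Annotation> type = annotation.annotationType();
                    // Compare by name rather than by Class object.
                    if (type.getName().equals(annotationName)) {
                        Object stepName = type.getMethod("value").invoke(annotation);
                        results.put(String.valueOf(stepName), method.invoke(target));
                    }
                }
            }
        } catch (ReflectiveOperationException e) {
            throw new IllegalStateException(e);
        }
        return results;
    }

    public static void main(String[] args) {
        System.out.println(
            runAnnotatedSteps(new CheckoutScenario(), LoadTestStep.class.getName()));
    }
}
```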

How things turned out

Between classloaders, reflection and annotations, I was able to provide an elegant customer experience, and the majority of the complexity of the infrastructure got abstracted away. You just brought the interface package into your version set, created a child of the official load test environment, and could be up and running in minutes.

Eventually the library grew to be used by thousands of services at Amazon, and it’s used even today, a decade later, to ensure availability of AWS services, the amazon.com site, Alexa and Kindle, among others.

I can attribute the growth to a lot of factors. Pure dumb luck. Having the right technology at the right time. Being a relentless cheerleader for load and performance testing at Amazon. Being very deliberate about educating engineers. Building the right strategic partnerships. Securing support from key leaders. But the primary reason it grew like it did was that it reduced the time it took to do a task from weeks to days at most. Nothing else would have mattered if it hadn’t. I painstakingly analyzed every single bit of toil that people faced in doing a task, and I obsessed over automating it away. That was a founding principle of the entire project.

What’s most fascinating to me is that the shape of the product would have been entirely different had I just been working linearly and building a meandering path around technical roadblocks as I encountered them, versus envisioning a desired customer experience upfront and being so inspired by it that it motivated me to figure out creative ways to deal with said technical roadblocks.

So now the ball is in your court. Learn more: there are hundreds of great articles and books that describe Working Backwards and Amazon’s PRFAQ process (the way Amazon does working backwards). Try it. It doesn’t have to be a “proper” PRFAQ (it took me years and dozens of rounds of ruthless feedback to learn how to write one that I could take up to an SVP). You can start with just a half-page or one-page description of the customer announcement. Do this before you write your design or a single line of code. You might find out just how different the shape of the product you build ends up being. And your customers will thank you for it.
