How To Survive a Ground-Up Rewrite Without Losing Your Sanity
This is a guest post by Dan Milstein (@danmil), co-founder of Hut 8 Labs.
Disclosure: Joel Spolsky is a friend and I’m an investor in his company, Stack Exchange (which powers the awesome Stack Overflow) -Dharmesh
So, you know Joel Spolsky’s essay Things You Should Never Do, Part I? In which he urgently recommends that, no matter what, please god listen to me, don’t rewrite your product from scratch? And lists a bunch of dramatic failures when companies have tried to do so?
First off, he’s totally right. Developers tend to spectacularly underestimate the effort involved in such a rewrite (more on that below), and spectacularly overestimate the value generated (more on that below, as well).
But sometimes, on certain rare occasions, you’re going to be justified in rewriting a major part of your product (you’ll notice I’ve shifted to saying you’re merely rewriting a part, instead of the whole product. Please do that. If you really are committed to just rewriting the entire thing from scratch, I don’t know what to tell you).
If you’re considering launching a major rewrite, or find yourself as the tech lead on such a project in flight, or are merely toiling in the trenches of such a project, hoping against hope that it will someday end… this post is for you.
Hello, My Name is Dan, and I’ve Done Some Rewrites
A few years back, I joined a rapidly growing startup named HubSpot, where I ended up working for a good solid while (which was a marvelous experience, btw — you should all have Yoav Shapira as a boss at some point). In my first year there, I was one of the tech leads on a small team that rewrote the Marketing Analytics system (one of the key features of the HubSpot product), totally from scratch. We rewrote the back end (moving from storing raw hit data in SQLServer to processing hits with Hadoop and storing aggregate reports in MySQL); we rewrote the front end (moving from C#/ASP.Net to Java/Tomcat); we got into the guts of a dozen applications which had come to rely on that store of every-hit-ever, and found a way to make them work with the data that was now available. (Note: HubSpot is now primarily powered by MySQL/Hadoop/HBase. Check out the HubSpot dev blog).
It took a loooong time. Much, much longer than we expected.
But it generated a ton of value for HubSpot. Very Important People were, ultimately, very happy about that project. After it wrapped up, ‘Analytics 2.0’, as it was known, somehow went from ‘that project that was dragging on forever’, to ‘that major rewrite that worked out really well’.
Then, after the Analytics Rewrite wrapped up, in my role as 5 Whys Facilitator, I led the post-mortem on another ambitious rewrite which hadn’t fared quite so well. I’ll call it The Unhappy Rewrite.
From all that, some fairly clear lessons emerged.
First, I’m going to talk about why these projects are so tricky. Then I’ll pass on some of those hard-won lessons on how to survive.
Prepare Yourself For This Project To Never Fucking End
The first, absolutely critical thing to understand about launching a major rewrite is that it’s going to take insanely longer than you expect. Even when you try to discount for the usual developer optimism. Here’s why:
- Migrating the data sucks beyond all belief
I’m assuming your existing system has a bunch of valuable data locked up in it (if it doesn’t, congrats, but I just never, ever run into this situation). You think, we’re going to set up a new db structure (or move it all to some NoSQL store, or whatever), and we’ll, I dunno, write some scripts to copy the data over, no problem.
Problem 1: there’s this endless series of weird crap encoded in the data in surprising ways. E.g. “The use_conf field is 1 if we should use the auto-generated configs… but only if the spec_version field is greater than 3. Oh, and for a few months, there was this bug, and use_conf was left blank. It’s almost always safe to assume it should be 1 when it’s blank. Except for customers who bought the Express product, then we should treat it as 2”. You have to migrate all your data over, checksum the living hell out of it, display it back to your users, and then figure out why it’s not what they expect. You end up poring over commit histories, email exchanges with developers who have long since left the company, and line after line of cryptic legacy code. (In prep for writing this, when I mentioned this problem to developers, every single time they cut me off to eagerly explain some specific, awful experience they’ve had on this front — it’s really that bad)
Problem 2: But, wait, it gets worse: because you have a lot of data, it often takes days to migrate it all. So, as you struggle to figure out each of the above weird, persnickety issues with converting the data over, you end up waiting for days to see if your fixes work. And then to find the next issue and start over again. I have vivid, painful memories of watching my friend Stephen (a prototypical Smart Young Engineer), who was a tech lead on the Unhappy Rewrite, working, like, hour 70 of an 80 hour week, babysitting a slow-moving data export/import as it failed over and over and over again. I really can’t communicate how long this takes.
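To make that first problem concrete: the use_conf rules above are exactly the kind of logic that ends up living in your migration code. Here's a minimal sketch in Python; the field names and rules are just the made-up example from above (one possible reading of it, at that), not anything from a real system.

```python
def normalize_use_conf(row):
    """Return the value use_conf *should* have, given the legacy quirks above.

    `row` is assumed to be a dict of column name -> value pulled from the old db.
    """
    spec_version = row.get("spec_version") or 0
    use_conf = row.get("use_conf")

    # Auto-generated configs only apply when spec_version is greater than 3
    # (one interpretation of the rule above).
    if spec_version <= 3:
        return 0

    # A months-long bug left use_conf blank; blank almost always means 1,
    # except for customers who bought the Express product, where it means 2.
    if use_conf in (None, ""):
        return 2 if row.get("product") == "Express" else 1

    return int(use_conf)
```

Every one of those branches is something you discover the hard way, usually because a checksum doesn't match or a customer's report looks wrong.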
- It’s brutally hard to reduce scope
With a greenfield (non-rewrite) project, there is always (always) a severe reduction in scope as you get closer to launch. You start off, expecting to do A, B, C & D, but when you launch, you do part of A. But, often, people are thrilled. (And, crucially, they forget that they had once considered all the other imagined features as absolutely necessary)
With a rewrite, that fails. People are really unhappy if you tell them: hey, we rewrote your favorite part of the product, the code is a lot cleaner now, but we took away half the functionality.
You’ll end up spending this awful series of months implementing all these odd edge cases that you didn’t realize even existed. And backfilling support for features that you’ve been told no one uses any more, but you find out at the last minute some Important Person or Customer does. And, and, and…
- There turn out to be these other systems that use “your” data
You always think: oh, yeah, there are these four screens, I see how to serve those from the new system. But then it turns out that a half-dozen cron jobs read data directly from “your” db. And there’s an initialization step for new customers where something is stored in that db and read back later. And some other screen makes a side call to obtain a count of your data. Etc, etc. Basically, you try turning off the old system briefly, and a flurry of bug reports shows up on your desk, for features written a long time ago, by people who have left the company, but which customers still depend on. Fixing all of this takes forever, all over again.
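One cheap defensive move: before you plan the cutover, go looking for those hidden consumers. If you can get a query log out of the old database (most databases can produce one in some form), a rough sketch of tallying who still touches the legacy tables might look like the following. The table names and the log format here are assumptions for illustration, not anything from the actual system.

```python
import re
from collections import Counter

LEGACY_TABLES = ("hits", "hit_rollups", "visitor_sessions")  # hypothetical names

def consumers_of_legacy_tables(log_lines):
    """Tally (client_host, db_user) pairs whose queries mention any legacy table."""
    tally = Counter()
    for line in log_lines:
        # Assumed log shape: "<timestamp> <client_host> <db_user> <sql text>"
        parts = line.split(None, 3)
        if len(parts) < 4:
            continue
        _, client, user, sql = parts
        if any(re.search(r"\b%s\b" % t, sql, re.IGNORECASE) for t in LEGACY_TABLES):
            tally[(client, user)] += 1
    return tally.most_common()

# Usage (assuming a plain-text query log on disk):
# with open("legacy_db_query.log") as f:
#     for (client, user), n in consumers_of_legacy_tables(f):
#         print(client, user, n)
```

Anything that shows up in that tally and isn't your own app is a surprise you'd rather have now than after the old system is turned off.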
Okay, I’m Sufficiently Scared Now, What Should I Do?
You have to totally own the business value.
First off, before you start, you must define the business value of this rewrite. I mean, you should always understand the big picture value of what you do (see: Rands Test). But with rewrites, it’s often the tech lead, or the developers in general, who are pushing for the rewrite — and then it’s absolutely critical that you understand the value. Because you’re going to discover unexpected problems, and have to make compromises, and the whole thing is going to drag on forever. And if, at the end of all that, the Important People who sign your checks don’t see much value, it’s not going to be a happy day for you.
One thing: be very, very careful if the primary business value is some (possibly disguised) version of “The new system will be much easier for developers to work on.” I’m not saying that’s not a nice bit of value, but if that’s your only or main value… you’re going to be trying to explain to your CEO in six months why nothing seems to have gotten done in development in the last half year.
The key to fixing the “developers will cry less” thing is to identify, specifically, what the current, crappy system is holding you back from doing. E.g. are you not able to pass a security audit? Does the website routinely fall over in a way that customers notice? Is there some sexy new feature you just can’t add because the system is too hard to work with? Identifying that kind of specific problem both means you’re talking about something observable by the rest of the business, and also that you’re in a position to make smart tradeoffs when things blow up (as they will).
As an example, for our big Analytics rewrite, the developers involved sat down with Dan Dunn, the (truly excellent) product guy on our team, and worked out a list of business-visible wins we hoped to achieve. In rough priority order, those were:
- Cut cost of storing each hit by an order of magnitude
- Create new reports that weren’t possible in the old system
- Serve all reports faster
- Serve near-real-time (instead of cached daily) reports
And you should know: that first one loomed really, really large. HubSpot was growing very quickly, and storing all that hit data as individual rows in SQLServer had all sorts of extra costs. The experts on Windows ops were constantly trying to get new SQLServer clusters set up ahead of demand (which was risky and complex and ended up touching a lot of the rest of the codebase). Sales people were told to not sell to prospects with really high traffic, because if they installed our tracking code, it might knock over those key databases (and that restriction injected friction into the sales process). Etc, etc.
Solving the “no more hits in SQLServer” problem is the Hard kind for a rewrite — you only get the value when every single trace of the old system is gone. For the other ones, lower down the list, you’d see some value as each individual report was moved over. That’s a crucial distinction to understand. If at all possible, you want to make sure that you’re not only solving that kind of Hard Problem — find some wins along the way.
For the Unhappy Rewrite, the biz value wasn’t perfectly clear. And, thus, as often happens in that case, everyone assumed that, in the bright, shiny world of the New System, all their own personal pet peeves would be addressed. The new system would be faster! It would scale better! The front end would be beautiful and clever and new! It would bring our customers coffee in bed and read them the paper.
As the developers involved slogged through all the unexpected issues which arose, and had to keep pushing out their release date, they gradually realized how disappointed everyone was going to be when they saw the actual results (because all the awesome, dreamed-of stuff had gotten thrown overboard to try to get the damn thing out the door). This is a crappy, crappy place to be — stressed because people are hounding you to get something long-overdue finished, and equally stressed because you know that thing is a mess.
Okay, so how do you avoid getting trapped in this particular hell?
Worship at the Altar of Incrementalism
Over my career, I’ve come to place a really strong value on figuring out how to break big changes into small, safe, value-generating pieces. It’s a sort of meta-design — designing the process of gradual, safe change.
Kent Beck calls this Succession, and describes it as:
“Design changes are usually most efficiently implemented as a series of safe steps. Succession is the art of taking a single conceptual change, breaking it into safe steps, and then finding an order for those steps that optimizes safety, feedback, and efficiency.”
I love that he calls it an “art” — that feels exactly right to me. It doesn’t happen by accident. You have to consciously work at it, talk out alternatives with your team, get some sort of product owner or manager involved to make sure the early value you’re surfacing matters to customers. It’s a creative act.
And now, let me say, in an angry Old Testament prophet voice: Beware the false incrementalism!
False incrementalism is breaking a large change up into a set of small steps, but where none of those steps generate any value on their own. E.g. you first write an entire new back end (but don’t hook it up to anything), and then write an entire new front end (but don’t launch it, because the back end doesn’t have the legacy data yet), and then migrate all the legacy data. It’s only after all of those steps are finished that you have anything of any value at all.
Fortunately, there’s a very simple test to determine if you’re falling prey to the False Incrementalism: if after each increment, an Important Person were to ask your team to drop the project right at that moment, would the business have seen some value? That is the gold standard.
Going back to my running example: our existing analytics system supported a few thousand customers, and served something like a half dozen key reports. We made an early decision to: a) rewrite all the existing reports before writing new ones, and b) rewrite each report completely, push it through to production, migrate any existing data for that report, and switch all our customers over. And only then move on to the next report.
Here’s how that completely saved us: 3 months into a rewrite which we had estimated would take 3-5 months, we had completely converted a single report. Because we had focused on getting all the way through to production, and on migrating all the old data, we had been forced to face up to how complex the overall process was going to be. We sat down, and produced a new estimate: it would take more like 8 months to finish everything up, and get fully off SQLServer.
At this point, Dan Dunn, who is a Truly Excellent Product Guy because he is unafraid to face a hard tradeoff, said, “I’d like to shift our priorities — I want to build the Sexy New Reports now, and not wait until we’re fully off SQLServer.” We said, “Even if it makes the overall rewrite take longer, and we won’t get off SQLServer this year, and we’ll have to build that one new cluster we were hoping to avoid having to set up?” And he said “Yes.” And we said, “Okay, then.”
That’s the kind of choice you want to offer the rest of your larger team. An economic tradeoff where they can choose between options for what they get, and when. You really, really don’t want to say: we don’t have anything yet, we’re not sure when we will, your only choices are to keep waiting, or to cancel this project and kiss your sunk costs goodbye.
Side note: Dan made 100% the right call (see: Excellent). The Sexy New Reports were a huge, runaway hit. Getting them out sooner rather than later made a big economic impact on the business. Which was good, because the project dragged on past the one-year mark before we could finally kill off SQLServer and fully retire the old system.
For you product dev flow geeks out there, one interesting piece of value we generated early was simply a better understanding of how long the project was going to take. I believe that is what Beck means by “feedback”. It’s real value to the business. If we hadn’t pushed a single report all the way through, we would likely have had, 3-4 months in, a whole bunch of data (for all reports) in some partially built new system, and no better understanding of the full challenge of cutting even one report over. You can see the value the feedback gave us: it let Dan make a much better economic choice. I will make my once-per-blog-post pitch that you should go read Donald Reinertsen’s Principles of Product Development Flow to learn more about how reducing uncertainty generates value for a business.
For the Unhappy Rewrite, they didn’t work out a careful plan for this kind of incremental delivery. Some Totally Awesome Things would happen/be possible when they finished. But they kept on not finishing, and not finishing, and then discovering more ways that the various pieces they were building didn’t quite fit together. In the Post-Mortem, someone summarized it as: “We somehow turned this into a Waterfall project, without ever meaning to.”
But, I Have to Cut Over All at Once, Because the Data is Always Changing
One of the reasons people bail on incrementalism is that they realize that, to make it work, there’s going to be an extended period where every update to a piece of data has to go to both systems (old and new). And that’s going to be a major pain in the ass to engineer. People will think (and even say out loud), “We can’t do that, it’ll add a month to the project to insert a dual-write layer. It will slow us down too much.”
Here’s what I’m going to say: always insert that dual-write layer. Always. It’s a minor, generally somewhat fixed cost that buys you an incredible amount of insurance. It allows you, as we did above, to gradually switch over from one system to another. It allows you to back out at any time if you discover major problems with the way the data was migrated (which you will, over and over again). It means your migration of data can take a week, and that’s not a problem, because you don’t have to freeze writes to both systems during that time. And, as a bonus, it surfaces a bunch of those weird situations where “other” systems are writing directly to your old database.
Again, I’ll quote Kent Beck, writing about how they do this at Facebook:
“We frequently migrate large amounts of data from one data store to another, to improve performance or reliability. These migrations are an example of succession, because there is no safe way to wave a wand and migrate the data in an instant. The succession we use is:
- Convert data fetching and mutating to a DataType, an abstraction that hides where the data is stored.
- Modify the DataType to begin writing the data to the new store as well as the old store.
- Bulk migrate existing data.
- Modify the DataType to read from both stores, checking that the same data is fetched and logging any differences.
- When the results match closely enough, return data from the new store and eliminate the old store.
You could theoretically do this faster as a single step, but it would never work. There is just too much hidden coupling in our system. Something would go wrong with one of the steps, leading to a potentially disastrous situation of lost or corrupted data.”
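To make that succession concrete, here's a minimal sketch of the dual-write/dual-read idea in Python. The old_store and new_store objects, and their get/put interface, are assumptions for illustration; this isn't Facebook's DataType or HubSpot's code, just the shape of the thing.

```python
import logging

log = logging.getLogger("migration.dualstore")

class DualStore:
    """Hides which store the data lives in while both systems are alive."""

    def __init__(self, old_store, new_store, read_from_new=False):
        self.old = old_store
        self.new = new_store
        self.read_from_new = read_from_new  # flip once the data matches closely enough

    def put(self, key, value):
        # Every write goes to both stores; the old store stays authoritative.
        self.old.put(key, value)
        try:
            self.new.put(key, value)
        except Exception:
            # A failed new-store write is logged and repaired later by the bulk
            # migration, not surfaced to the user.
            log.exception("new-store write failed for %s", key)

    def get(self, key):
        # Read from both, log any differences, serve the authoritative one.
        old_val = self.old.get(key)
        try:
            new_val = self.new.get(key)
            if new_val != old_val:
                log.warning("mismatch for %s: old=%r new=%r", key, old_val, new_val)
        except Exception:
            log.exception("new-store read failed for %s", key)
            new_val = None
        return new_val if self.read_from_new else old_val
```

The mismatch log is the whole point: it's the feedback loop that tells you when you've actually earned the right to flip read_from_new and, eventually, delete the old store.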
Abandoning the Project Should Always Be on the Table
If a 3-month rewrite is economically rational, but a 13-month one is a giant loss, you’ll generate a lot of value by realizing which of those two you’re actually facing. Unfortunately, the longer you soldier on, the harder it is for people to avoid the Fallacy of Sunk Costs. The solution: if you have any uncertainty about how long it’s going to take, sequence your work to reduce that uncertainty right away, and give people some “finished” thing that will let them walk away. One month in, you can still say: we’ve decided to only rewrite the front end. Or: we’re just going to insert an API layer for now. Or, even: this turned out to be a bad idea, we’re walking away. Six months in, with no end in sight, that’s incredibly hard to do (even if it’s still the right choice, economically).
Some Specific Tactics
Shrink Ray FTW
This is an excellent idea, courtesy of Kellan Elliott-McCrea, CTO of Etsy. He describes it as follows:
“We have a pattern we call shrink ray. It’s a graph of how much the old system is still in place. Most of these run as cron jobs that grep the codebase for a key signature. Sometimes usage is from wire monitoring of a component. Sometimes there are leaderboards. There is always a party when it goes to zero. A big party.
Gives a good sense of progress and scope, especially as the project is rolling, and a good historical record of how long this shit takes.”
I’ve just started using Shrink Ray on a rewrite I’m tackling right now, and I will say: it’s fairly awesome. Not only does it give you the wins above, but, it also forces you to have an early discussion about what you are shrinking, and who in the business cares. If you make the right graph, Important People will be excited to see it moving down. This is crazy valuable.
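A bare-bones shrink ray can be as dumb as a cron job that counts some “key signature” of the old system in your codebase and appends the count to a file you can graph. A sketch, where the signature, paths, and output format are all hypothetical:

```python
import datetime
import pathlib

SIGNATURE = "LegacyAnalyticsClient"          # hypothetical marker of the old system
CODE_ROOT = pathlib.Path("/srv/checkouts/main")
OUTPUT = pathlib.Path("/var/log/shrink_ray.csv")

def count_signature():
    """Count occurrences of the old-system signature across the codebase."""
    count = 0
    for path in CODE_ROOT.rglob("*.py"):     # widen the glob to cover your languages
        try:
            count += path.read_text(errors="ignore").count(SIGNATURE)
        except OSError:
            continue
    return count

if __name__ == "__main__":
    # Run daily from cron; graph the CSV; throw the party when it hits zero.
    with OUTPUT.open("a") as f:
        f.write(f"{datetime.date.today().isoformat()},{count_signature()}\n")
```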
Engineer The Living Hell Out Of Your Migration Scripts
It’s very easy to think of the code that moves data from the old system to the new as a collection of one-off scripts. You write them quickly, don’t comment them too carefully, don’t write unit tests, etc. All of which are generally valid tradeoffs for code which you’re only going to run once.
But, see above, you’re going to run your migrations over and over to get them right. Plus, you’re converting and summing up and copying over data, so you really, really want some unit tests to catch whatever errors you can as early as possible (because “data” is, to a first approximation, “a bunch of opaque numbers which don’t mean anything to you, but which people will be super pissed off about if they’re wrong”). And this thing is going to happen, where someone will accidentally hit ctrl-c, and kill your 36-hour migration at hour 34. Thus, taking the extra time to make the entire process strongly idempotent will pay off over and over (by strongly idempotent, I mean, e.g. you can restart after a failed partial run and it will pick up most of the existing work).
Basically, treat your migration code as a first class citizen. It will save you a lot of time in the long run.
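Here's a sketch of what “strongly idempotent” can mean in practice: the migration records which batches it has already copied, so a ctrl-c at hour 34 costs you minutes, not the whole run. The batch structure and function names are assumptions for illustration.

```python
import json
import pathlib

CHECKPOINT = pathlib.Path("migration_checkpoint.json")

def load_done():
    """Read the set of batch ids already copied on previous runs."""
    if CHECKPOINT.exists():
        return set(json.loads(CHECKPOINT.read_text()))
    return set()

def mark_done(done, batch_id):
    done.add(batch_id)
    CHECKPOINT.write_text(json.dumps(sorted(done)))

def migrate(batches, copy_batch):
    """batches: iterable of (batch_id, rows); copy_batch: writes rows to the new store."""
    done = load_done()
    for batch_id, rows in batches:
        if batch_id in done:
            continue          # already copied on a previous run; skip cheaply
        copy_batch(rows)      # must itself be safe to re-run (e.g. upserts, not blind inserts)
        mark_done(done, batch_id)
```

The checkpoint file is crude, but it turns “start the 36-hour run over from scratch” into “pick up where you left off,” which is most of the battle.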
If Your Data Doesn’t Look Weird, You’re Not Looking Hard Enough
What’s best is if you can get yourself to think about the problem of building confidence in your data as a real, exciting engineering challenge. Put one of your very best devs to work attacking both the old and the new data, writing tools to analyze it all, discover interesting invariants and checksums.
A good rule of thumb for migrating and checksumming data: until you’ve found a half-dozen bizarre inconsistencies in the old data, you’re not done. For the Analytics Rewrite, we created a page on our internal wiki called “Data Infelicities”. It got to be really, really long.
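As a sketch of what that checksumming can look like: compute the same aggregate from both systems (say, hits per customer per day), diff them, and chase down every mismatch. The data shapes here are assumptions; the point is the comparison.

```python
def compare_daily_hit_counts(old_counts, new_counts):
    """Both arguments: dict mapping (customer_id, date) -> hit count.

    Returns a list of (key, old_value, new_value) for every disagreement,
    including keys that exist in only one system.
    """
    infelicities = []
    for key in sorted(set(old_counts) | set(new_counts)):
        old_val = old_counts.get(key, 0)
        new_val = new_counts.get(key, 0)
        if old_val != new_val:
            infelicities.append((key, old_val, new_val))
    return infelicities

# Usage: dump each mismatch onto your "Data Infelicities" page and investigate.
# By the rule of thumb above, the first half-dozen are just the beginning.
```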
With Great Incrementalism Comes Great Power
I want to wrap up by flipping this all around — if you learn to approach your rewrites with this kind of ferocious, incremental discipline, you can tackle incredibly hard problems without fear. Which is a tremendous capability to offer your business. You can gradually rewrite that unbelievably horky system that the whole company depends on. You can move huge chunks of data to new data stores. You can pick up messy, half-functional open source projects and gradually build new products around them.
It’s a great feeling.
—
What’s your take? Care to share any lessons learned from an epic rewrite?