Building Infrastructure Platforms

Software has come a long way over the past 20 years. Not only has the
pace of delivery increased, but the architectural complexity of systems
being developed has also soared to match that pace.

Not that building software was simple in the “good” old days. If you
wanted to stand up a simple web service for your business, you’d probably
have to:

Schedule in some time with an infrastructure team to find a spare
[patched] rack server.
Spend days repeatedly configuring a bunch of load balancers and domain
names.
Persuade/cajole/bribe an IT admin to let you safelist traffic through
your corporate firewall.
Figure out whatever FTP incantation would work best for your
cobbled-together go-live script.
Make a ritual sacrifice to the cruel and fickle Gods Of Prod to bless
your service with good fortune.

Thankfully we’ve moved (or rather, we’re moving) away from this
traditional “bare metal” IT setup to one where teams are better able to
Build It & Run It. In this brave, new-ish world teams can configure their
infrastructure in a similar way to how they write their services, and can in
turn benefit from owning the entire system.

In this fresh and glistening new dawn of possibility, teams can build and
host their products and services in whatever Unicorn configuration they
choose. They can be selective with their hosting providers, technologies and
monitoring strategies. They can invent a million different ways to create
the same thing – And almost certainly do! However once your organisation has
reached a certain size, it might no longer be efficient to have your teams
building their own infrastructure. Once you start solving the same problems
over and over again it might be time to start investing in a “Platform”.

An Infrastructure Platform provides common cloud components for teams to
build upon and use to create their own solutions. All of the hosting
infrastructure (all the networking, backups, compute etc) can be managed by
the “platform team”, leaving developers free to build their solution without
having to worry about it.

By building infrastructure platforms you can save time for product teams,
reduce your cloud spend and increase the security and rigour of your
infrastructure. For these reasons, more and more execs are finding the
budget to spin up separate teams to build platform infrastructure.
Unfortunately this is where things can start to go wrong. Luckily we have
been through the ups and downs of building infrastructure platforms and have
put together some essential steps to ensure platform success!

Have a strategy with a measurable goal

“We didn’t achieve our goal” is probably the worst thing you could hear
from your stakeholders after working for weeks or months on something. In
the world of infrastructure platforms this is problematic and can lead to
your execs deciding to scrap the idea and spending their budget on other
areas (often more product teams which can exacerbate the problem!)
Preventing this isn’t rocket science – create a goal and a strategy to
deliver it that all of your stakeholders are bought into.

The first step to creating a strategy is to get the right people
together to define the problem. This should be a mixture of product and
technical executives/budget holders aided by SMEs who can help to give
context about what is happening in the organisation. Here are some
examples of good problems statements:

We don’t have enough people with infrastructure capability in our top
15 product teams, and we don’t have the resources to hire the amount we
need, delaying time to market for our products by an average of 6
months

We have had outages of our products totalling 160 hours and over $2
million lost revenue in the past 18 months

These problem statements are honest about the challenge and easy to
understand. If you can’t put together a problem statement maybe you don’t
need an infrastructure platform. And if you have many problems which you
want to tackle by creating an infrastructure platform then do list these
out, but choose one which is the driver and your focus. Having more than
one problem statement can lead to overpromising what your infrastructure
team will achieve and not deliver; prioritising too many things with
different results and not really achieving any.

Now convert your problem statement into a goal. For example:

Provide the top 15 product teams with the infrastructure they can
easily consume to reduce the time to market by an average of 6 months

Have less than 3 hours of outages in the next 18 months

Now you can create a strategy to tackle your problem. Here’s some fun
ideas on how:

Post mortem session(s)

If you followed the previous steps you’ve identified a problem
statement which exists in your organisation, so it’s probably a good
idea to find out why this is a problem. Get everyone who has context of
the problem together for a post mortem session (ideally people who will
have different perspectives and visibility of the problem).
Upfront make sure everyone is committed to the session being a safe
space where honesty is celebrated and blame is absent.
The purpose of the session is to find the root cause of problems. It
can be helpful to:
Draw out a timeline of things which happened which may have
contributed to the problem. Help each other to build the picture of the
potential causes of the problem.
Use the 5 whys technique but make sure you don’t focus on finding a
single root cause, often problems are caused by a combination of factors
together.
Once you’ve found your root causes, ask what needs to change so that
this doesn’t happen again; Do you need to create some security
guidelines? Do you need to ensure all teams are using CI/CD practises
and tooling? Do you need QAs on each team? This list also goes on…

Future backwards session

Map what would need to be true to meet your goal e.g. “all products
have multiple Availability Zones”, “all services must have a five-nines
SLA”.
Now figure out how to make these things true. Do you need to spin an
infrastructure platform team up? Do you need to hire more people? Do you
need to change some governance? Do you need to embed experts such as
infosec into teams earlier in development? And the list goes on…

We highly recommend doing both of these sessions. Using both a past
and future lens can lead to new insights for what you need to do to meet
your goal and solve your problem. Do the post mortem first, as our brains
seem to find it easier to think about the past before the future! If you
only have time for one, then do a future backwards session, because the
scope of this is slightly wider since the future hasn’t happened yet and
can foster wider ideation and outside of the box thinking.

Hopefully by the end of doing one or both of these sessions, you have a
wonderfully practical list of things you need to do to meet your goal.
This is your strategy (side note that visions and goals aren’t
strategies!!! See Good strategy Bad strategy by Richard P. Rumelt).

Interestingly you might decide that spinning up a team to build an
infrastructure platform isn’t part of your strategy and that’s fine! Infra
platforms aren’t something every organisation needs, you can skip the rest
of this article and go read something far more interesting on Martin’s
Blog! If you are lucky enough to be creating an infrastructure platform as
part of your strategy then buckle up for some more stellar advice.

Find out what your customers need

When us Agilists hear about a product which was built but then had no
users to speak of, we roll our eyes knowing that they mustn’t have done
the appropriate user research. So you might find it surprising to know
that many organisations build platform infrastructure, and then can’t get
any teams to use them. This might be because no one needed the product in
the first place. Maybe you built your infrastructure product too late and
they had already built their own? Maybe you built it too early and they
were too busy with their other backlog priorities to care? Maybe what you
built didn’t quite meet their user needs?

So before deciding what to build, do a discovery as you would with a
customer-facing product. For those who haven’t done one before, a
discovery is a (usually) timeboxed activity where a team of people
(ideally the team who will build a solution) try to understand the problem
space/reason they are building something. At the end of this period of
discovery the team should understand who the users of the infrastructure
product are (there can be more than one type of user), what problems the
users have, what the users are doing well, and some high level idea of
what infrastructure product your team will build. You can also investigate
anything else which might be useful, for example what technology people
are using, what people have tried before which didn’t work, governance
which you need to know about etc.

By defining our problem statement as part of our strategy work we
understand the organisation needs. Now we need to understand how this
overlaps with our user needs, (our users being product teams –
predominantly developers). Make sure to focus your activities with your
strategy in mind. For example if your strategy is security focussed, then
you might:

Highlight examples of security breaches including what caused them (use
info from a post mortem if you did one)
Interview a variety of people who are involved in security including Head of
Security, Head of Technology, Tech leads, developers, QAs, Delivery
managers, BAs, infosec.
Map out the existing security lifecycle of a product using workshopping
such as Event Storming. Rinse and repeat with as many teams as you can
within your timeframe that you want your infrastructure platform to be
serving.

If you only do one thing as part of your discovery, do Event
Storming. Get a team or a bunch of teams who will be your customers in a
physical room with a physical wall or on a call with a virtual whiteboard. Draw a
timeline with a start and end point on this diagram. For an infrastructure
platform discovery it can be useful to map from the start of a project to
being live in production with users.

Then ask everyone to map all the things from the start of a project to
it being live in production in sticky notes of one colour.

Next ask the teams to overlay any pain points, things which are
frustrating or things which don’t always go well in another colour.

If you have time, you can overlay any other information which might be
useful to give you an idea of the problem space that your potential users
are facing such as the technologies or systems used, the time it takes for
different parts, different teams which might be involved in the different
parts (this one is useful if you decide to deepdive into an area after the
session). During the session and after the session, the facilitators (aka
the team doing the discovery) should make sure they understand the context
around each sticky, deep diving and doing further investigation into areas
of interest where needed.

Once you’ve done some discovery activities and have got an idea of what
your users need to deliver their customer-facing products, then prioritise
what can deliver the most value the quickest. There are tons of online
resources which can help you shape your discovery – a good one is
gov.uk

Onboard users early

“That won’t work for us” is maybe the worst thing you can hear about
your infrastructure platform, especially if it comes after you’ve done all
the right things and truly understood the needs of your users (developers)
and the needs of their end users. In fact, let’s ask how you might have
gotten into this position. As you break down the infrastructure product
you are creating into epics and stories and really start to get into the
detail, you and your team will be making decisions about the product. Some
decisions you make might seem small and inconsequential so you don’t
validate every little detail with your users, and naturally you don’t want
to slow down or stop your build progress every time a small implementation
detail has to be defined. This is fine by the way! But, if months go by
and you haven’t got feedback about these small decisions you’ve made which
ultimately make up your infrastructure product, then the risk that what
you’re building might not quite work for your users is going to be ever
increasing.

In traditional product development you would define a minimum viable
product (MVP) and get early feedback. One thing we’ve battled with in
general – but even more so with infrastructure platforms – is how to know
what a “viable” product is. Thinking back to what your reason is for
building an infrastructure platform, it might be that viable is when you
have reduced security risk, or decreased time to market for a team however
if you don’t release a product to users (developers on product teams)
until it’s “viable” from this definition, then a “that won’t work for us”
response becomes more and more likely. So when thinking about
infrastructure platforms, we like to think about the Shortest Path to Value
(SPV) as the time when we want our first users to onboard. Shortest Path
to Value is as it sounds, what is the soonest you can get value, either
for your team, your users, your organisation or a mixture. We like the SPV
approach as it helps you continuously think about when the earliest
opportunity to learn is there and push for a thinner slice. So if you
haven’t noticed, the point here is to onboard users as early as possible
so that you can find out what works, find out what doesn’t work and decide
where you should put your next development efforts into improving this
infra product for the wider consumption in your organisation.

Communicate your technical vision

Perhaps unsurprisingly the key here is to make sure you articulate your
technical vision early-on. You want to prevent multiple teams from
building out the same thing as you (it happens!) Make sure your
stakeholders know what you are doing and why. Not only will this build
confidence in your solution, but it’s another opportunity to get early
insight into your product!

Your vision doesn’t have to be some high-fidelity series of UML
masterpieces (though a lot of the common modelling formats there are quite
useful to lean on). Grab a whiteboard and a sharpie/dry-erase marker and
go nuts. When you’re trying to communicate ideas things are going to get
messy, so being easily able to wipe down and start again is key! Try to
avoid the temptation to immediately jump into a CAD program for these
kinds of diagrams, they end up distancing you from the creative
process.

That being said, there are some useful tools out there which are
lightweight enough to implement at this stage. Things like:

C4 Diagrams

This was introduced by Simon Brown way back at the TURN OF THE
MILLENIA. Built on UML concepts, C4 provides not only a vocabulary for
defining systems, but also a method of decomposing a vision into 4
different “Levels” which you can then use to describe different
ideas.

Level 1: Context: The Context diagram is the most “zoomed out” of the 4. Here you
loosely highlight the system being described and how it relates to
neighbouring systems and users. Use this to frame conversations about
interactions with your platform and how your users might onboard.
Level 2: Container: The Container diagram explodes the overall Context into a bunch of
“Containers” which may contain applications and data stores. By drilling
down into some of the applications that describe your platform you can
drive conversations with your team about architectural choices. You can
even take your design to SRE folks to discuss any alerting or monitoring
considerations.
Level 3: Component: Once you understand the containers that make up your platform you can
take things to the next level. Select one of your Containers and explode
it further. See the interactions between the modules in the container
and how they relate to components in other parts of your universe. This
level of abstraction is useful to describe the responsibilities of the
inner workings of your system.
Level 4: Code: The Code diagram is the optional 4th way of describing a system. At
this level you’re literally describing the interactions between classes
and modules at a code level. Given the overhead of creating this kind of
diagram it is often useful to use automated tools to generate them. Do
make sure though that you’re not just producing Vanity Diagrams for the
sake of it. These diagrams can be super useful for describing unusual or
legacy design decisions.

Once you’ve been able to build your technical vision, use it to
communicate your progress! Bring it along to your sprint demos. Use it
to guide design conversations with your team. Take it for a little
day-trip to your next threat modelling exercise. We’ve only scratched
the surface of C4 Diagrams in this piece. There are loads of great
articles out there which explore this in more depth – to explore start with
this article on InfoQ.

And don’t stop there! Remember that although the above techniques
will help guide the conversations for now; software is a living organism
that may be there long after you’ve retired. Being able to communicate
your technical vision as a series of decisions which were able to guide
your hand is another useful tool.

Architectural Decision Records

We’ve spoken about using C4 Diagrams as a means to mapping out your
architecture. By providing a series of “windows” into your architecture
at different conceptual levels, C4 diagrams help to describe software to
different audiences and for different purposes. So whilst C4 Diagrams
are useful for mapping out your architectural present or future; ADRs
are a technique that you can use for describing your architectural
past.

Architectural Decision Records are a lightweight mechanism to
document WHAT and HOW decisions were made to build your software.
Including these in your platform repositories is akin to leaving future
teams/future you a series of well-constructed clues about why the system
is the way it is!

A Sample ADR

There are several good tools available to help you make your ADR
documents consistent (Nat Pryce’s adr-tools is very good). But generally speaking the
format for an Architectural Decision Record is as follows:

1. Title of ADR name description

Date2021-06-09

StatusPending/Accepted/Rejected

ContextA pithy sentence which describes the reason that a decision
needs to be made.

DecisionThe outcome of the decision being made. It’s very useful
to relate the decision to the wider context.

ConsequencesAny consequences that may result from making the decision.
This may relate to the team owning the software, other components
relating to the platform or even the wider organisation.

Who was thereWho was involved in the decision? This isn’t intended to be
a wagging finger in the direction of who qualified the decision or
was responsible for it. Moreover, it’s a way of adding
organisational transparency to the record so as to aid future
conversations.

Ever been in a situation where you’ve identified some weirdness in
your code? Ever wanted to reach back in time and ask whomever made
that decision why something is the way it is? Ever been stuck trying
to diagnose a production outage but for some reason you don’t have any
documentation or meaningful tests? ADRs are a great way to supplement
your working code with a living series of snapshots which document
your system and the surrounding ecosystem. If you’re interested in
reading more about ADRs you can read a little more about them in the
context of Harmel-Law’s Advice Process.

Put yourselves in your users’ shoes

If you have any internal tools or services in your organisation which
you found super easy and pain free to onboard with, then you are lucky!
From our experience it’s still so surprising when you get access to the
things you want. So imagine a world where you have spent time and effort
to build your infrastructure platform and teams who onboard say “wow, that
was easy!”. No matter your reason for building an infrastructure platform,
this should be your aim! Things don’t always go so well if you have to
mandate the usage of your infra products, so you’re going to have to
actually make an effort to make people want to use your product.

In regular product development, we might have people with capabilities
such as user research, service design, content writing, and user
experience experts. When building a platform, it’s easy to forget about
filling these roles but it’s just as important if you want people in your
organisation to enjoy using your platform products. So make sure that
there is someone in your team driving end to end service design of your
infrastructure product whether it is a developer, BA or UX person.

An easy way to get started is to draw out your user journey. Let’s take
an example of onboarding.

Even without context on what this journey is, there are things to look
out for which might signal a not so friendly user experience:

Handoffs between the developer user and your platform team
There are a few loops which might set a developer user back in their
onboarding
Lack of automation – a lot is being done by the platform team
There are 9 steps for our developer user to complete before onboarding
with possible waiting time and delays in between

Ideally you want your onboarding process to look something like
this:

As you can see, there is no Platform team involvement for the
onboarding so it is fully self service, and there are only three steps for
our developer user to follow. To achieve such a great experience for your
users, you need to be thinking about what you can automate, and what you
can simplify. There will be tradeoffs between a simple user journey and a
simple codebase (as described in “don’t over-complicate things”). Both are
important, so you need a strong product owner who can ensure that this
tradeoff works for the reason you are delivering a platform in the first
place i.e. if you are building a platform so that you can take your
products to market faster, then a seamless and quick onboarding process is
super important.

In reality, your onboarding process might look something more like
this

Especially when you release your mvp (see previous section). Apply this
thinking to any other interactions or processes which teams might have to
go through when using your product. By creating a great user experience
(and also having an infra product people want of course), you should not
only have happy users but also great publicity within your organisation so
that other teams want to onboard. Please don’t ignore this advice and get
in a position where your organisation is mandating the usage of your
nightmare-to-consume infrastructure platform and all your developer teams
are sad 🙁

Don’t over-complicate things

All software is broken. Not to put too much of a downer on things, but
every line of code that you write has a very high chance of becoming
quickly obsolete. Every If Statement, design pattern, every line of
configuration has the potential to break or to introduce a weird side
effect. These may manifest themselves as a hard-to-reproduce bug or a
full-blown outage. Your platform is no different! Just because your
product doesn’t have a fancy, responsive UI or highly-available API doesn’t
mean it isn’t liable to develop bugs. And what happens if the thing you’re
building is a platform upon which other teams are building out their own
services?

When you’re developing an infrastructure platform that other teams are
dependent upon; your customers’ dev environments are your production
environments. If your platform takes a tumble you might end up taking
everyone else with you. You really don’t want to risk introducing downtime
into another team’s dev processes. It can erode trust and even end up hurting the
relationships with the very people you were trying to help!

One of the main (and horribly insidious) reasons for bugs in software
relates to complexity. The greater the number of supported features, the
more that your platform is trying to do, the more that can go wrong. But
what’s one of the main reasons for complexity arising in platform
teams?

Conway’s Law, for those that might not already be horribly, intimately
acquainted, states that organizations tend to design systems which mirror
their own internal communication structure. What this means from a
software perspective is that often a system may be designed with certain
“caveats” or “workarounds” which cater for a certain snapshot of time in
an organisation’s history. Whilst this isn’t necessarily a bad thing, it
can too easily influence the design decisions we make on the ground. If
you’re building an API these kinds of design decisions might be
easily-enough handled within the team. But if you’re building a system
with a number of different integrations for many different teams (and
their plethora of different nuances), this gets to be more of a
problem.

So where’s the sweet spot between writing a bunch of finely-grained
components which are really tightly-coupled to business processes, and
building a platform which can support the growth of your organisation?

Generally speaking every component that you write as a team is another
thing that’ll need to be measured, maintained and supported. Granted you
may be limited by existing architectural debt, compliance constraints or
security safeguards. The take away from us here is just to think twice
before you introduce another component to your solution. Every moving part
you develop is an investment in post-live support and another potential
failure mode.

Measure the important stuff

An article about Building Better Infrastructure Platforms would not be
complete without a note about measuring things. We mentioned earlier about
making sure you define a strategy with a measurable goal. So what does
success look like? Is this something you can extract with code? Maybe you
want to increase your users’ deployment frequency by reducing their
operational friction? Maybe your true north is around providing a stable
and secure artifact repository that teams can depend upon? Take some time
to see if you can turn this success metric into a lightweight dashboard.
Being able to celebrate your Wins is a massive boon both for your team’s
morale and for helping to build confidence in your platform with the wider
organisation!

The Four Key Metrics

We literally couldn’t talk about metrics without mentioning this.
From the 2018 book Accelerate, (A brilliant read about the dev team
performance), the four key metrics are a simple enough indicator for
high-performing teams. It’s indicated by:

Delivery lead time

Rather than the time taken between “Please and Thank you” (from
initial ideation through analysis, development and delivery), here we’re
talking about the time it takes from code being committed to code
successfully running in production. The shorter (or perhaps more
importantly the more predictable) the duration of development, the
higher-performing the team can be said to be.

Deployment frequency

Why is the number of times a team deploys their software important?
Typically speaking a high frequency of deployments is also linked to
much smaller deployments. With smaller changesets being deployed into
your production environment, the safer your deployments are and the
easier to both test and remediate if there’s a need to roll back. If you
couple a high deployment frequency with a short delivery lead time you
are much more able to deliver value for your customers quickly and
safely.

Change failure rate

This brings us to “change failure rate”. The fewer times your
deployments fail when you pull the trigger to get into Production, the
higher-performing the team can be said to be. But what defines a deployment
failure? A common misconception is for change failure rate to be equated
to red pipelines only. Whilst this is useful as an indicator for
general CI/CD health; change failure rate actually describes scenarios
where Production has been impaired by a deployment, and required
a rollback or fix-forward to remediate.

If you’re able to keep an eye on this as a
metric, and reflect upon it during your team retrospectives and planning
you might be able to surface areas of technical debt which you can focus
upon.

Mean time to recovery

The last of the 4 key metrics speaks to the recovery time of your
software in the event of a deployment failure. Given that your failed
deployment may result in an outage for your users, understanding your
current exposure gives you an idea of where you might need to spend some
more effort. That’s all very well and good for conventional “Product”
development, but what about for your platform? It turns out the 4 key
metrics are even MORE important if you’re building out a common platform
for folks. Your downtime is now the downtime of other software teams.
You are now a critical dependency in your organisation’s ability to
deliver software!

It’s important to recognise that the 4 key metrics are incredibly useful
trailing indicators – They can give you a measure for how well you’ve
achieved your goals. But what if you’ve not managed to get anyone to adopt
your platform? Arguably the 4 key metrics only become useful once you have
some users. Before you get here, focusing on understanding and promoting
adoption is key!

There are many more options for measuring your software delivery, but
how much is too much? Sometimes by focussing too much on measuring
everything you can miss some of the more obviously-fixable things that
are hiding in plain sight. Recognise that not all facets of platform
design succumb to measurement. Equally, beware so-called “vanity
metrics”. If you choose to measure something please do make sure that
it’s relevant and actionable. If you select a metric that doesn’t turn a
lever for your team or your users, you’re just making more work for
yourselves. Pick the important things, throw away the rest!

Developing an infrastructure platform for other engineering teams may
seem like an entirely different beast to creating more traditional
software. But by adopting some or all of the 7 principles outlined in
this article, we think that you’ll have a much better idea of your
organisation’s true needs, a way to measure your success and ultimately
a way of communicating your intent.

BeauLebens.com

An aggregation of Beau on the internet

Building Infrastructure Platforms