Building Infrastructure Platforms
Software has come a long way over the past 20 years. Not only has the
    pace of delivery increased, but the architectural complexity of systems
    being developed has also soared to match that pace.
Not that building software was simple in the “good” old days. If you
    wanted to stand up a simple web service for your business, you’d probably
    have to:
- Schedule in some time with an infrastructure team to find a spare
[patched] rack server.
- Spend days repeatedly configuring a bunch of load balancers and domain
names.
- Persuade/cajole/bribe an IT admin to let you safelist traffic through
your corporate firewall.
- Figure out whatever FTP incantation would work best for your
cobbled-together go-live script.
- Make a ritual sacrifice to the cruel and fickle Gods Of Prod to bless
your service with good fortune.
Thankfully we’ve moved (or rather, we’re moving) away from this
    traditional “bare metal” IT setup to one where teams are better able to
    Build It & Run It. In this brave, new-ish world teams can configure their
    infrastructure in a similar way to how they write their services, and can in
    turn benefit from owning the entire system.
In this fresh and glistening new dawn of possibility, teams can build and
    host their products and services in whatever Unicorn configuration they
    choose. They can be selective with their hosting providers, technologies and
    monitoring strategies. They can invent a million different ways to create
    the same thing – and almost certainly do! However, once your organisation has
    reached a certain size, it might no longer be efficient to have your teams
    building their own infrastructure. Once you start solving the same problems
    over and over again it might be time to start investing in a “Platform”.
An Infrastructure Platform provides common cloud components for teams to
    build upon and use to create their own solutions. All of the hosting
    infrastructure (all the networking, backups, compute etc) can be managed by
    the “platform team”, leaving developers free to build their solution without
    having to worry about it.
By building infrastructure platforms you can save time for product teams,
    reduce your cloud spend and increase the security and rigour of your
    infrastructure. For these reasons, more and more execs are finding the
    budget to spin up separate teams to build platform infrastructure.
    Unfortunately this is where things can start to go wrong. Luckily we have
    been through the ups and downs of building infrastructure platforms and have
    put together some essential steps to ensure platform success!
Have a strategy with a measurable goal
“We didn’t achieve our goal” is probably the worst thing you could hear
      from your stakeholders after working for weeks or months on something. In
      the world of infrastructure platforms this is problematic and can lead to
      your execs deciding to scrap the idea and spending their budget on other
      areas (often more product teams, which can exacerbate the problem!).
      Preventing this isn’t rocket science: create a goal, and a strategy to
      deliver it, that all of your stakeholders are bought into.
The first step to creating a strategy is to get the right people
      together to define the problem. This should be a mixture of product and
      technical executives/budget holders aided by SMEs who can help to give
      context about what is happening in the organisation. Here are some
      examples of good problem statements:
- We don’t have enough people with infrastructure capability in our top
15 product teams, and we don’t have the resources to hire the number we
need, delaying time to market for our products by an average of 6
months
- We have had outages of our products totalling 160 hours and over $2
million lost revenue in the past 18 months
These problem statements are honest about the challenge and easy to
      understand. If you can’t put together a problem statement maybe you don’t
      need an infrastructure platform. And if you have many problems which you
      want to tackle by creating an infrastructure platform then do list these
      out, but choose one which is the driver and your focus. Having more than
      one problem statement can lead to overpromising on what your infrastructure
      team will achieve and then underdelivering: prioritising too many things with
      different outcomes and not really achieving any of them.
Now convert your problem statement into a goal. For example:
- Provide the top 15 product teams with the infrastructure they can
easily consume to reduce the time to market by an average of 6 months
- Have less than 3 hours of outages in the next 18 months
Now you can create a strategy to tackle your problem. Here are some fun
      ideas on how:
Post mortem session(s)
- If you followed the previous steps you’ve identified a problem
statement which exists in your organisation, so it’s probably a good
idea to find out why this is a problem. Get everyone who has context of
the problem together for a post mortem session (ideally people who will
have different perspectives and visibility of the problem).
- Upfront, make sure everyone is committed to the session being a safe
space where honesty is celebrated and blame is absent.
- The purpose of the session is to find the root cause of problems. It
can be helpful to:
 - Draw out a timeline of things which happened which may have
contributed to the problem. Help each other to build the picture of the
potential causes of the problem.
 - Use the 5 whys technique, but make sure you don’t focus on finding a
single root cause; often problems are caused by a combination of factors
together.
- Once you’ve found your root causes, ask what needs to change so that
this doesn’t happen again. Do you need to create some security
guidelines? Do you need to ensure all teams are using CI/CD practices
and tooling? Do you need QAs on each team? This list also goes on…
Future backwards session
- Map what would need to be true to meet your goal e.g. “all products
have multiple Availability Zones”, “all services must have a five-nines
SLA”.
- Now figure out how to make these things true. Do you need to spin an
infrastructure platform team up? Do you need to hire more people? Do you
need to change some governance? Do you need to embed experts such as
infosec into teams earlier in development? And the list goes on…
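To put rough numbers on statements such as a five-nines SLA or an outage goal, the sketch below converts between an availability target and a downtime budget. It is a minimal illustration rather than part of any particular tool: the function names are invented for this example, and 18 months is approximated as 548 days.

```python
from datetime import timedelta


def downtime_budget(availability: float, period: timedelta) -> timedelta:
    """Downtime allowed over `period` by an availability target between 0 and 1."""
    if not 0.0 <= availability <= 1.0:
        raise ValueError("availability must be between 0 and 1")
    return timedelta(seconds=period.total_seconds() * (1.0 - availability))


def availability_for(downtime: timedelta, period: timedelta) -> float:
    """Availability implied by a given amount of downtime over `period`."""
    return 1.0 - downtime.total_seconds() / period.total_seconds()


YEAR = timedelta(days=365)
EIGHTEEN_MONTHS = timedelta(days=548)  # assumption: 18 months is roughly 548 days

# "Five nines" means 99.999% availability.
five_nines_budget = downtime_budget(0.99999, YEAR)
# The example goal and the example problem statement from earlier in the article.
goal_availability = availability_for(timedelta(hours=3), EIGHTEEN_MONTHS)
past_availability = availability_for(timedelta(hours=160), EIGHTEEN_MONTHS)

print(f"Five nines allows {five_nines_budget.total_seconds() / 60:.1f} minutes of downtime a year")
print(f"3 hours of outages in 18 months is {goal_availability:.3%} availability")
print(f"160 hours of outages in 18 months is {past_availability:.3%} availability")
```

Under these assumptions a five-nines SLA leaves only a few minutes of downtime per year, which is far stricter than a goal of under 3 hours of outages in 18 months. Running the numbers like this can help check that the "true" statements from a future backwards session agree with the goal they are meant to serve.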
We highly recommend doing both of these sessions. Using both a past
      and future lens can lead to new insights for what you need to do to meet
      your goal and solve your problem. Do the post mortem first, as our brains
      seem to find it easier to think about the past before the future! If you
      only have time for one, then do a future backwards session: its scope
      is slightly wider, since the future hasn’t happened yet, which
      can foster wider ideation and outside-of-the-box thinking.
Hopefully by the end of doing one or both of these sessions, you have a
      wonderfully practical list of things you need to do to meet your goal.
      This is your strategy (side note: visions and goals aren’t
      strategies!!! See Good Strategy Bad Strategy by Richard P. Rumelt).
Interestingly you might decide that spinning up a team to build an
      infrastructure platform isn’t part of your strategy, and that’s fine! Infra
      platforms aren’t something every organisation needs, so you can skip the rest
      of this article and go read something far more interesting on Martin’s
      Blog! If you are lucky enough to be creating an infrastructure platform as
      part of your strategy then buckle up for some more stellar advice.
Find out what your customers need
When we Agilists hear about a product which was built but then had no
      users to speak of, we roll our eyes, knowing that the team can’t have done
      the appropriate user research. So you might find it surprising to know
      that many organisations build platform infrastructure and then can’t get
      any teams to use it. This might be because no one needed the product in
      the first place. Maybe you built your infrastructure product too late and
      they had already built their own? Maybe you built it too early and they
      were too busy with their other backlog priorities to care? Maybe what you
      built didn’t quite meet their user needs?
So before deciding what to build, do a discovery as you would with a
      customer-facing product. For those who haven’t done one before, a
      discovery is a (usually) timeboxed activity where a team of people
      (ideally the team who will build a solution) try to understand the problem
      space/reason they are building something. At the end of this period of
      discovery the team should understand who the users of the infrastructure
      product are (there can be more than one type of user), what problems the
      users have, what the users are doing well, and some high level idea of
      what infrastructure product your team will build. You can also investigate
      anything else which might be useful, for example what technology people
      are using, what people have tried before which didn’t work, governance
      which you need to know about etc. 
By defining our problem statement as part of our strategy work we
      understand the organisation’s needs. Now we need to understand how these
      overlap with our users’ needs (our users being product teams –
      predominantly developers). Make sure to focus your activities with your
      strategy in mind. For example if your strategy is security focussed, then
      you might: 
- Highlight examples of security breaches including what caused them (use
info from a post mortem if you did one)
- Interview a variety of people who are involved in security including Head of
Security, Head of Technology, Tech leads, developers, QAs, Delivery
managers, BAs, infosec.
- Map out the existing security lifecycle of a product using a workshopping
technique such as Event Storming. Rinse and repeat with as many of the teams
you want your infrastructure platform to serve as you can fit within your
timeframe.
If you only do one thing as part of your discovery, do Event
      Storming. Get a team or a bunch of teams who will be your customers in a
      physical room with a physical wall or on a call with a virtual whiteboard. Draw a
      timeline with a start and end point on this diagram. For an infrastructure
      platform discovery it can be useful to map from the start of a project to
      being live in production with users. 
Then ask everyone to map all the things from the start of a project to
      it being live in production in sticky notes of one colour. 
Next ask the teams to overlay any pain points, things which are
      frustrating or things which don’t always go well in another colour. 
If you have time, you can overlay any other information which might be
      useful to give you an idea of the problem space that your potential users
      are facing such as the technologies or systems used, the time it takes for
      different parts, different teams which might be involved in the different
      parts (this one is useful if you decide to deepdive into an area after the
      session). During the session and after the session, the facilitators (aka
      the team doing the discovery) should make sure they understand the context
      around each sticky, deep diving and doing further investigation into areas
      of interest where needed. 
Once you’ve done some discovery activities and have got an idea of what
      your users need to deliver their customer-facing products, then prioritise
      what can deliver the most value the quickest. There are tons of online
      resources which can help you shape your discovery – a good one is
      gov.uk
Onboard users early
“That won’t work for us” is maybe the worst thing you can hear about
      your infrastructure platform, especially if it comes after you’ve done all
      the right things and truly understood the needs of your users (developers)
      and the needs of their end users. In fact, let’s ask how you might have
      gotten into this position. As you break down the infrastructure product
      you are creating into epics and stories and really start to get into the
      detail, you and your team will be making decisions about the product. Some
      decisions you make might seem small and inconsequential so you don’t
      validate every little detail with your users, and naturally you don’t want
      to slow down or stop your build progress every time a small implementation
      detail has to be defined. This is fine by the way! But, if months go by
      and you haven’t got feedback about these small decisions you’ve made which
      ultimately make up your infrastructure product, then the risk that what
      you’re building might not quite work for your users is going to be ever
      increasing. 
In traditional product development you would define a minimum viable
      product (MVP) and get early feedback. One thing we’ve battled with in
      general – but even more so with infrastructure platforms – is how to know
      what a “viable” product is. Thinking back to your reason for
      building an infrastructure platform, it might be that viable is when you
      have reduced security risk, or decreased time to market for a team. However,
      if you don’t release a product to users (developers on product teams)
      until it’s “viable” by this definition, then a “that won’t work for us”
      response becomes more and more likely. So when thinking about
      infrastructure platforms, we like to think about the Shortest Path to Value
      (SPV) as the time when we want our first users to onboard. Shortest Path
      to Value is as it sounds: what is the soonest you can get value, whether
      for your team, your users, your organisation or a mixture? We like the SPV
      approach as it helps you continuously look for the earliest
      opportunity to learn and push for a thinner slice. So if you
      haven’t noticed, the point here is to onboard users as early as possible
      so that you can find out what works, find out what doesn’t work and decide
      where you should put your next development efforts into improving this
      infra product for the wider consumption in your organisation. 
Communicate your technical vision
Perhaps unsurprisingly the key here is to make sure you articulate your
      technical vision early on. You want to prevent multiple teams from
      building out the same thing as you (it happens!). Make sure your
      stakeholders know what you are doing and why. Not only will this build
      confidence in your solution, but it’s another opportunity to get early
      insight into your product!
Your vision doesn’t have to be some high-fidelity series of UML
      masterpieces (though a lot of the common modelling formats there are quite
      useful to lean on). Grab a whiteboard and a sharpie/dry-erase marker and
      go nuts. When you’re trying to communicate ideas things are going to get
      messy, so being easily able to wipe down and start again is key! Try to
      avoid the temptation to immediately jump into a CAD program for these
      kinds of diagrams; they end up distancing you from the creative
      process.
That being said, there are some useful tools out there which are
      lightweight enough to implement at this stage. Things like:
C4 Diagrams
This was introduced by Simon Brown way back at the TURN OF THE
        MILLENNIUM. Built on UML concepts, C4 provides not only a vocabulary for
        defining systems, but also a method of decomposing a vision into 4
        different “Levels” which you can then use to describe different
        ideas.
- Level 1: Context
 - The Context diagram is the most “zoomed out” of the 4. Here you
loosely highlight the system being described and how it relates to
neighbouring systems and users. Use this to frame conversations about
interactions with your platform and how your users might onboard.
- Level 2: Container
 - The Container diagram explodes the overall Context into a bunch of
“Containers” which may contain applications and data stores. By drilling
down into some of the applications that describe your platform you can
drive conversations with your team about architectural choices. You can
even take your design to SRE folks to discuss any alerting or monitoring
considerations.
- Level 3: Component
 - Once you understand the containers that make up your platform you can
take things to the next level. Select one of your Containers and explode
it further. See the interactions between the modules in the container
and how they relate to components in other parts of your universe. This
level of abstraction is useful to describe the responsibilities of the
inner workings of your system.
- Level 4: Code
 - The Code diagram is the optional 4th way of describing a system. At
this level you’re literally describing the interactions between classes
and modules at a code level. Given the overhead of creating this kind of
diagram it is often useful to use automated tools to generate them. Do
make sure though that you’re not just producing Vanity Diagrams for the
sake of it. These diagrams can be super useful for describing unusual or
legacy design decisions. 
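If a whiteboard isn’t to hand, the way the first three levels nest can also be sketched as a small data structure. The example below is only an illustration of the hierarchy, not an official C4 tool or notation: the class names, the `outline` function and the example platform are all invented here, and Level 4 (Code) is left out because it is usually generated from the code itself.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Component:
    name: str
    responsibility: str


@dataclass
class Container:
    name: str
    technology: str
    components: List[Component] = field(default_factory=list)


@dataclass
class SystemContext:
    name: str
    users: List[str]
    neighbours: List[str]
    containers: List[Container] = field(default_factory=list)


def outline(system: SystemContext, level: int) -> List[str]:
    """Describe the system down to C4 level 1 (Context), 2 (Container) or 3 (Component)."""
    if level not in (1, 2, 3):
        raise ValueError("level must be 1, 2 or 3")
    lines = [
        f"[Context] {system.name} | users: {', '.join(system.users)}"
        f" | neighbours: {', '.join(system.neighbours)}"
    ]
    if level >= 2:
        for container in system.containers:
            lines.append(f"  [Container] {container.name} ({container.technology})")
            if level >= 3:
                for component in container.components:
                    lines.append(f"    [Component] {component.name}: {component.responsibility}")
    return lines


# Hypothetical platform, invented purely for illustration.
platform = SystemContext(
    name="Infrastructure Platform",
    users=["product team developers"],
    neighbours=["cloud provider", "corporate identity service"],
    containers=[
        Container(
            name="Provisioning API",
            technology="HTTP service",
            components=[Component("Request validator", "rejects malformed environment requests")],
        ),
        Container(name="Environment catalogue", technology="relational database"),
    ],
)

for depth in (1, 2, 3):
    print(f"--- Level {depth} ---")
    print("\n".join(outline(platform, depth)))
```

Each level adds detail for a narrower audience: the Level 1 outline is enough for an onboarding conversation, while the Level 3 outline is closer to what the platform team itself would discuss.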
Once you’ve been able to build your technical vision, use it to
      communicate your progress! Bring it along to your sprint demos. Use it
      to guide design conversations with your team. Take it for a little
      day-trip to your next threat modelling exercise. We’ve only scratched
      the surface of C4 Diagrams in this piece. There are loads of great
      articles out there which explore this in more depth – a good place to start is
      this article on InfoQ.
And don’t stop there! Remember that although the above techniques
      will help guide the conversations for now, software is a living organism
      that may be there long after you’ve retired. Being able to communicate
      your technical vision as a series of decisions which were able to guide
      your hand is another useful tool. 
Architectural Decision Records
We’ve spoken about using C4 Diagrams as a means of mapping out your
        architecture. By providing a series of “windows” into your architecture
        at different conceptual levels, C4 diagrams help to describe software to
        different audiences and for different purposes. So whilst C4 Diagrams
        are useful for mapping out your architectural present or future, ADRs
        are a technique that you can use for describing your architectural
        past.
Architectural Decision Records are a lightweight mechanism to
        document WHAT and HOW decisions were made to build your software.
        Including these in your platform repositories is akin to leaving future
        teams/future you a series of well-constructed clues about why the system
        is the way it is!
A Sample ADR
There are several good tools available to help you make your ADR
          documents consistent (Nat Pryce’s adr-tools is very good). But generally speaking the
          format for an Architectural Decision Record is as follows:
- Date: 2021-06-09
- Status: Pending/Accepted/Rejected
- Context: A pithy sentence which describes the reason that a decision
needs to be made.
- Decision: The outcome of the decision being made. It’s very useful
to relate the decision to the wider context.
- Consequences: Any consequences that may result from making the decision.
This may relate to the team owning the software, other components
relating to the platform or even the wider organisation.
- Who was there: Who was involved in the decision? This isn’t intended to be
a wagging finger in the direction of who qualified the decision or
was responsible for it. Rather, it’s a way of adding
organisational transparency to the record so as to aid future
conversations.
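As a rough illustration of how little machinery this needs, the sketch below renders a record with the fields listed above as plain text. It is not the format produced by adr-tools or any other tool: the `DecisionRecord` class, the `title` field and the example decision are assumptions made up for this example.

```python
from dataclasses import dataclass
from datetime import date
from typing import List

# The three statuses named in the sample record above.
ALLOWED_STATUSES = ("Pending", "Accepted", "Rejected")


@dataclass
class DecisionRecord:
    title: str  # assumption: a short title, not one of the fields listed above
    decided_on: date
    status: str
    context: str
    decision: str
    consequences: str
    who_was_there: List[str]

    def render(self) -> str:
        """Render the record as plain text, one heading per field."""
        if self.status not in ALLOWED_STATUSES:
            raise ValueError(f"status must be one of {ALLOWED_STATUSES}")
        sections = [
            ("Date", self.decided_on.isoformat()),
            ("Status", self.status),
            ("Context", self.context),
            ("Decision", self.decision),
            ("Consequences", self.consequences),
            ("Who was there", ", ".join(self.who_was_there)),
        ]
        body = "\n\n".join(f"{heading}\n{text}" for heading, text in sections)
        return f"{self.title}\n\n{body}\n"


# Hypothetical decision, invented purely for illustration.
record = DecisionRecord(
    title="Use a single shared artifact repository",
    decided_on=date(2021, 6, 9),
    status="Accepted",
    context="Product teams each host their own artifact stores.",
    decision="The platform team will run one repository for all teams.",
    consequences="Teams must migrate existing artifacts; the platform team owns uptime.",
    who_was_there=["platform tech lead", "two product team developers"],
)

print(record.render())
```

Writing the output to a file in the platform repository, next to the code it describes, keeps the record close to where future readers will look for it.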
Ever been in a situation where you’ve identified some weirdness in
          your code? Ever wanted to reach back in time and ask whoever made
          that decision why something is the way it is? Ever been stuck trying
          to diagnose a production outage but for some reason you don’t have any
          documentation or meaningful tests? ADRs are a great way to supplement
          your working code with a living series of snapshots which document
          your system and the surrounding ecosystem. If you’re interested in
          reading more about ADRs, you can find them discussed in the
          context of Harmel-Law’s Advice Process.
Put yourselves in your users’ shoes
If you have any internal tools or services in your organisation which
      you found super easy and pain free to onboard with, then you are lucky!
      From our experience it’s still so surprising when you get access to the
      things you want. So imagine a world where you have spent time and effort
      to build your infrastructure platform and teams who onboard say “wow, that
      was easy!”. No matter your reason for building an infrastructure platform,
      this should be your aim! Things don’t always go so well if you have to
      mandate the usage of your infra products, so you’re going to have to
      actually make an effort to make people want to use your product. 
In regular product development, we might have people with capabilities
      such as user research, service design, content writing, and user
      experience experts. When building a platform, it’s easy to forget about
      filling these roles but it’s just as important if you want people in your
      organisation to enjoy using your platform products. So make sure that
      there is someone in your team driving end to end service design of your
      infrastructure product whether it is a developer, BA or UX person. 
An easy way to get started is to draw out your user journey. Let’s take
      an example of onboarding. 
Even without context on what this journey is, there are things to look
      out for which might signal a not so friendly user experience: 
- Handoffs between the developer user and your platform team
- There are a few loops which might set a developer user back in their
onboarding
- Lack of automation – a lot is being done by the platform team
- There are 9 steps for our developer user to complete before onboarding
with possible waiting time and delays in between
Ideally you want your onboarding process to look something like
      this:
As you can see, there is no Platform team involvement for the
      onboarding so it is fully self service, and there are only three steps for
      our developer user to follow. To achieve such a great experience for your
      users, you need to be thinking about what you can automate, and what you
      can simplify. There will be tradeoffs between a simple user journey and a
      simple codebase (as described in “don’t over-complicate things”). Both are
      important, so you need a strong product owner who can ensure that this
      tradeoff works for the reason you are delivering a platform in the first
      place i.e. if you are building a platform so that you can take your
      products to market faster, then a seamless and quick onboarding process is
      super important.
In reality, your onboarding process might look something more like
      this, especially when you release your MVP (see “Onboard users early”). Apply this
      thinking to any other interactions or processes which teams might have to
      go through when using your product. By creating a great user experience
      (and also having an infra product people want of course), you should not
      only have happy users but also great publicity within your organisation so
      that other teams want to onboard. Please don’t ignore this advice and get
      in a position where your organisation is mandating the usage of your
      nightmare-to-consume infrastructure platform and all your developer teams
      are sad 🙁
Don’t over-complicate things
All software is broken. Not to put too much of a downer on things, but
      every line of code that you write has a very high chance of becoming
      quickly obsolete. Every If Statement, design pattern, every line of
      configuration has the potential to break or to introduce a weird side
      effect. These may manifest themselves as a hard-to-reproduce bug or a
      full-blown outage. Your platform is no different! Just because your
      product doesn’t have a fancy, responsive UI or highly-available API doesn’t
      mean it isn’t liable to develop bugs. And what happens if the thing you’re
      building is a platform upon which other teams are building out their own
      services?
When you’re developing an infrastructure platform that other teams are
      dependent upon, your customers’ dev environments are your production
      environments. If your platform takes a tumble you might end up taking
      everyone else with you. You really don’t want to risk introducing downtime
      into another team’s dev processes. It can erode trust and even end up hurting the
      relationships with the very people you were trying to help!
One of the main (and horribly insidious) reasons for bugs in software
      relates to complexity. The greater the number of supported features, the
      more that your platform is trying to do, the more that can go wrong. But
      what’s one of the main reasons for complexity arising in platform
      teams?
Conway’s Law, for those that might not already be horribly, intimately
      acquainted, states that organizations tend to design systems which mirror
      their own internal communication structure. What this means from a
      software perspective is that often a system may be designed with certain
      “caveats” or “workarounds” which cater for a certain snapshot of time in
      an organisation’s history. Whilst this isn’t necessarily a bad thing, it
      can too easily influence the design decisions we make on the ground. If
      you’re building an API these kinds of design decisions might be
      easily-enough handled within the team. But if you’re building a system
      with a number of different integrations for many different teams (and
      their plethora of different nuances), this gets to be more of a
      problem.
So where’s the sweet spot between writing a bunch of finely-grained
      components which are really tightly-coupled to business processes, and
      building a platform which can support the growth of your organisation?
      
Generally speaking every component that you write as a team is another
      thing that’ll need to be measured, maintained and supported. Granted you
      may be limited by existing architectural debt, compliance constraints or
      security safeguards. Our takeaway here is just to think twice
      before you introduce another component to your solution. Every moving part
      you develop is an investment in post-live support and another potential
      failure mode.
Measure the important stuff
An article about Building Better Infrastructure Platforms would not be
      complete without a note about measuring things. We mentioned earlier the
      importance of defining a strategy with a measurable goal. So what does
      success look like? Is this something you can extract with code? Maybe you
      want to increase your users’ deployment frequency by reducing their
      operational friction? Maybe your true north is around providing a stable
      and secure artifact repository that teams can depend upon? Take some time
      to see if you can turn this success metric into a lightweight dashboard.
      Being able to celebrate your Wins is a massive boon both for your team’s
      morale and for helping to build confidence in your platform with the wider
      organisation!
The Four Key Metrics
We literally couldn’t talk about metrics without mentioning this.
        From the 2018 book Accelerate (a brilliant read about dev team
        performance), the four key metrics are a simple enough indicator of
        high-performing teams. They are:
- Delivery lead time
 - Rather than the time taken between “Please and Thank you” (from
initial ideation through analysis, development and delivery), here we’re
talking about the time it takes from code being committed to code
successfully running in production. The shorter (or perhaps more
importantly the more predictable) the duration of development, the
higher-performing the team can be said to be.
- Deployment frequency
 - Why is the number of times a team deploys their software important?
Typically speaking a high frequency of deployments is also linked to
much smaller deployments. The smaller the changesets being deployed into
your production environment, the safer your deployments are and the
easier they are to both test and remediate if there’s a need to roll back. If you
couple a high deployment frequency with a short delivery lead time you
are much more able to deliver value for your customers quickly and
safely.
- Change failure rate
 - This brings us to “change failure rate”. The fewer times your
deployments fail when you pull the trigger to get into Production, the
higher-performing the team can be said to be. But what defines a deployment
failure? A common misconception is for change failure rate to be equated
to red pipelines only. Whilst this is useful as an indicator of
general CI/CD health, change failure rate actually describes scenarios
where Production has been impaired by a deployment, and required
a rollback or fix-forward to remediate. If you’re able to keep an eye on this as a
metric, and reflect upon it during your team retrospectives and planning
you might be able to surface areas of technical debt which you can focus
upon.
- Mean time to recovery
 - The last of the 4 key metrics speaks to the recovery time of your
software in the event of a deployment failure. Given that your failed
deployment may result in an outage for your users, understanding your
current exposure gives you an idea of where you might need to spend some
more effort. That’s all very well and good for conventional “Product”
development, but what about for your platform? It turns out the 4 key
metrics are even MORE important if you’re building out a common platform
for folks. Your downtime is now the downtime of other software teams.
You are now a critical dependency in your organisation’s ability to
deliver software! 
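The four metrics described above can be sketched from a simple log of deployments. This is a minimal illustration under stated assumptions, not the method used in Accelerate: the `Deployment` record and its field names are hypothetical, lead time is summarised with a median, and recovery time is measured from the moment of deployment to the moment production was restored.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta
from statistics import median
from typing import Dict, List, Optional


@dataclass
class Deployment:
    committed_at: datetime  # when the change was committed
    deployed_at: datetime   # when it was running in production
    restored_at: Optional[datetime] = None  # set only if production was impaired


def four_key_metrics(deployments: List[Deployment], period: timedelta) -> Dict[str, object]:
    """Summarise a non-empty list of deployments made during `period`."""
    if not deployments:
        raise ValueError("need at least one deployment")
    lead_times = [d.deployed_at - d.committed_at for d in deployments]
    failures = [d for d in deployments if d.restored_at is not None]
    recoveries = [d.restored_at - d.deployed_at for d in failures]
    return {
        "median_lead_time": median(lead_times),
        "deployments_per_day": len(deployments) / (period.total_seconds() / 86400),
        "change_failure_rate": len(failures) / len(deployments),
        # None when nothing failed, so there is nothing to recover from.
        "mean_time_to_recovery": sum(recoveries, timedelta()) / len(failures) if failures else None,
    }


# Hypothetical week of deployments, invented purely for illustration.
week = [
    Deployment(datetime(2024, 1, 1, 9), datetime(2024, 1, 1, 10)),
    Deployment(datetime(2024, 1, 2, 9), datetime(2024, 1, 2, 12)),
    Deployment(datetime(2024, 1, 3, 9), datetime(2024, 1, 3, 11),
               restored_at=datetime(2024, 1, 3, 11, 30)),
    Deployment(datetime(2024, 1, 4, 9), datetime(2024, 1, 4, 13)),
]

metrics = four_key_metrics(week, period=timedelta(days=7))
for name, value in metrics.items():
    print(f"{name}: {value}")
```

Note that a deployment only counts as a failure here when it impaired production and needed restoring, in line with the definition of change failure rate above; a red pipeline that never reached production would not appear in this log at all.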
It’s important to recognise that the 4 key metrics are incredibly useful
        trailing indicators – they can give you a measure for how well you’ve
        achieved your goals. But what if you’ve not managed to get anyone to adopt
        your platform? Arguably the 4 key metrics only become useful once you have
        some users. Before you get here, focusing on understanding and promoting
        adoption is key!
There are many more options for measuring your software delivery, but
        how much is too much? Sometimes by focussing too much on measuring
        everything you can miss some of the more obviously-fixable things that
        are hiding in plain sight. Recognise that not all facets of platform
        design lend themselves to measurement. Equally, beware so-called “vanity
        metrics”. If you choose to measure something please do make sure that
        it’s relevant and actionable. If you select a metric that doesn’t turn a
        lever for your team or your users, you’re just making more work for
        yourselves. Pick the important things, throw away the rest!
Developing an infrastructure platform for other engineering teams may
      seem like an entirely different beast to creating more traditional
      software. But by adopting some or all of the 7 principles outlined in
      this article, we think that you’ll have a much better idea of your
      organisation’s true needs, a way to measure your success and ultimately
      a way of communicating your intent.