One of the frequent struggles that software development teams face is how to balance quality with the pressure of delivering results and innovation.
In The False Trade-off Between Quality and Speed, I discuss how the misconception that you can gain speed by trading away quality leads, in the long term, to less speed. But there is another, related struggle that many teams face: how to prioritise Operational Excellence under the pressure of a product roadmap they are committed to delivering.
In this article, I would like to explain why it is important to strike the right balance, elaborate a theory about why it is difficult, and offer some advice to managers and tech leaders facing the same challenges in order to help their teams navigate this complex situation.
What is Operational Excellence?
Operational Excellence refers to all the activities necessary to make sure the software is running properly, customers are being taken care of effectively, the product is performing according to expectations (often described in terms of latency, errors, and availability), and defects are resolved in a timely manner.
While in the past it was common to have separate teams, one focused on delivering new functionality and another focused on maintenance, in modern web service development most companies set up teams according to the mantra you build it, you run it. This means that a team is responsible both for developing new features and for operating its services in production.
You build it, you run it (when you have the time)
As described in the previous section, most modern engineering teams are responsible for building their services and, at the same time, running them.
In order to develop new features, most teams follow some kind of planning process that often goes like this:
- Someone — usually the Product Manager — brings to the table a list of problems, opportunities, and needs that if addressed would allow the business to achieve its goals;
- The team estimates the work required to build solutions to those problems and their available work capacity for the next period;
- All these inputs produce a plan of the work that the team will undertake in the next period.
In many cases, the plan consists of a set of features the team needs to build. This is traditionally referred to as a Product Roadmap. If the planning process is done properly, the plan will usually represent a good mental model of what the team needs to build in order to be successful, and it can serve as a way to focus the team and reduce the chaos which inherently inhabits the development process.
But reality has peculiar ways of surprising us. A typical scenario is that at some point, usually in the middle of the night, your service will blow up. Response times start to increase, and before you can even wake up, your service has stopped serving requests. A good development team has a process to deal with these unexpected events, usually a rotation in which an on-call engineer attends to the incident and tries to mitigate it. This is almost expected, and a good plan usually reserves a fixed amount of engineering capacity (usually 20%) to be spent on things like this.
But this is the point where the tension between following a plan and taking care of the software will start to manifest. Once the incident is mitigated, the team needs to identify the root cause, which might not be a trivial activity. Once the root cause is identified, it might need to be fixed immediately, to prevent a recurrence of the issue. Documentation might need to be written, new monitoring might have to be set up, and the service might need to be migrated to a new environment which is more fit for purpose.
At this point the team needs to make a decision: how can we balance the need to grow the product with the requirements to maintain existing systems?
In many cases, what I have observed is that people’s innate desire to follow a plan can get in the way of doing what is right for the users and for the success of the business. Many engineers, under the pressure of a committed roadmap, deadlines, and a sense of accountability, will try to hack and patch things as best as they can, and deliver their committed work according to the original plan. At the beginning, this seems okay, but over time it causes a deterioration of product quality, often to the point where every piece of development work becomes full of surprises, and waking up at 2am to attend to production issues becomes the norm.
Why does it happen?
It feels nice to have a plan
Faced with the challenge of balancing project delivery and operational excellence, many teams will pick the path of least resistance. And in many companies, the path of least resistance is to follow the plan. Why does this happen?
From my experience, most companies end up creating an environment where compliance with a plan is highly encouraged because of a combination of the following factors:
- a culture that puts an emphasis on top-down decision making;
- a performance management system that rewards employees for shipping on time;
- a lack of empathy towards the users of your product.
Culture plays a big role in shaping the way people inside a company make decisions. When unexpected circumstances arise, employees basically need to answer this question:
Is it safe to respond to change over blindly following the plan?
Your culture has many ways in which it influences when it is safe to respond to change, but it all boils down to two main factors:
a) how much do employees feel empowered to make decisions?
b) what is the reaction from management when change is required?
If employees don’t feel empowered to make decisions, the company might miss an opportunity to respond to change in a way that helps create value for customers. For example, a very common scenario is for employees to ignore the problem, until it becomes big enough for management to notice. But at that point, the problem will probably be so big that the cost to fix it is going to be very high.
Some employees might decide to escalate the problem up their management chain, and the way management reacts to the escalation sets the tone for how employees will perceive the safety of change. For example, if management shows annoyance at the change, or blames the individuals who, from their point of view, are responsible for the need to change, the team will develop a perception that deviating from the plan is not welcome, and they will feel less safe in responding to change.
Companies use several different performance management systems to reward employees for exceptional performance. Many of these rewards put an exceptional premium on delivering large-scale projects. I think many of us have heard things like “John deserves a bonus of 55% because he launched automated summaries through GenAI” more often than “Under John’s leadership the team achieved a system stability of 99.99%”. Many management teams care deeply about Operational Excellence, but they inadvertently discourage it by creating excessive hype around project delivery.
Finally, it is important to remember that Operational Excellence is part of what your users experience when they use your product. Sometimes, the development team becomes so focused on themselves and their plans that they forget that there is a human being on the other side of their product who is trying to get some work done. When this happens, Operational Excellence is perceived as a hindrance that prevents engineers from working on what they really like, and as such, it will not get actively prioritised.
How to prioritise Operational Excellence
In this section I would like to explore some techniques that I have used in the past to create an environment where Operational Excellence is prioritised together with project delivery.
Given that plans often get in the way of doing what is right for the user, does this mean that we don’t need plans at all?
All models are wrong but some are useful — George Box
Plans play a useful role in software development — for example, they often help understand the exact extent of a change, focus the development process on key strategic initiatives, or prevent initiatives whose effects cancel each other out. A plan is also a very useful tool to understand dependencies between teams, and allow those teams to coordinate their actions for the success of the initiative.
The key is to internalise the concept that no matter how good or accurate a plan is — a plan will always be a partial representation of reality. A plan, at its best, can have a structure which is similar to the reality it represents, but it can never be the same thing. In Alfred Korzybski’s words:
A map is not the territory it represents, but, if correct, it has a similar structure to the territory, which accounts for its usefulness — Alfred Korzybski, Science and Sanity
The downsides that come from the adoption of plans come not from the plans themselves, but from assuming that the map is the territory, so that roadmap-driven work seems to be the only work that is allowed. The problem is not that we have plans. The problem is how safe people feel to respond to change.
Here are three ways you can increase safety and improve the way Operational Excellence is prioritised.
Focus on the customer
As we have seen in the previous paragraphs, Operational Excellence is a key component of providing a great customer experience.
In order to ensure Operational Excellence is prioritised, it is important to create a relentless focus on the customer. In a healthy engineering team, the starting point should always be the customer. Simply asking the question “What would the customer like us to do in this case?” goes a long way in helping people make the right decisions.
The problem is that usually the customer is exactly the person whose voice is missing when this question is asked, thus it’s important to find ways to represent the customer’s point of view in these discussions.
There are two ways to do this.
First, it’s useful to define what a good customer experience looks like, and give the team a way to see how their services are doing in that regard — this can be achieved by defining Service Level Indicators (SLIs) and Service Level Objectives (SLOs). Once the team has defined them, you can sit together to review them on a regular basis — for example every week — and dive deep every time that there is a change in the level of attainment. This builds an expectation that these metrics matter, and even a small deviation is taken seriously.
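As a rough illustration of what such a weekly review could compute, here is a minimal Python sketch. It assumes an availability SLI defined as the fraction of successful requests and an SLO target of 99.9%; the function names, the tuple layout, and the sample numbers are all hypothetical and not taken from any particular monitoring tool.

```python
def availability_sli(total_requests, failed_requests):
    """SLI: fraction of requests served successfully in the window."""
    if total_requests == 0:
        return 1.0  # assumption: a window with no traffic counts as available
    return 1 - failed_requests / total_requests


def error_budget_remaining(total_requests, failed_requests, slo_target):
    """Fraction of the error budget left; a negative value means the SLO was missed."""
    allowed_failures = (1 - slo_target) * total_requests
    if allowed_failures == 0:
        return 0.0
    return 1 - failed_requests / allowed_failures


def weekly_review(windows, slo_target=0.999):
    """Build one row per week: attainment, SLO status, and change from the previous week.

    `windows` is a list of (label, total_requests, failed_requests) tuples.
    """
    rows = []
    previous_sli = None
    for label, total, failed in windows:
        sli = availability_sli(total, failed)
        rows.append({
            "label": label,
            "sli": sli,
            "meets_slo": sli >= slo_target,
            # A change in attainment is the cue for a deep dive in the review.
            "change": None if previous_sli is None else sli - previous_sli,
        })
        previous_sli = sli
    return rows


if __name__ == "__main__":
    # Hypothetical request counts for two consecutive weeks.
    weeks = [("week 1", 1_000_000, 500), ("week 2", 1_000_000, 2_000)]
    for row in weekly_review(weeks):
        print(row)
```

Reporting the week-over-week change alongside the pass/fail status keeps the conversation on trends, so a small deviation gets discussed before the objective is actually missed.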
Second, while SLIs and SLOs give us a way to measure the customer experience, they don’t tell us why this matters. In order to find the why, it’s important that you immerse yourself and your team into customer feedback. You need to develop a way for the team to be regularly exposed to customer feedback — for example, through customer visits or rotations in customer support — and then organise debrief sessions so the team can look back at the experience and understand why their products matter.
Empower the team to deviate from the plan
Creating a relentless focus on the customer is not enough if people are not empowered to act on the insights they get from close contact with customers. If you want to create an environment where people can respond to change, it is necessary to push authority down to those same people, so that they feel in control of their work and are empowered to drive it as they see fit.
This is, in essence, the concept of an empowered product team.
The team needs to feel in charge of their work, in terms of both how they perform it and what they focus on. In exchange, the team needs to be accountable for their performance. If a team needs to ask for permission in order to change their plans, they are less likely to be able to respond to change.
To make this work, you need to be mindful not only of explicit permission checks but also of behaviours that might create an implicit need to ask for permission. For example, how you react when plans change will implicitly set the limits on how much plans can change. If you have a negative reaction to a changing date, the takeaway for the team will be that change is not welcome, and they will feel less empowered to make changes next time.
A bit of mindfulness goes a long way in ensuring your actions create the environment you desire. For example, I like for all my projects to have explicit target dates, but I set explicit expectations with the team that dates can change, and I provide a simple checklist to navigate the change:
- Why is the date changing?
- What have you tried already to avoid changing the date?
- What is the new date and who needs to know about the change?
This creates accountability around the change, but it still keeps control with the people making the changes.
Walk the talk
It is very easy to undermine your efforts in prioritising Operational Excellence if your actions give the impression that it is not a priority for you. In general, every employee learns to understand the difference between what leaders say they care about, and what they really care about. You can just look at where leaders are spending the two most critical resources they have available: time and money.
If a leader says that they care about Operational Excellence but they never spend time with the team to review operational metrics, while at the same time attending every project review, it is reasonable for the team to assume that Operational Excellence is a second-class citizen and they should prioritise their time in the same way.
A similar phenomenon can be observed by looking at the way the company rewards exceptional performance. If people are getting rewarded for project delivery without any accountability for Operational Excellence, it is a reasonable strategy for an employee to focus on project delivery to maximise their personal gain.
For these reasons, it is important that as a leader you walk the talk and act as if Operational Excellence really matters to you. As a leader, you should for example:
- meet with the team to review operational metrics;
- hold team leaders accountable for performance against SLOs and ask questions when there are deviations;
- when assessing performance, balance achievements in project delivery with investments in Operational Excellence.
Conclusion
Prioritising Operational Excellence can feel like an uphill battle, but your customers will thank you for it. Your software will get better, customers will be able to use it without interruptions, and engineers will be able to spend more time building software and developing their skills, which in turn improves their morale. Fail to prioritise Operational Excellence, and problems will start to pile up and become more difficult to solve, to the point where the software is no longer sustainable and more drastic solutions are needed.
As a leader, change starts with you and you have the power to change your team’s operational story.