To Benefit From AI, Your Organization’s Learning Loops Must Evolve
Ideas are cheap. AI is poised to make outputs just as cheap. But without higher-level feedback loops, your organization’s decision-making will remain a bottleneck to creating value.
A long time ago, the biggest obstacle to success was having the right groundbreaking idea. Today, ideas are cheap — the biggest obstacle is putting in the work to bring one idea to life. But not for much longer.
Companies have been trying to drive the costs of delivery down for years, embracing practices like off-shoring development work and no-code tools. In an industry where the ability to deliver something to market is often a sufficient differentiator, GPT-4 seems like the ultimate solution to our problems. It is now possible to imagine a future where AI tools make outputs nearly instantaneous.
But delivery is far from the only remaining bottleneck to creating customer value. To meaningfully benefit from AI, your organization’s decision-making humans need to be able to keep up with it.
The only way to survive in an environment where a thousand GPT-generated apps demand the attention of your customers will be to fight quantity with quality — and that will require learning how to figure out what quality means to your customer, and whether that definition of quality overlaps with business needs.
The productivity of teams in the age of AI will be measured by whether they stage the right experiments and by how quickly they can learn from the results. To remain competitive, product managers need to familiarize themselves with applied organizational learning theory — otherwise known as design.
Single-loop learning and the race for velocity
“The more efficient you are at doing the wrong thing, the wronger you become.” –Russell Ackoff
The effectiveness of a decision-making loop is defined by two variables. The more obvious one is velocity. Tighter feedback loops result in quicker progress towards the goal. But the other variable — the number of layers in your organizational learning loop — is usually the more impactful of the two.
Consider a typical product team that has adopted the recommended SAFe mechanisms in an effort to improve their velocity. Going through Program Increment (PI) planning lets the team define their work crisply and remove blockers ahead of time. This is the first learning loop: are we doing a good job of delivering the outputs we defined?
Imagine that this team acquired a new AI-powered tool that automatically wrote code. Instead of assigning tickets to developers, the team could just plug their Definition of Done directly into ChatGPT and get the necessary code written in seconds instead of days. Their velocity would be off the charts, and they would certainly deliver all the specified outputs on time. Sounds nice — if the number of outputs were their ultimate goal.
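To make the scenario concrete, here is a minimal sketch of what “plugging the Definition of Done into the model” might look like, assuming the OpenAI Python client; the ticket text, prompts, and model name are illustrative placeholders rather than a recommendation:

```python
# Minimal sketch of the "Definition of Done straight into the model" workflow.
# Assumes the OpenAI Python client (openai>=1.0); the ticket contents, prompts,
# and model name are illustrative placeholders only.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

definition_of_done = """
- Add an 'Export as CSV' button to the reports page
- Follows the product's established interaction patterns
- Unit tests cover the new export path
"""

response = client.chat.completions.create(
    model="gpt-4",
    messages=[
        {"role": "system", "content": "You are a senior engineer. Write production-ready code."},
        {"role": "user", "content": f"Implement this ticket:\n{definition_of_done}"},
    ],
)

print(response.choices[0].message.content)  # code arrives in seconds, not days
```

Notice what the snippet does not contain: nothing in it can tell the team whether the ticket was worth writing in the first place.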
If this team followed SAFe, their success would also be measured quarterly — via lagging indicators such as revenue or NPS. No matter how quickly they ship, even if they could deliver the next version instantaneously, this team would still have to wait three months to find out if they did a good job, and determine the requirements for the next release.
The best they can do in the meantime is say, “hmm, that’s not right,” and do SAFe even harder or make some small tweaks to find a local maximum.
In other words, while this team was able to tighten their first loop, they are still constrained by the next level of organizational learning. Without a way to define what to build (and more importantly what not to build), being able to deliver features at the speed of ideas becomes a curse rather than a blessing. No amount of AI enhancements to the tools this team uses for producing outputs would allow them to course-correct faster and thus achieve organizational outcomes more quickly.
Double-loop learning and opportunity cost
“You can do anything, but not everything.” –David Allen
The second organizational learning loop is selecting the right approach to reach the goal. The most famous double-loop model is OODA (Observe, Orient, Decide, Act). If single-loop learning represents simply observing (“are we done yet?”) and acting (“do more work”), the double loop adds two additional questions: did the work we have already done get us closer to our goal, and should we change the type of work we are doing? Intuitively, adding extra steps to a process feels like it will slow it down rather than speed it up, but don’t forget that slow is smooth, and smooth is fast.
Unfortunately for most product teams, the throughput of this decision loop is constrained by a scarce resource: user attention. To measure the effectiveness of something they have shipped, they must wait for usage metrics to come in. If a team is measured on the number of monthly active users, it will always take one month before a meaningful uptick can be detected.
There is a line of thinking among proponents of LLM tools that GPT has sufficient reasoning skill to operate at this level of decision-making. If GPT can pass the Bar, surely it can make business decisions — or at least tell us about the problems with our website. Then we’ll be able to generate the right product instantaneously, right?
Well, not so fast. Not only are attempts to emulate user research with LLMs doomed from the start, but your competitors will also have access to exactly the same insights as you. The human in the decision-making loop is here to stay, because that human is the only edge any company is going to have over a competitor.
Unlike software, humans don’t scale elastically. While the number of AI queries you can run is only constrained by your AWS budget, you only have so many people (employees and users) to ask questions. Any question you choose to ask incurs opportunity cost — the price you pay is not having enough time to ask some other question.
Fortunately, there is a discipline where opportunity-cost constraints are a given. This discipline has developed a powerful mechanism to short-circuit second-loop learning and take action long before the lagging indicators have rolled in.
After two decades of “thinking like a designer,” it’s time to learn how to critique like one.
Steering the loop with design critique: the most important customer benefit
The common conception of the design process oscillates between two phases: the designer doing design, and the designer testing their work with users. Critique is the crucial layer between these two activities; it exists to optimize the usefulness of insights gained from research by strengthening the rigor of the designer’s thinking.
Only poorly done critique hinges on “good taste.” Professional critique refines not just a visual artifact, but the shared mental model that defines what “good” means in the context of the problem at hand. This mental model is what connects the dots between business goals like “increase monthly active users (MAU)” and Definition of Done requirements like “follows the product’s established interaction patterns.”
Design critique can be applied to the outputs of an LLM just as easily as to designs created by hand. To someone familiar with design critique, the outputs of GPT models look a lot like the work of a junior designer who can produce visuals but struggles to articulate why they made the choices they did. Both make decisions because they have seen it done that way before — but they don’t know why doing it that way made sense in that context.
To guide design decisions towards desired outcomes, critique always begins with the question — “what was your goal?” What outcome were you hoping for, what problem were you solving, what opportunity were you pursuing? This is a critical question to ask, because when solving wicked problems the problem framing can never be taken for granted, and can evolve throughout the design process.
After framing their goal, a designer in critique will explain the primary user benefit their work was trying to provide; in other words, what missing capability was causing the problem. Different solution concepts are then compared to one another based on how well they deliver that primary benefit.
There are two common critique questions that designers ask to identify gaps in this thought process:
- How does this solution achieve the goal? If the answer is unsatisfying, the solution concept may be poorly framed. The hypothesis for how it solves the problem is missing.
- Is there another way to provide that benefit? If the answer is “no,” your opportunity may have been framed as “users don’t have the feature we want to build.” The framing itself needs work.
Experiments and leading indicators
“If you can’t judge the quality of the answer, asking is pointless.” –Amy Hoy
Every solution is an assumption, and even the best design critique can only refine that assumption. The designer’s adage “the user is not like me” has never been truer than when the designer is a machine with no lived experience of its own. This is why we need the third critical feedback loop of the design process — audience testing. And of the three loops, its throughput is the one most constrained by opportunity cost.
There is a reason that critiques focus on helping the designer narrow down the number of ideas: regardless of how many you produce, there are only so many user eyeballs to evaluate them. Whether you have ten options from a human designer or ten thousand from GPT, only a few of them can go through proper, high-quality testing (and doing bad testing instead is not the answer).
Jumping straight from setting an objective like “increase NPS” to testing potential features that might increase NPS is a poor experiment because the vast majority of results will be inconclusive. Was the execution itself bad? Was it providing an unnecessary capability? Or was the entire problem that the feature was solving a non-issue for customers to begin with?
Instead, designers frame a separate hypothesis for each of these questions and test them in sequence. Using low-fidelity artifacts like scenario storyboards to cheaply evaluate the magnitude of a few pain points avoids confounding factors and helps align the team around the single most important problem to solve. Similarly, research into potential capabilities to solve that problem will identify one primary user benefit that would be valuable to deliver. And then it becomes trivial to put the question to GPT: how do we provide this exact benefit to the user?
Of course, the integrity of this entire process depends on conclusive evidence — leading indicators — being available. Without them, we are back to the team that waits three months for their NPS benchmark or one month to update their MAUs. These experiments are only as valuable as the accuracy of the team’s proxy metrics, which brings us to the third loop of organizational learning.
Triple-loop learning and the desirable future
“The compass determines direction. The navigation determines the route. The route leads to the destination. In that order. The order is key.” –A.R. Moxon
In a world obsessed with making outputs easier to achieve, the most important question — whether the consequences of those outputs bring us closer to what we actually want — often falls by the wayside.
The tools to set great outcome-based goals already exist. Unfortunately, as feature teams applied these tools in an effort to achieve agile transformation without reforming their strategy, the tools suffered drift. Because feature teams had no accountability for outcomes, they used these tools to measure their outputs instead.
But in a world where AI tools generate code with unlimited velocity, “outcomes over outputs” stops being aspirational and becomes existential. Managers will have to re-learn how to set measurable outcome goals (what John Cutler calls inputs to the North Star metric) and form a useful hypothesis for what opportunities the business should pursue to achieve those outcomes.
With a human delivery team, managers could get away with very fuzzy requirements, and rely on their reports to work out the details. While not very helpful, feedback along the lines of “I’ll know it when I see it” was at least sufficient to get those teams thinking about how else the deliverable could work. The thought process of “what is it?” could be displaced from the stakeholder to the designer. Humans working on the output could fall back on their mental model of user needs to fill in the gaps — and when that mental model differed from the leader’s understanding, they could push back.
But this displacement is not possible with a statistical model, because statistics cannot reason. No matter how advanced tools like GPT become, the underlying technology will never be able to interpret the intent behind the prompt, or tell you that you are wrong to want that thing.
To articulate what they want their AI tool to produce, managers will require a crisp understanding of what outcomes they want to achieve. And the same process designers use to govern the second loop can structure the thinking necessary for the third.
Applying the design process to the third loop
The same tools that help designers define a mental model around a single product or service can be applied to form a mental model at a higher level of abstraction: business strategy across one or more products. In the same way that design critique can connect the dots between the user goal and the right way to achieve it, it can help find the path between the business goal and the necessary inputs that will move us towards it.
Analogous to identifying a valuable problem to solve, a business leader needs to be able to set a valuable North Star: a leading indicator that is a proxy for a self-evidently valuable metric like retention or revenue. Determining this indicator is not a nice-to-have; it is literally the first line of the job description. In the language of OKRs, this is the Objective — the thing we have decided we want to achieve.
Next come the input metrics to the North Star (in OKR terms: the Key Results). Together, moving these metrics in the right direction should roll up to accomplish that crowning achievement. The parallels to the primary user benefit should be obvious: we do not necessarily know how we will accomplish these, but we know that we want to try because it’s the best path to the desired outcome.
And finally, there are the levers that we think we can move to achieve those results. In healthy organizations, top leadership has some theory of victory — a conception of some overlap between the levers their org can move, and the ones that will lead to positive impact on the input metrics. But just as designers listen to user feedback, executives should expect their product teams to have their own thoughts about the right levers to pull — and at all costs resist the urge to tell them how to pull those levers, which would be analogous to designers dictating goals to the user!
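To make the hierarchy concrete, here is a minimal sketch of how a team might write this structure down, assuming a hypothetical strategy expressed as plain data; the metric names, targets, and levers are illustrative placeholders, not real benchmarks:

```python
# A hypothetical strategy tree: the North Star (the Objective), its input
# metrics (the Key Results), and the levers a team believes it can pull.
# All names and numbers below are illustrative placeholders.
from dataclasses import dataclass, field

@dataclass
class InputMetric:
    name: str                  # the Key Result leadership asks the team to move
    target: str                # direction and magnitude of the desired change
    levers: list[str] = field(default_factory=list)  # the team's own theory of how

@dataclass
class NorthStar:
    objective: str             # leading indicator that proxies retention or revenue
    inputs: list[InputMetric]  # moving these should roll up to the objective

strategy = NorthStar(
    objective="Weekly teams that complete a core workflow",
    inputs=[
        InputMetric(
            name="New-user activation rate",
            target="+5 points this quarter",
            levers=["shorter onboarding", "sample content on first run"],
        ),
        InputMetric(
            name="Workflows completed per active team",
            target="+10% this quarter",
            levers=["fewer steps in the workflow", "surface it from the home screen"],
        ),
    ],
)
```

The detail worth noticing is who owns which layer: the objective and the input metrics come from leadership, while each levers list belongs to the team closest to the work.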
Quadruple-loop learning
There is yet another level of organizational learning to which a company may aspire. The fourth learning loop covers how an organization learns to learn — how quickly it can ingest information about the current state of the world and re-generate the goals it sets for its third, second, and first loops.
Recent developments in productizing AI — such as Microsoft enhancing Bing search with ChatGPT — point to a future in which these tools can function at the level of this learning loop. But as with the other loops, language models cannot help us make decisions that secure a unique advantage in the market. Only the integrity of our thinking — established via effective application of the design process — can do that.