This is my attempt to map real-world engineering tasks into a 2×2 matrix, and to show where today’s LLM/agent tools actually work,
and where they are smoke and mirrors.
It’s pretty hard to tell what’s real and what’s hype when browsing X, watching YouTube videos and reading blog articles.
Few people blog with “the truth” at heart. It’s much catchier to oversell
LLMs, or to oversell how much the ecosystem will change.
Then there is the FOMO – it seems like every day there’s a new model, a new agent, a new paradigm shift.
And your company – they don’t want to be left behind, and they have investors waiting to hear about
how much AI the company is adopting and how it’s changing the workplace.
And so there you are, trying to keep up with reading about and using the latest tool, model, or paradigm shift.
But then there’s real life. In real-life coding at a company, you are not creating an app from scratch
every morning. You have real applications in production, with (hopefully) paying customers who are not so
patient with explanations about how AI is changing the world while your services are down.
And your apps are messy; you moved fast as a startup and have a lot of technical debt you are barely able to cover.
Sure, maybe one day LLMs will be able to take care of all of that on their own, but that day is not today.
So that’s what I want to do in this post: map real-world engineering work into a simple 2×2 matrix and show where today’s LLMs actually help,
and where they mostly just add risk.
As a programmer (and manager) you have features to deliver, bugs to fix, and software and infra
updates to do, all while maintaining your production SLA and keeping your customers and the business org happy.
Isn’t that what enterprise software is about?
The measurement problem
I’m not going to go into details about how to measure “AI adoption” or “AI readiness” because there’s enough
material written on that and I’m not an expert. But I am a big believer in Goodhart’s
Law:
“When a measure becomes a target, it ceases to be a good measure.” People will find a way to game the measure,
and it stops being useful, even if it was useful in the beginning.
Imagine you try to track “tool usage”, and everyone uses Cursor. Does that make software devs more productive, or your
deliveries better, quicker, or more stable? Just because there’s hype about it online, and even if developers feel they are
now writing code faster, it does not mean you can translate that into a concrete business metric.
Most “AI adoption” metrics people invent ignore the actual shape of the work.
They count prompts and tool usage instead of asking “which quadrant of work are we actually moving?”
For example, if you set “% of commits touched by AI” as a target, engineers will route everything through the tool
just to tick the box, while actual lead time and defect rate barely move.
The problem-LLM matrix
This is the way I currently see where and how current tech can actually help
with real-world programming at the office:

Let’s start by explaining the axes:
- X axis: task complexity. Some tasks are simple and small in scope; at the extreme,
they are simple word/line edits in a single file. At the other extreme, complex tasks require multi-project,
multi-tech changes and even infrastructure changes.
- Y axis: whether there are potential side-effects from completing the task. At the lower end
we have tasks with absolutely no side-effects: nothing in the outside world breaks or changes as a result of the task.
At the other end we have tasks with extreme side-effects: downstream or upstream (or both) systems can break from your
change. The task can easily affect user experience, or any kind of business metric. And things can break in unexpected ways.
Side-effects also determine whether the agent/LLM can run in a closed loop or an open loop. In a closed
loop, the model can understand all possible side-effects of its change (know them, check them); when there are many or unknown
side-effects, it’s an open loop where the model (and often even the human) simply does not know
what this particular change will do.
- Closed loop: there’s a reliable way to check the result automatically (tests, lint, build, maybe a sandbox environment plus monitoring).
The agent can run and then “ask the world” if things are OK.
- Open loop: no reliable way to know quickly if we broke something, or the blast radius is unknown.
In practice, closed-loop tasks tend to be low side-effect, and open-loop tasks tend to be high side-effect,
but the axis is about the side-effects/risk, not just about tests.
Let’s go over the quadrants:
Simple and closed loop tasks
These are basically a solved problem with current tools. With today’s models and basic wiring (tests, lint,
CI), you can get close to full automation.
Examples:
- Fixing a logical bug in a function/file.
- Improving the runtime performance of deterministic code.
- Decreasing the size of a Docker image where the changes are non-functional and/or can be easily tested.
- Upgrading dependencies in an isolated project that has good enough CI/tests.
- Implementing a feature – like adding another tab to a UI, or even another endpoint to a back-end server.
I want to argue that a lot of tasks we consider complex are actually simple, or can be made simple.
And a lot of tasks we consider “open loop” are actually closed loop, and current tech knows how to handle them if
you teach it (run tests with make test, build the Docker image with make build, run lint with make lint).
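To make that concrete, here is a minimal sketch of the kind of wrapper an agent (or its harness) can run after every change to “close the loop”. It assumes your repo really does expose the make test, make build and make lint targets mentioned above; swap in whatever your project uses:

```python
import subprocess

# The checks that define "the loop is closed" for this repo.
# Assumes the make targets mentioned above actually exist in your project.
CHECKS = ["make lint", "make test", "make build"]


def run_closed_loop_checks() -> bool:
    """Run the project's automated checks; return True only if all pass.

    This is the "ask the world" step: after the agent edits code, it (or a
    wrapper like this) runs the checks and only keeps the change if everything
    is green.
    """
    for cmd in CHECKS:
        print(f"running: {cmd}")
        result = subprocess.run(cmd, shell=True)
        if result.returncode != 0:
            print(f"FAILED: {cmd}")
            return False
    print("all checks passed")
    return True


if __name__ == "__main__":
    raise SystemExit(0 if run_closed_loop_checks() else 1)
```

The point is not the script itself: any agent harness can run shell commands. The point is that someone has to spell out which commands constitute “the world saying OK” for this particular project.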
Complex and closed loop tasks
The canonical example here is creating a brand new application, or an isolated microservice. You might be
creating a back-end, a UI, even a database from scratch. But they are isolated and not yet connected to the world.
The LLM-based generation tools (Lovable, v0 and friends) shine here,
even though these tasks are complex, because they are “effect-free”. This is where vibe-coding works –
exploring and iterating quickly while the blast radius is small or nonexistent.
The important thing is that there are no side-effects from your task, as complex as it may get.
This is why, with current tools and models, a lot of these types of tasks are already achievable today. Of course the tech
will get better over time and make working on these tasks even more straightforward.
Simple and open loop tasks
This is where the problem space gets interesting, and likely holds a lot of value for the business, and unfortunately
it is where a lot of systems that are tightly coupled to many other systems sit. Developers with mature production
applications are probably thinking: LLMs cannot help me; if I change this one thing, anything can break.
I want to give an example from my current workplace, where we have an incredibly sophisticated (and complex)
decisioning system with hundreds of attributes that are calculated in an efficient topology.
A lot of these attributes are analytical in nature, and most of them are plain old side-effect-free, deterministic
code. Let’s assume an analyst wants to change the name of an attribute. It’s a simple rename – a change of a few
characters.
This is as simple as it gets. However, the amount of side-effects that can happen due to this change is
extremely high. Many of our downstream systems – data processing, research, billing, the customer dashboard – might be
reliant on this attribute, either to calculate something or to display something. And you might have no idea that something
broke due to your change. It might create a latent issue that will only be uncovered later (if at all).
Telling an agent to make the change is simple and quick. Teaching it to fathom the side-effects, however, is
extremely difficult, and often relies on the fact that you, yourself, know them in the first place. Not all
side-effects are even mapped, and things can be related through another proxy attribute.
This quadrant is genuinely difficult, even for the simplest of tasks. With today’s tools we are struggling
to make these types of changes.
Another example:
- You upgrade a dependency in a project that is also used in another system. One example from my company:
a Python project that runs standalone, but is also executed in a Spark runtime in another
project. The upgrade within the project is simple enough; the struggle is understanding what else you might break
because your code is executed in other places.
Some ways I’ve found that make coping with this quadrant easier (but won’t fully solve it):
- Teach your model about the side-effects. For example: when you change this code, you also need
to test the change in another project that uses it; or when you change this thing, you need to look at our
monitoring solution to ensure nothing else broke or was affected.
Sometimes you are able to teach the model about side-effects, but it’s hard to get it to test them. There is a
difference between knowing about a side-effect and testing it. Sometimes, in order to test something, you actually
need to deploy your system, and then the change becomes more dangerous. (See the sketch below for one way to write
these side-effects down.)
- Break apart system and data dependencies, or at least decouple them enough. This might reduce the
side-effects to a point where your model no longer needs to know or worry about them.
The options above basically create a downward movement on the matrix, turning open-loop tasks into semi (or completely) closed-loop
tasks. Acting here is an art that is not yet fully developed, but I assume the industry will create better
systems, practices and tools to improve our results here.
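As a rough illustration of the first option, here is a sketch of what it looks like to write the side-effect knowledge down as data rather than tribal knowledge. Every path, project name and command below is made up for the example; the shape is what matters:

```python
import subprocess

# Hypothetical map from code areas to the extra checks they trigger.
# All paths, project names and commands here are illustrative.
SIDE_EFFECT_CHECKS: dict[str, list[str]] = {
    "decisioning/attributes/": [
        "make -C ../data-pipeline test",       # downstream consumer of the attributes
        "make -C ../customer-dashboard test",  # displays attribute values
    ],
    "shared-lib/": [
        "make -C ../spark-jobs test",          # the same code also runs inside Spark
    ],
}


def extra_checks_for(changed_files: list[str]) -> list[str]:
    """Collect the downstream checks triggered by the changed files (deduplicated)."""
    commands: list[str] = []
    for path in changed_files:
        for prefix, checks in SIDE_EFFECT_CHECKS.items():
            if path.startswith(prefix):
                for check in checks:
                    if check not in commands:
                        commands.append(check)
    return commands


def run_extra_checks(changed_files: list[str]) -> bool:
    """Run the extra checks; return True only if all of them pass."""
    for cmd in extra_checks_for(changed_files):
        print(f"running downstream check: {cmd}")
        if subprocess.run(cmd, shell=True).returncode != 0:
            return False
    return True
```

Whether this lives in a config file the agent reads, a CI rule, or plain instructions in the repo matters less than the fact that it exists. Once the downstream checks are explicit, the task starts drifting down the matrix toward the closed-loop row.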
If you are a developer, and you don’t understand the LLM hype because your system has many side-effects, you are not
alone. I’m assuming a lot of seasoned developers feel the same. Know that it’s a nut worth cracking.
Complex and open loop tasks
This is the holy grail, one which we may never reach, or which may take a long time. How can an agent make changes
as part of a complex task in an open-loop system, where it doesn’t know about the side-effects?
We don’t yet know how to really tackle these, but know that there is no magic here.
It’s either breaking a complex task down into simpler tasks, or reducing the side-effects, like we talked about
in the third quadrant above.
I’m not going to talk much about this because it’s something we haven’t tackled yet, and I haven’t yet seen
it done in a successful and reproducible way.
If you are extremely risk-tolerant, perhaps you don’t mind that things can break. Maybe you don’t yet have customers
on this product. In that case, let the agent loose and you might be able to make it work.
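To pull the four quadrants together, here is a rough sketch of the triage described above. The quadrant names and “recommendations” are just my shorthand for this post, not any tool’s API:

```python
from enum import Enum


class Quadrant(Enum):
    SIMPLE_CLOSED = "basically solved: automate with today's tools"
    COMPLEX_CLOSED = "vibe-code it: iterate freely, the blast radius is small"
    SIMPLE_OPEN = "map or reduce the side-effects before trusting an agent"
    COMPLEX_OPEN = "break the task down and reduce side-effects first"


def classify(task_is_complex: bool, side_effects_unknown: bool) -> Quadrant:
    """Place a task on the 2x2 before reaching for an agent."""
    if side_effects_unknown:
        return Quadrant.COMPLEX_OPEN if task_is_complex else Quadrant.SIMPLE_OPEN
    return Quadrant.COMPLEX_CLOSED if task_is_complex else Quadrant.SIMPLE_CLOSED


# Example: renaming a widely used attribute is a simple edit with an unknown blast radius.
print(classify(task_is_complex=False, side_effects_unknown=True).value)
# -> "map or reduce the side-effects before trusting an agent"
```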
What to focus on in 2026
My own focus, and what I think developers should gain expertise in during 2026, is making sure they know how to
recognize tasks in the first quadrant, and especially how to solve them with today’s tools.
You’ll be surprised at how many things can be solved today without any further tech advancement, and how many people
around you are still using 2020-era tools to solve what can be solved easily with today’s agents.
Another key focus is recognizing when a complex task has no side-effects – those can already be solved pretty
well with today’s tools. And focus on trying to crack the third quadrant.
My guess is that the industry will try to devise strategies, tools and best practices for tackling
those simple, open-loop tasks, so you are not alone.
If you are an architect or have a broader role, you also need to think about where it’s possible
to break apart the coupling of systems and data. This can reduce side-effects for developers and make the
tasks much easier to solve.