Watching Is Not Supervising

What it takes to stay genuinely in the loop

May 14, 2026

A follow-up to Passenger in the Driver’s Seat

The Thread That Stayed

After Passenger in the Driver’s Seat went out, a thought surfaced that I hadn’t fully worked through when writing it.

The first post asked what happens to the driver when the car drives itself. It landed on an answer worth sitting with: the practice loop closes, the skill atrophies, and the practitioner is left asking what their role actually is. But a reader’s observation opened a different angle. If the car drives itself, the driver doesn’t disappear. They are still in the seat. Still nominally responsible. Still present. What we become is something between driver and passenger — and that something is worth examining more carefully.

Passive passenger doesn’t quite capture it. Neither does driver. The comments that came back reached for different words: director, observer, orchestrator. None of them settled. And that unsettlement is itself a signal. We do not yet have the right language for what the human role becomes in an agentic system — which usually means we haven’t thought it through carefully enough. What we become — in the language the first post used — is something closer to actant than agent: present in the system, bearing responsibility, but no longer the primary locus of decision.

This post thinks through what that participation — being present in an agentic system as a genuine contributor rather than a passive observer — requires. The argument: the role that remains for the human in an agentic system is fundamentally one of condition-setting. Not executing, not merely approving — establishing the constraints under which the system can be trusted to act, and under which the collaboration has a better chance of producing what was actually wanted.

What does it mean to participate well?

Supervised or Spectating?

What changes and what doesn’t when a car drives itself?

The destination is still chosen by the human. The purpose of the journey is still theirs. If something goes wrong — a route that seems unsafe, a decision the system makes that doesn’t feel right — the human is still the one who notices, who intervenes, who bears responsibility for the outcome. They are not absent. They are not irrelevant. But they are no longer doing the thing that used to constitute driving.

They are watching. Monitoring. Ready to act, but not acting. Present in the system without being the agent of its moment-to-moment decisions.

This is what the McChrystal Group describes as “eyes on, hands off” — the leadership posture in which you empower a system to act while maintaining the oversight that keeps it aligned with intent. In Team of Teams, McChrystal draws on the example of Admiral Nelson at Trafalgar to make a related point: Nelson’s genius lay not in his real-time command of the battle but in the years of preparation that preceded it — the doctrine, the drills, the shared understanding of intent — so that when the battle came, his captains could act without waiting for orders.

The commander’s role was not to direct each ship but to establish the conditions under which each captain could be trusted to act well.

The parallel to agentic AI is direct. Conversations about elevated roles — developers becoming product managers, architects, quality stewards — have this same structure. The tactical layer is handled by the system. The human provides intent. Something is right about this framing. The role does shift upward. The work does become more about direction and less about execution. But like Nelson’s captains, the system acts well or poorly depending on the conditions it was given — and setting those conditions is neither passive nor simple.

Which brings up the question that I think is worth asking, and it comes from a different direction than Nelson’s battlefield. In machine learning, the difference between supervised and unsupervised learning is not about whether a human is present. It is about whether the system is receiving a corrective signal. Supervised learning trains on labelled examples: the human’s judgment is embedded in the data, shaping what the system learns. Unsupervised learning finds its own patterns without that corrective input. The human can be watching in both cases. What differs is whether their judgment is actually reaching the system. The same distinction, it turns out, applies to agentic execution: is the human actually supervising the system, or only appearing to?

The Supervision Spectrum: supervision is not a posture. It is a transfer of judgment, and where you sit on this spectrum determines whether you are genuinely in the loop or only appearing to be.

When a practitioner builds a thoughtful context, maintains a well-constructed harness, and monitors output against clear evaluation criteria, they are establishing the constraints under which the system operates — their judgment shapes the system’s behavior continuously, even when they are not actively intervening. Their presence is genuine, not formal.

When there is no context, no harness, no evaluation criteria — when the practitioner simply describes a task and waits for output — the system runs without those constraints. It operates on its pre-learned assumptions, finds its own path to the destination, and produces whatever that path yields. The human approves or redirects at the end. But their judgment has not shaped the execution. They have been present without supervising.

Cruise control with your hands near the wheel, eyes on the road, aware of the conditions around you — that is supervised autonomy. You have established the governing constraints: speed, lane, when to override. The system handles execution within them. A car driving you while you read, unaware of what it is doing — that is unsupervised. You are in the seat. You are not in the loop.

Most practitioners, in my observation, believe they are supervising when they are not. The role says participant. The reality is spectator. And the gap between those two positions is not visible from outside — it only surfaces when the system drifts and nobody notices.

Making supervision genuine requires something that is rarely discussed: a working model of the system’s characteristic behavior. Not its architecture — but its tendencies. Where it tends to be confident and wrong. What kinds of context produce what kinds of drift. How it responds when the informational environment degrades. That knowledge makes monitoring meaningful. Without it, watching the system is not supervision — it is observation with no basis for judgment. The practitioner who has built that working model through sustained encounter with the system’s actual behavior is in a fundamentally different position from the one who hasn’t. One can supervise. The other can only approve.

If the system has the agency, what am I? That was the question the first post left open. The more precise version: am I actually supervising, or only appearing to? This is the first condition of effective participation: whether your judgment is reaching the system at all.

The Temporal Dimension

Supervision is one dimension of condition-setting. Time is another — and equally consequential.

Driving is a real-time feedback loop. When the car drifts, you feel it through the wheel and correct it within seconds. The feedback latency is near zero. The learning — the gradual building of reflexes and pattern recognition — happens through thousands of these micro-corrections, each one closing the loop before the drift becomes a problem.

Most agentic workflows do not work this way.

An agent running a background task — processing documents, generating reports, executing a multi-step workflow — may complete the entire task before the practitioner sees the output. The loop closes not during execution but after. Sometimes long after. The practitioner reviews the result, identifies problems, and feeds corrections into the next iteration. The iteration is the unit of feedback, not the moment. And iterations are measured in hours or days, not seconds.

This changes what kind of learning is possible.

Real-time feedback builds reflexes. The driver who feels a thousand small corrections becomes someone whose hands know, without deliberate thought, when to adjust. That knowledge is available faster than conscious reasoning. It is the kind of judgment Gary Klein’s recognition-primed decision model describes: the felt sense that something is off before you can articulate why.

Latent feedback builds something different: analytical understanding. The practitioner who reviews completed outputs and diagnoses what went wrong develops a model of the system’s failure modes. That is valuable. But it is slower to build, less automatic in deployment, and less reliable in the moment when something is silently going wrong.

A reporting workflow shows where this weakness bites. A team used an agentic system to build one: weekly project summaries, risk flags, decisions needed. The early outputs looked good. Good enough that the team stopped reading them closely.

Six weeks in, the summaries were still arriving on time, well-formatted, professional in tone. Nobody noticed that the risk flags had quietly stopped appearing. The risks hadn’t gone away — the context had drifted. The documents, the ticket data, the status updates fed to the system had changed format or stopped being updated. The system, working with what it had, produced coherent summaries of an increasingly incomplete picture. It had no harness that required it to flag its own uncertainty.

The failure wasn’t dramatic. That’s what made it dangerous. A weaker system would have produced obviously wrong output — easy to catch in review. This one produced output that was plausible, well-reasoned, and wrong in ways that only surfaced when a decision was made on the basis of a risk that had never appeared in the summary.

Three conditions had degraded together. The intent had been set at the start and never revisited. The context had drifted without anyone noticing — not through negligence, but because nobody had built the habit of asking whether the informational environment still reflected reality. And the harness had no observability built in: no evaluation criteria flagging when risk items dropped below a threshold, no production monitoring signal surfacing the degradation before it became a decision problem. The feedback loop’s latency — weekly review, not continuous monitoring — meant that by the time anyone saw the problem, weeks of drift had accumulated.
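One of the missing pieces, an evaluation criterion that flags when risk items drop below a threshold, can be sketched in a few lines. This is a minimal illustration, not the team's actual harness: the `WeeklySummary` type, the `risk_flag_alerts` function, and the `min_ratio` threshold are all hypothetical names and values chosen for the example.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class WeeklySummary:
    """Hypothetical record of one agent-generated report."""
    week: int
    risk_flags: list[str]  # risks the agent surfaced that week


def risk_flag_alerts(history, latest, min_ratio=0.5):
    """Return alerts if the latest summary flags far fewer risks than
    the recent baseline. The 0.5 ratio is an illustrative assumption."""
    if not history:
        return []  # no baseline yet, nothing to compare against
    baseline = mean(len(s.risk_flags) for s in history)
    alerts = []
    if baseline > 0 and len(latest.risk_flags) < min_ratio * baseline:
        alerts.append(
            f"week {latest.week}: {len(latest.risk_flags)} risk flags, "
            f"baseline {baseline:.1f}; check inputs before trusting this summary"
        )
    return alerts


history = [
    WeeklySummary(1, ["vendor delay", "scope creep", "staffing gap"]),
    WeeklySummary(2, ["vendor delay", "staffing gap"]),
    WeeklySummary(3, ["vendor delay", "budget overrun", "staffing gap"]),
]
quiet_week = WeeklySummary(4, [])  # well-formatted, but no risks at all
print(risk_flag_alerts(history, quiet_week))
```

A check like this does not know whether the risks are real. It only notices that silence is unusual, which is enough to send a human back to read closely.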

A well-built harness does not only shape what the system does. It compresses the feedback loop, moving the practitioner from latent review toward genuine supervision. The practitioner who understands this designs their harnesses to surface drift early, not to diagnose it late. They are engineering not just the system’s behavior but the conditions under which their own judgment can operate in time to matter.
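As a rough sketch of what surfacing drift early might look like on the input side, a harness can check how fresh the context is before the agent runs, rather than leaving a reviewer to infer staleness from the output weeks later. The source names, the dates, and the seven-day `max_age` below are assumptions made up for illustration.

```python
from datetime import datetime, timedelta


def stale_sources(last_updated, now, max_age=timedelta(days=7)):
    """Return the names of input sources not updated within max_age, sorted."""
    return sorted(
        name for name, ts in last_updated.items() if now - ts > max_age
    )


# Hypothetical last-modified timestamps for the documents fed to the agent.
now = datetime(2026, 5, 14)
last_updated = {
    "ticket_export": datetime(2026, 5, 12),
    "status_updates": datetime(2026, 4, 2),
    "risk_register": datetime(2026, 3, 20),
}

stale = stale_sources(last_updated, now)
if stale:
    print("Context may have drifted; stale inputs:", ", ".join(stale))
```

Run before each report, a check of this kind moves the signal from the weekly review to the moment of execution, which is the compression the paragraph above describes.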

The Friction That Builds

The third dimension of condition-setting is the most counterintuitive.

Not all friction is an obstacle.

Some friction is a constraint in the productive sense — the resistance that builds skill, forces attention, and surfaces problems early. The driver who feels the car respond to road conditions — the slight pull on a wet surface, the resistance before a corner — is receiving feedback through that friction. It is information. Remove it entirely, and the driver loses not just discomfort but signal.

The same applies to human-AI collaboration. The effort of constructing a thoughtful context is friction. The discipline of building evaluation criteria into a harness is friction. The practice of monitoring actively rather than reviewing passively is friction. These are not inefficiencies to be automated away. They are enabling constraints — the conditions under which genuine participation is possible. They are where the practice loop lives.

Cognitive scientists have a name for this kind of effort: germane load — the mental work that doesn’t just consume attention but builds lasting knowledge and schema. Extraneous load is the friction to remove: the overhead of clunky interfaces, repetitive mechanical tasks, unnecessary complexity. Germane load is the friction to protect: the engagement with the problem that produces understanding, the encounter with the system that builds judgment.

When organizations optimize away the overhead of human involvement in agentic workflows, they often cannot tell the difference between the two. They remove both. What remains is faster output and a practitioner who is deskilling without noticing.

The question worth asking about any agentic system is not “how much friction has been removed?” but “which friction has been removed?” The friction of repetitive mechanical tasks — yes, remove it. The friction of active supervision, of maintaining context, of building and monitoring a harness — that friction is the practice loop. Remove it, and you have not made collaboration more efficient. You have made it formal.

This connects directly to the elevation narrative. Telling practitioners they are now operating at a higher level of abstraction, freed from the details of execution — that sounds like a removal of friction. And some of it is. But the friction that remains at the higher level is not the same friction. It is the friction of supervision: of maintaining clear intent, of building context that reflects the actual situation, of monitoring for drift before it becomes a problem. These are enabling constraints that hold the human-AI relationship together. Embrace them, and the practice loop runs. Remove them, and what remains is the approval loop in different clothing.

The engaged participant is not someone who has been freed from friction. They are someone who has learned which friction matters — and held onto it deliberately.

What Are You Becoming?

Kent Beck wrote that software design is an exercise in human relationships. In the age of agentic systems, the same observation extends to the human-machine relationship. The pairing is not symmetric: one party brings intent, judgment, and accountability; the other brings speed, pattern-matching at scale, and execution. And that relationship, like all the others, depends on the conditions you establish for it. The judgment, the attention, the accumulated understanding of how the system behaves — these cannot be delegated. They have to be earned, through practice, through encounter, through the kind of sustained engagement that changes the practitioner over time.

The first post ended with a question left deliberately open: if the system has the agency, what am I?

The answer is: it depends on the conditions you set. Whether you establish constraints that make your supervision genuine rather than formal. Whether you engineer the feedback loop so your judgment reaches the system in time to matter. Whether you preserve the productive friction — the effort of context, harness, and active monitoring — that keeps you genuinely in the loop rather than drifting toward spectator.

There are practitioners who do all three. They build context deliberately, so their judgment shapes the system’s execution from the start. They design harnesses that compress the feedback loop and surface drift early. They treat the effort of active supervision as the practice through which real expertise develops: not overhead to be minimized but a condition to be maintained. They have developed, through sustained encounter with the system’s behavior, the recognition-primed judgment that only forms inside the practice loop: the felt sense of when something is right and when it has quietly moved away from what was wanted.

These practitioners are not passengers. They are working with a system the way experienced drivers work with capable cars — neither purely in charge, both necessary, the quality of what gets built emerging from the quality of the relationship and the conditions that hold it together.

And there are practitioners who do not. Who set intent vaguely and accept what comes back. Who use default contexts because building a real one takes time and judgment that hasn’t been developed. Who have no harness, no evaluation criteria, no signals that surface drift early. Who review outputs when the batch is complete and call that supervision. Who are, in the precise sense the first post described, in the approval loop rather than the practice loop.

Both carry the same role description. The difference is not visible from the outside, and not yet measured by most organizations. But it is real. And as systems become more capable, as the stakes grow higher, as the moments when human judgment matters most become more consequential, the difference will surface.

There is no single word for what the practitioners who get this right actually are — every candidate (director, partner, orchestrator) catches something and misses more. What matters is not the label but the relationship: whether it is genuine or formal, shaped by deliberate condition-setting or left to drift on defaults.

The practice loop has migrated, not closed. Whether it migrates well — whether practitioners establish the right conditions, whether organizations understand that the friction of supervision is not a cost to eliminate but a capability to protect — none of that is guaranteed.

The engaged participant is not a fixed role waiting to be filled. It is something you become — or don’t — through the conditions you choose to maintain.

Next: what happens when the system fails — the sharp edge, the blunt edge, and the conditions that determine whether failure is caught or missed.

Notes and References

Gary Klein: Recognition-Primed Decision Making
Gary Klein, Sources of Power: How People Make Decisions (MIT Press, 1998; 20th Anniversary Edition, 2017). Klein’s recognition-primed decision (RPD) model describes how experienced practitioners draw on pattern recognition built through sustained encounter rather than evaluating options sequentially. The firefighter example in the first post, and the concept of naturalized judgment used throughout this series, draw on Klein’s foundational research. Klein has collaborated with Daniel Kahneman and his work has been widely applied across fields including military planning, nursing, and aviation.

Kent Beck: Software Design as Human Relationships
Kent Beck, Tidy First? A Personal Exercise in Empirical Software Design (O’Reilly, 2023). The observation that “software design is an exercise in human relationships” opens Beck’s book and is developed through his Substack, Software Design: Tidy First? (tidyfirst.substack.com). Beck has noted he wasn’t entirely sure what the observation meant when he first wrote it, which is part of what makes it useful here: it holds open a question rather than closing one.

Stanley McChrystal: Eyes On, Hands Off and Team of Teams
The “eyes on, hands off” framing is described by the McChrystal Group in their writing on empowered execution and shared consciousness: McChrystal Group, “Eyes On, Hands Off: How to Empower Your Team and Get Out of Their Way,” Medium, 2016 (medium.com/@mcchrystalgroup). The Nelson/Trafalgar example appears in Stanley McChrystal et al., Team of Teams: New Rules of Engagement for a Complex World (Portfolio/Penguin, 2015), p. 25 and surrounding discussion. McChrystal’s argument: Nelson’s genius lay not in his real-time command at Trafalgar but in the years of shared doctrine and trust that preceded it. His captains were “entrepreneurs of battle” capable of independent action precisely because the conditions for that action had been built in advance.

John Sweller: Cognitive Load Theory and Germane Load
The germane/extraneous load distinction draws on John Sweller’s cognitive load theory, initially developed in: Sweller, J. (1988). “Cognitive load during problem solving: Effects on learning.” Cognitive Science, 12(2), 257–285. The three-way distinction between intrinsic, extraneous, and germane load was developed further in Sweller, J., van Merriënboer, J.J.G., and Paas, F.G. (1998). “Cognitive architecture and instructional design.” Educational Psychology Review, 10(3), 251–296. A note of intellectual honesty: Sweller later revisited the germane load concept, suggesting it may be more closely related to intrinsic load than originally formulated. The framework is used here as a useful lens for distinguishing productive friction from overhead, not as a settled taxonomy.

Supervised and Unsupervised Learning: The ML Parallel
The distinction between supervised and unsupervised learning as used here draws on standard machine learning terminology rather than any single source. The application of this distinction to agentic execution (that the presence of a human does not constitute supervision unless corrective signal is actually reaching the system) is the author’s own extension of the framework, not a documented concept in the ML literature. Readers familiar with ML will recognize the analogy; it is intended as a clarifying frame, not a technical claim.

Context Engineering, Harness Engineering, Intent-Based Development
These terms are emerging in the software and AI development community but are not yet standardized. Context engineering and harness engineering are increasingly used by practitioners working with large language models and agentic systems. Intent-based development has antecedents in autonomic computing and network engineering. Their use here reflects current practitioner vocabulary rather than established academic definitions.

Series context
This post is the second in a series on agency, craft, and the human role in agentic systems. The first post, Passenger in the Driver’s Seat: What happens to builders when systems learn to act on their behalf, introduced the concepts of the practice loop, the approval loop, and the agent/actant distinction (drawn from actor-network theory, particularly Bruno Latour’s Reassembling the Social, Oxford University Press, 2005). Readers new to the series may find it useful to read that post first.

The Adjacent Possible
