Essay №08 AI · change management · strategy

The Ghost Competence

Deploying agents from day one doesn't remove the foundational learning phase — it makes it invisible, producing workers who look productive but never built the judgment to supervise what they delegate.

All else being equal, A340 crews — who averaged 14,969 total flight hours — performed significantly worse than A320 first officers, who had only 3,438.[1] The least competent airline pilots on a manual-flying test were not the beginners. They were the most experienced captains: those who had spent years at the controls of a highly automated Airbus A340, accumulating flight hours without accumulating active practice. The Fleet factor — a proxy for how often pilots landed manually — explained 45% of the variance in flying performance; the Rank factor — a proxy for experience and seniority — explained 8%. This inversion — more hours, less competence — is not specific to aviation. It is the structure of all learning in domains where competence is partly tacit and acquired through active practice against real cases, once that practice is replaced by an automation one never directly confronts. Agentic AI reproduces this mechanism at unprecedented speed and scale: organisations that deploy agents from a worker's first day on the job create, at scale, users who resemble the A340 captains — productive on the surface, miscalibrated underneath, and unable to perceive their own deficit.

This is not a matter of individual motivation or curiosity. Two distinct cognitive mechanisms block the formation of competence in an agent-first regime; a third prevents the miscalibration from producing any observable signal. The argument of this essay is structural: agent-first deployment does not remove the foundational learning phase — it makes it invisible.

Expertise is not formed by exposure

The conceptual error at the heart of agent-first deployment is to confuse familiarity with competence. A user who interacts daily with an agent for six months accumulates exposure — a familiarity with output patterns, with the interface, with the kinds of response the model usually produces. What they do not accumulate, for lack of opportunity, is the competence of supervision: the tacit schemas that let one detect when an output crosses the line between the plausible and the correct.

The distinction is old in cognitive psychology. Ericsson, Krampe and Tesch-Römer established in 1993, on the basis of two studies of violinists and pianists, that expert performance is the product not of raw experience but of deliberate practice, which they define as a "highly structured activity, the explicit goal of which is to improve performance: specific tasks are designed to overcome weaknesses, and performance is carefully monitored to provide cues for ways to improve it" (p. 368).[2] The phrasing looks technocratic, but it describes a real cognitive constraint: "In the absence of adequate feedback, efficient learning is impossible and improvement is minimal even for highly motivated subjects."[2] Exposure without granular feedback, without repeated confrontation with error, and without active attempts at the edge of one's competence generates no expertise. It generates familiarity.

"The mere repetition of an activity will not automatically lead to improvement in, especially, the accuracy of performance."

— Ericsson, Krampe & Tesch-Römer (1993), p. 367

The empirical datum is precise: the best violinists at the West Berlin Academy of Music had accumulated 7,410 hours of deliberate practice by age 18, against 5,301 hours for the "good" violinists and 3,420 hours for the future music teachers. That is a gap of roughly 2.2 to 1 between the first and third groups (7,410 / 3,420 ≈ 2.17), entirely attributable to the quality of practice, not to the number of years of exposure.[2] It is not the quantity of hours spent with an instrument that forges expertise; it is the structure of those hours: active attempt, visibility of error, calibrated feedback.

Agentic AI removes exactly these three elements. It produces outputs in the user's place, so the user makes no attempt. It absorbs intermediate errors before they become visible, so the user receives no granular feedback on the model's sub-processes. And it delivers a result that looks like the right answer without exposing the heuristics by which it got there, so the user cannot "attend to the critical aspects of the situation."[2] The agent-first user is not incompetent; they are functionally productive. That is precisely the problem, in the long run and above all for the junior population.

A second mechanism reinforces the first. Kalyuga, Ayres, Chandler and Sweller documented what they call the expertise reversal effect: instructional techniques that work with novices can lose their effectiveness — and even produce negative effects — when used with more advanced learners.[3] The mechanism is precise: "Cross-referencing and integrating related redundant components of information will require additional working memory resources and may cause a cognitive overload. This additional cognitive load may occur even if the learner recognises that the instructional materials are redundant and decides to ignore them as much as possible. Redundant information is frequently hard to ignore."[3]

Call this mechanism cognitive redundancy — it is more precise than the popular notion of overload. Agentic AI does not overload the user; it renders them cognitively redundant relative to their own partial schemas. As soon as a user has acquired a fragmentary understanding of a task, the agent performing that task in their place continually presents an already-formed output that their brain must process in parallel with its own nascent thinking — consuming the working-memory resources that would have served to consolidate autonomous schemas. The guidance cannot be ignored; it imposes itself cognitively even when one wants to set it aside.

The two mechanisms add up; they do not overlap. The first, drawn from Ericsson, blocks the formation of the founding schemas: without active attempt and calibrated feedback, the schemas do not form. The second, drawn from Kalyuga, blocks the residual formation that might have happened anyway: the ever-present agentic scaffolding competes cognitively with the partial schemas, preventing their consolidation. This is not a mere learning delay. It is a double structural block.

Confidence without a learning loop

Competence is not enough to supervise an agent. One also needs calibrated trust — trust that matches the system's real capabilities, that varies by task type, that adjusts on observed performance. Calibrated trust distinguishes the case where the agent can be followed without verification from the case where it must be closely checked. Without it, supervision is random: it over-controls the reliable zones and under-controls the risky ones.

Lee and See (2004) established the founding framework of research on trust in automation by identifying the three bases on which an operator's trust rests: performance (what the system does: its competence, its reliability, the regularity of its results), process (how it works: the degree to which its mechanisms are appropriate and comprehensible), and purpose (why it was designed: the intent of its creator).[4] These three bases are not equally accessible. Unable to observe the process of an LLM, which is opaque by nature, the user grounds initial trust on perceived purpose: the tool's reputation, the deployment narrative, the commercial promise. This is trust by proxy, with no anchor in observed performance. Lee and See's central observation about calibration is counter-intuitive: trust is not built by decision, it is calibrated by use. "Because trust depends so heavily on observing the behaviour of the automation, in most cases the automation must be relied upon for trust to grow."[4] The loop is closed: without use, no observation of behaviour; without observation, no calibration; without calibration, trust stays stuck at the level of initial faith in the tool's reputation. The agent-first user who delegates immediately uses the agent but observes only its finished outputs, and so never enters this cycle. Their trust is real; it is simply not calibrated to reality.

From this architecture follow two properties trust must satisfy to be operationally useful: resolution and specificity.[4] Resolution is the precision with which the trust assessment matches the system's real competence profile: high-resolution trust is differentiated (the user knows the agent is reliable for X but not for Y); low-resolution trust is global and undifferentiated. Specificity is the anchoring of that differentiation in precise task categories, not in general impressions of tool quality. Together these properties allow the crucial distinction between trustworthy automation (reliable in its successes, but not necessarily comprehensible in its failures) and trustable automation (whose limits can be learned, mapped and therefore corrected). Current agentic LLMs are highly trustworthy within their domain of competence — their outputs are convincing, their errors rare on the surface — but structurally not very trustable: their failures are not deterministic, not predictable from the output alone, not correlated with signals the user could learn to read. The result: low-resolution trust, with no task specificity, and hence no discriminating capacity where it is needed.
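
The difference between low-resolution and high-resolution trust can be made concrete with a small sketch. Everything in it is a hypothetical illustration: the task categories, the reliability figures and the `calibration_gap` helper are assumptions chosen for the example, not measurements from Lee and See or from any real deployment.

```python
from statistics import mean

# Hypothetical observed reliability of an agent per task category
# (share of outputs judged correct by an independent check).
observed_reliability = {
    "summarise_meeting": 0.95,
    "draft_contract_clause": 0.70,
    "reconcile_ledger": 0.50,
}


def calibration_gap(trust: dict[str, float], reliability: dict[str, float]) -> float:
    """Mean absolute difference between stated trust and observed reliability."""
    return mean(abs(trust[task] - reliability[task]) for task in reliability)


# Low-resolution trust: one global impression applied to every task.
global_trust = {task: 0.90 for task in observed_reliability}

# High-resolution trust: differentiated by task category (illustrative values
# standing in for what a user might learn after observing errors).
differentiated_trust = {
    "summarise_meeting": 0.90,
    "draft_contract_clause": 0.65,
    "reconcile_ledger": 0.55,
}

low_res_gap = calibration_gap(global_trust, observed_reliability)
high_res_gap = calibration_gap(differentiated_trust, observed_reliability)

print(f"low-resolution gap:  {low_res_gap:.2f}")
print(f"high-resolution gap: {high_res_gap:.2f}")
```

Under these invented numbers, the global impression is nearly right for the easy task and badly wrong for the risky one, which is exactly where discrimination matters. The mean absolute gap is only one possible way to score calibration; it is used here for readability, not because the cited literature prescribes it.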

Lee and See note a modulating effect that deployment discourse systematically ignores: the decision to delegate depends not on trust in the AI alone, but on the differential between that trust and the operator's trust in their own capabilities (self-confidence). "A small difference in the predisposition to trust may have a substantial effect when it influences an initial decision to use automation."[4] Users with high confidence in their own competence tend to start with a lower initial level of trust in automation — which, paradoxically, engages them more in the calibration loop because they check before delegating. Conversely, those who doubt their own competence delegate more readily and form uncalibrated trust. This is where the mechanism becomes circular and especially dangerous for juniors: the eviction of the foundational phase described above erodes competence, and hence self-confidence; low self-confidence lowers the delegation threshold; and the more one delegates, the less one practises, the more competence erodes. Delegation becomes inevitable — not because the AI is reliable, but because the operator no longer trusts themselves to do otherwise. For juniors in an agent-first regime, both factors compound from day one: lower confidence in their partial schemas (beginner status) and maximal delegation. The configuration maximises precisely the low-resolution, uncalibrated trajectory.
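
The circular mechanism described above can be sketched as a toy simulation. It is a deliberately crude model, not an empirical one: the threshold rule, the `practice_gain` and `decay` parameters, and the starting values are all hypothetical, chosen only to show how the direction of the trust differential can lock in a trajectory.

```python
def simulate_spiral(
    competence: float,
    trust_in_agent: float,
    steps: int = 12,
    practice_gain: float = 0.05,
    decay: float = 0.05,
) -> list[float]:
    """Toy model: return the competence trajectory over `steps` periods.

    Assumptions (all hypothetical):
    - self-confidence simply tracks current competence;
    - the worker delegates a period's work when trust in the agent
      exceeds self-confidence, and does the work manually otherwise;
    - manual work raises competence, delegation lets it decay.
    """
    trajectory = [competence]
    for _ in range(steps):
        self_confidence = competence
        if trust_in_agent > self_confidence:
            # Delegated: no practice this period, competence erodes.
            competence = max(0.0, competence - decay)
        else:
            # Did the work: competence grows, capped at 1.0.
            competence = min(1.0, competence + practice_gain)
        trajectory.append(competence)
    return trajectory


# Same agent, same trust in it; only the starting competence differs.
junior = simulate_spiral(competence=0.3, trust_in_agent=0.6)
senior = simulate_spiral(competence=0.7, trust_in_agent=0.6)

print("junior:", [round(c, 2) for c in junior])
print("senior:", [round(c, 2) for c in senior])
```

In this sketch the junior never crosses the threshold at which they would attempt the work themselves, so their competence only declines, while the senior's only grows. Real trajectories would be noisier and trust in the agent would itself change over time; the point is limited to the self-reinforcing shape of the loop.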

Hoff and Bashir, in a qualitative meta-analysis of 127 studies published between 2002 and 2013, refine this framework by distinguishing three layers of trust in automation. Dispositional trust is a stable, context-independent trait — a general predisposition to trust technologies. Situational trust depends on the immediate interaction context — the interface, the tool's reputation, the first demonstration. Learned trust, finally, is the only one that reflects the system's real capabilities: it is built through repeated interaction, adjusts on observed errors, and incorporates knowledge of failure conditions.[5] It is this third layer that agent-first deployment structurally prevents from forming.

"People commonly assume that machines are perfect. Therefore, their initial trust is based on faith. However, this trust quickly dissipates after system errors; as relationships with automated systems progress, reliability and predictability replace faith as the primary basis of trust."[5] This replacement — of faith by calibration — requires observable errors. The agent-first user does not observe errors: they observe outputs. The model's intermediate errors — its hesitations, its abandoned paths, its hallucinations corrected before delivery — are invisible. Faith cannot turn into calibration because the calibration data never reaches it.

"Design for appropriate trust, not greater trust."

— Lee & See (2004), p. 75

Lee and See's maxim directly contradicts deployment rhetoric that maximises adoption without distinguishing enthusiasm from calibration. An organisation that measures the success of its AI deployment by usage rate may be maximising dispositional and situational trust while doing nothing for learned trust, resolution, or task specificity. It deploys engaged users whose trust matches no learned reality. It confuses the indicator with the objective.

Hoff and Bashir document one further aggravating mechanism: the operator taken out of the loop progressively loses the ability to take back control. "Although high-level automation can perform tasks faster, human operators are 'out of the loop.' This means they can no longer prevent system errors; these must be dealt with after they occur. Moreover, out-of-the-loop operators can become dependent on automation to perform the tasks — and if failures occur, they will likely have much greater difficulty performing the tasks manually."[5] Dependence is not a metaphor; it is a cognitive mechanism documented across dozens of studies of military, medical and transport decision-support systems. Agentic LLMs reproduce it in any professional environment where they absorb the founding tasks before supervision competences have formed.

Degradation without a signal

What distinguishes the agent-first problem from an ordinary learning curve is its organisational invisibility. The organisation that deploys agents first sees productivity rise immediately — the short-term gain is real and measurable. What it does not see is the degradation of supervision competence: a deficit that surfaces only in edge cases, in the rare but costly errors, in the situations where the agent exceeds its reliability frontier and no one detects it.

Haslbeck and Hoermann's data document exactly this structure. In their study of 126 Airbus airline pilots, the A340 captains did not see their relative incompetence — they simply had no recent reference by which to perceive it. Two to four manual approaches per month are not enough of a signal about one's own gaps.[1] The miscalibration did not need to be measured to be structurally present: it was inevitable in the very geometry of the deployment. And the organisation — the airline — did not perceive it either. These captains' operational indicators were intact. Their productivity was intact. Only an objective test revealed the gap.

Haslbeck and Hoermann add a warning that organisations deploying agents should read carefully: "Such interventions should be applied in the early stages of a pilot's career before degradation can set in. Otherwise, avoidance behaviours and a feeling of discomfort with manual flying could lead to a negative spiral of increasingly rare manual-flying behaviour."[1] The degradation is self-reinforcing. The user who has not developed the founding competences progressively avoids the situations that would demand them; that avoidance reinforces the deficit; the deficit reinforces the avoidance. The organisation, which measures productivity and not supervision competence, sees nothing.

Beane documents the same structure in operating theatres. His ethnography across five North American hospitals between 2013 and 2015 reveals how robotic surgery restructured surgical training so as to make legitimate peripheral participation virtually impossible for residents.[6] The point worth stressing here is not shadow learning — the few exceptional residents who found workarounds to practise outside the official framework. These residents — twelve out of thirty-three, in Beane's data — constitute an active, intentional minority, a response to the constraints of the deployment. What this essay points to is the majority who do not improvise: the residents who watched the robots operate, filled in their training logs, obtained their certifications, and left the curriculum legally licensed but practically under-competent.

A senior surgeon's formula is blunt: "Watching a film doesn't make you an actor, you see what I mean? [...] The younger ones become deficient because they watch a lot and do nothing on the robot — and they become deficient at open surgery because most of the procedures move to the robot."[6] Observation, however attentive, however repeated, however motivated, does not engage the same cognitive circuits as execution. This is exactly the mechanism Ericsson identified in the West Berlin conservatories. Beane documents it in robotic-surgery operating theatres. Transposing it to agentic-AI professional environments is an inference, not direct proof — but the structure is rigorously homologous.

"A340 captains could not use their immense flying experience as an advantage for manual-flying tasks. Surprisingly, they performed rather worse than the lower-ranked pilots."

— Haslbeck & Hoermann (2016), pp. 16–17

What Beane adds to Haslbeck's quantitative data is the dimension of subjective invisibility. The residents who had not practised did not know they had not practised — or rather, they knew they had watched, and could not gauge the gap between watching and doing. One of them puts it in an admission of helplessness worth reading closely: "You also have this love-hate relationship with it, because for four years you watched others do it. For me, I'd been on the robot a few times, very scattered and never for long. And you really don't get the feel for it when you do it intermittently."[6] This is not ignorance of one's own incompetence; it is something more precise — the inability to assess the gap between what one has observed and what one would be able to do. Miscalibration is invisible from the inside.

According to a 2015 study cited by Beane, "the average urologic surgeon in the U.S. who did any robotic surgery performed one robotic prostatectomy a year" (Chang et al., 2015, cited in Beane, 2019, p. 87). The gap between certification and actual practice had widened to the point of being structurally absurd. The certifying organisation (the medical curriculum, the hospitals) had issued licences without ensuring the corresponding competences had formed. Agentic AI can produce the same result, more quietly, across hundreds of professional environments simultaneously.

In "Profession," Isaac Asimov imagines a future where knowledge is directly printed into the brain on professional orientation days. The regime looks liberating — no one suffers through years of laborious learning. The story then turns the intuition inside out: the creative elite is precisely the one that cannot be printed, forced to learn through trial, error and backtracking. What Asimov's fiction shows in the positive — the irreplaceable value of the path that forges competence — this essay observes in the negative: in today's agent-first environment, it is the formation through prior immersion that is short-circuited. Professionals receive competence "printed" as outputs produced by the agent, without ever having built it. They end up in the position of the "Franchise" voter — legitimately licensed, structurally miscalibrated — without the founding friction that, in Asimov, characterises the elite capable of inventing knowledge rather than consuming it.

The blind spot the indicators do not measure

The blind spot of agent-first deployment is not technical. Organisations that deploy agents from a worker's first day make a correct short-term calculation: productivity rises, outputs are smooth, the indicators are green. It is not a calculation error. It is a metric problem.

What these organisations do not measure is what failed to form: the tacit schemas that distinguish a correct output from a plausible-but-wrong one; the trust calibrated on real experience rather than on the tool's reputation; the competence to actively take back control in the cases where the agent exceeds its reliability frontier. These deficits produce no immediate signal. The ghost competence is a ghost precisely because it is visible neither to the user, who is unaware of it, nor to the organisation, which has no indicator to measure it.

The three mechanisms documented here are cumulative. The absence of deliberate practice prevents the formation of schemas (Ericsson). The continuous cognitive redundancy of the agent interferes even with the residual formation that frequent exposure might have produced (Kalyuga). The absence of formative interaction prevents trust from moving from initial faith to learned calibration (Lee & See; Hoff & Bashir). And surface productivity masks the underlying degradation until edge cases reveal it, too late to correct without high cost (Haslbeck; Beane).

The limit of this reasoning must be named: the four domains brought together here — aviation, music, robotic surgery, the cognitive psychology of instruction — constitute a homology, not direct proof. No longitudinal study yet measures the degradation of supervision competence in professionals trained under an agent-first regime. The argument is structurally consistent across distinct contexts, but it remains, for now, a cross-domain inference awaiting empirical validation in the specific domain of agentic AI.

The question organisations should ask at deployment time is not "how do we maximise adoption?" but "what must the user be able to do before they can delegate in an informed way?" This is not a question of training in the classic sense: it is a question of sequencing. Expertise, calibrated trust, and supervision competence are built in a precise order, not in any order, and certainly not by starting with total delegation.

The ghost competence does not show up in standard performance indicators. It shows up in edge situations: the costly error no one detected because no one was calibrated to recognise it; the critical decision delegated to a tool whose limits had never been learned; the crisis where the user who should take back control discovers they do not know how. At that moment, the competence is no longer a ghost — it is simply absent. And the organisation that could have chosen a different sequence at deployment time no longer has that choice.

The operational conclusion is not "slow down deployment." It is to recognise that the slow phase has a precise content — exposure to edge cases, low-stakes errors, visible real-time correction — and to design it deliberately rather than suppress it by default. A competent agentic deployment begins with a period of explicit calibration in which users work with the system on tasks whose quality they can evaluate without the system. Organisations that skip this sequence do not save time — they externalise the cost of miscalibration onto future decisions, where it will be harder to identify and costlier to correct.
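
One way to picture such a calibration period is as a gate that opens delegation for a task category only once the user has shown they can catch the agent's errors in it. The sketch below is hypothetical throughout: the `CalibrationRecord` structure, the `may_delegate` rule and both thresholds are illustrative assumptions, not a procedure drawn from the cited studies.

```python
from dataclasses import dataclass


@dataclass
class CalibrationRecord:
    """Hypothetical per-category log kept during the calibration period."""

    reviewed: int = 0       # agent outputs the user checked against their own work
    known_errors: int = 0   # reviewed outputs that were in fact wrong
    errors_caught: int = 0  # wrong outputs the user actually flagged

    def log_review(self, output_was_wrong: bool, user_flagged: bool) -> None:
        self.reviewed += 1
        if output_was_wrong:
            self.known_errors += 1
            if user_flagged:
                self.errors_caught += 1

    def detection_rate(self) -> float:
        if self.known_errors == 0:
            return 0.0  # no evidence yet that the user can spot failures
        return self.errors_caught / self.known_errors


def may_delegate(
    record: CalibrationRecord,
    min_reviews: int = 20,        # illustrative threshold
    min_detection: float = 0.8,   # illustrative threshold
) -> bool:
    """Allow unsupervised delegation only after enough demonstrated checking."""
    return record.reviewed >= min_reviews and record.detection_rate() >= min_detection


# A user who reviewed 20 outputs, 5 of them wrong, and caught 4 of the 5.
calibrated = CalibrationRecord()
for i in range(20):
    calibrated.log_review(output_was_wrong=(i < 5), user_flagged=(i < 4))

# A user who starts delegating on day one has no record at all.
day_one = CalibrationRecord()

print(may_delegate(calibrated))
print(may_delegate(day_one))
```

The design choice worth noting is that the gate depends on errors the user has actually encountered and flagged, not on usage volume: a record with many reviews but no observed failures stays closed, which mirrors the essay's distinction between exposure and calibration.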

[1] Haslbeck, A., & Hoermann, H.-J. (2016). Flying the needles: Flight deck automation erodes fine-motor flying skills among airline pilots. Human Factors, 58(4), 533–545. doi:10.1177/0018720816640394

[2] Ericsson, K. A., Krampe, R. T., & Tesch-Römer, C. (1993). The role of deliberate practice in the acquisition of expert performance. Psychological Review, 100(3), 363–406. doi:10.1037/0033-295X.100.3.363

[3] Kalyuga, S., Ayres, P., Chandler, P., & Sweller, J. (2003). The expertise reversal effect. Educational Psychologist, 38(1), 23–31. doi:10.1207/S15326985EP3801_4

[4] Lee, J. D., & See, K. A. (2004). Trust in automation: Designing for appropriate reliance. Human Factors, 46(1), 50–80. doi:10.1518/hfes.46.1.50_30392

[5] Hoff, K. A., & Bashir, M. (2015). Trust in automation: Integrating empirical evidence on factors that influence trust. Human Factors, 57(3), 407–434. doi:10.1177/0018720814547570

[6] Beane, M. (2019). Shadow learning: Building robotic surgical skill when approved means fail. Administrative Science Quarterly, 64(1), 87–123. doi:10.1177/0001839217751692
