Why Essay Marks Are Less Certain Than We Pretend
Drawing on my experience as a teacher and edtech founder, this essay explores why essay marking is noisier than we like to admit, and why the most valuable part of judgement is often the feedback, not the mark.
Quick Summary
- Essay marking is inherently noisy: two careful markers can reasonably arrive at different marks.
- That variability is usually noise (random judgement variation), not bias, and exam systems quietly acknowledge it.
- In formative use, the most useful question is less 'Was the mark right?' and more 'Was the feedback helpful?'
- AI often feels unsettling because it makes normal disagreement visible by offering a second plausible reading.
- Used well, AI can act as a training lens that helps students make sophisticated thinking exam-legible.
A few years ago, as an experiment, I re-marked a set of essays I had marked the previous week. Same answers, same mark scheme, same me. A few of them came out two or three marks different. Not because I had been careless the first time. I had marked them carefully, conscientiously, and with confidence in my judgements. The second reading was just as careful. Both marks were defensible. Neither felt wrong.
That experience has stayed with me, partly because it is so ordinary. Many teachers have a version of this, although few of us have the luxury of re-marking entire sets of essays just to satisfy our curiosity. We do not tend to dwell on it, because it sits uncomfortably with how we present marks to students: as precise, authoritative, and deserved. Acknowledging uncertainty feels like undermining confidence, even when that uncertainty is an inevitable feature of judgement rather than a failure of it.
I was reminded of this recently while reading The Drunkard's Walk: How Randomness Rules Our Lives, by Leonard Mlodinow (2008), a book about how often we mistake certainty and skill for what is, in reality, chance and noise. One of its central arguments is that in complex systems, particularly those involving human judgement, outcomes vary more than we like to admit, even when people are acting competently and in good faith. I think essay marking fits that description uncomfortably well.
What actually happens when experienced markers read the same essay
One of the more uncomfortable realisations, once you look closely at marking, is that disagreement between experienced teachers is not uncommon. Give the same extended essay to two conscientious markers, both familiar with the specification and both applying the mark scheme in good faith, and it should not be surprising if their marks differ by several points. This is particularly true in the middle and upper ranges, where strengths and weaknesses coexist and the task is no longer to count errors but to weigh quality.
This is not because mark schemes are vague or examiners are careless. It is because long essays demand holistic judgement. Markers are asked to decide not just whether particular elements are present, but how much they matter. How sustained is the analysis? How convincing is the evaluation? Do the conceptual strengths outweigh the structural weaknesses? These are not questions with mechanical answers. They require interpretation, emphasis, and professional judgement, and that judgement inevitably varies.
Exam systems quietly acknowledge this reality. Tolerance bands exist. Remarks are permitted. Senior examiners talk about "best fit" rather than perfect alignment. None of this would be necessary if essays had a single, objectively correct score waiting to be discovered. Yet in schools, we often behave as though such precision exists. A mark becomes fixed, authoritative, and oddly fragile. Any deviation feels like an error rather than a reasonable alternative reading. Normal professional disagreement is treated as a problem, when in fact it is simply the visible edge of a much messier, human process.
Noise, not bias
This kind of disagreement feels unsettling, but it becomes easier to understand once we separate two ideas that are often conflated: bias and noise. Bias involves systematic distortion, favouring one outcome or group over another. Noise, by contrast, refers to random variability in judgement. These are the small, often invisible differences that arise even when people share standards, training, and good intentions.
This distinction is explored in Noise: A Flaw in Human Judgement, by Daniel Kahneman, Olivier Sibony and Cass Sunstein (2021), which shows that professional judgement varies far more than we tend to assume across fields such as medicine, law, and recruitment. The conclusion is not that professionals are unreliable, but that judgement itself is inherently noisy. The more complex and qualitative the task, the harder it is to eliminate that variability entirely.
Essay marking sits squarely in this category. Long responses rarely succeed or fail on a single dimension. They contain moments of insight alongside lapses in clarity, strong knowledge paired with uneven structure. Deciding how much each of these features should count is not something a mark scheme can fully specify in advance. At some point, the marker has to decide what matters most. Different markers, quite reasonably, will sometimes make slightly different calls.
Once we recognise this, disagreement stops looking like failure and starts to look like an inevitable feature of assessing complex work.
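To make the distinction concrete, here is a minimal simulation sketch in Python. Everything in it is an illustrative assumption rather than data from any real marking exercise: a notional "fair" mark for each essay, two unbiased markers who each carry a small amount of random judgement variation, and a third marker with a deliberate systematic lean. The point is simply that the first two can disagree by a couple of marks on many scripts without either of them being biased, while the third drifts consistently in one direction.

```python
import random

random.seed(1)

# Illustrative assumption: each essay has a notional "fair" mark out of 25.
fair_marks = [random.randint(8, 24) for _ in range(200)]

def marker(fair, spread=1.5, bias=0.0):
    """One conscientious marker: `bias` is any systematic lean,
    `spread` is the random judgement variation on each script."""
    mark = fair + bias + random.gauss(0, spread)
    return max(0, min(25, round(mark)))

marker_a = [marker(m) for m in fair_marks]             # careful, unbiased, a little noisy
marker_b = [marker(m) for m in fair_marks]             # equally careful, equally unbiased
marker_c = [marker(m, bias=-2.0) for m in fair_marks]  # systematically harsh: bias, not noise

gaps = [abs(a - b) for a, b in zip(marker_a, marker_b)]
print(f"A vs B mean gap: {sum(gaps) / len(gaps):.2f} marks")
print(f"Scripts where A and B differ by 2+ marks: {sum(g >= 2 for g in gaps)} of {len(gaps)}")

drift = sum(c - f for c, f in zip(marker_c, fair_marks)) / len(fair_marks)
print(f"Marker C's average drift from the fair mark: {drift:+.2f} (systematic, i.e. bias)")
```

Run it with different seeds and the individual gaps change, but the pattern does not: noise shows up as scatter between two equally careful readers, while bias shows up as a consistent shift in one direction.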
Why AI marking feels unsettling and what it exposes
This distinction matters because it changes how we interpret disagreement when AI enters the picture. For many teachers, the unease around AI marking is not really about technology. It is about comparison. When an AI produces a judgement alongside a human one, differences that were previously invisible suddenly become explicit. A script that once carried a single authoritative mark now has two plausible interpretations attached to it.
It is tempting to conclude that the AI must be introducing inconsistency into a process that was otherwise stable. But that gets the direction of causality wrong. The inconsistency was already there. What the AI changes is not the nature of the judgement, but its visibility. Before, a teacher's mark simply was the mark. There was no counterfactual to reveal how contingent that judgement might be.
AI removes that insulation. By placing two readings side by side, it makes normal noise feel like error. In some cases, the AI's judgement is clearly weaker or misaligned. In others, it is coherent, defensible, and uncomfortably different from our own. That discomfort does not tell us which judgement is correct. It tells us that we are dealing with a task where multiple reasonable readings are possible.
From "Was the mark right?" to "Was the feedback useful?"
Once disagreement is understood as a feature of judgement rather than a flaw, the focus naturally shifts. In formative contexts, the question "Was this the right mark?" is usually the wrong one. What matters far more is whether the judgement, imperfect as it may be, helps the student understand how to improve.
A mark is a blunt instrument. It compresses a complex response into a single number and invites comparison rather than reflection. Feedback, by contrast, has the potential to travel. It identifies patterns, surfaces misconceptions, and points students towards the specific skills they need to develop next time. That is what actually raises performance.
This reframes concerns about AI being "a few marks out". The more important questions become: does the feedback align with what examiners actually reward? Does it help students see what successful answers do, not just what they contain? And would acting on this feedback make the next essay stronger? Even here, there is a legitimate tension. What examiners reward is not always identical to what we might value in the subject at its richest, and the line between teaching to the exam and teaching good thinking is not always comfortable. That is a wider issue, and an important one, but not the question I am trying to resolve here.
In the context of formative practice, the immediate concern is whether feedback gives students clearer sight of how their work is being read, and what they can do to make it more effective under exam conditions.
When I first started working on Teach Edge, I assumed its main value would be productivity. Marking essays takes time, and any tool that reduced that burden felt worthwhile on its own terms. What I did not anticipate was how often teachers would prioritise the feedback itself over any time saving. Some barely use it to reclaim hours at all. Instead, they use it to give students clearer, more personalised guidance, even when that means reading, editing, and discussing the feedback in detail. That has been a quiet but important lesson for me. In practice, the value of judgement in formative assessment seems to lie less in efficiency than in how well it helps students see what to do next.
When sophistication becomes risky
There is a more troubling consequence of noise in marking that teachers are often reluctant to say out loud. At the very top end, the students who think most deeply can also be the most vulnerable to misjudgement. I have seen essays that are conceptually ambitious, carefully qualified, and genuinely thoughtful receive surprisingly modest marks. Not because the thinking is weak, but because it does not align neatly with the mechanics of a mark scheme or the expectations of a particular reader.
This is especially pronounced in subjects like economics, where strong answers often resist simplification. The best responses tend to be conditional rather than absolute, exploratory rather than formulaic. They acknowledge trade-offs, challenge assumptions, and follow lines of reasoning that are not always easy to summarise quickly. That kind of thinking is exactly what we want to cultivate, and yet it can be fragile under assessment systems designed to reward clarity, coverage, and recognisable structures.
Over time, teachers learn this lesson and become more cautious. Nuance is signposted more heavily than it would be outside an exam hall. Evaluation is taught in safer, more predictable ways. Not because we believe this is intellectually superior, but because we are trying to protect students from the consequences of being misunderstood. At this point, marking noise stops being an abstract issue and becomes a pedagogical one. We are forced into uncomfortable compromises between teaching the subject in its richest form and teaching students how to survive the exam.
Teaching for the world beyond the exam
This tension becomes sharper when we step back and consider the wider purpose of education. Initiatives such as the Gatsby Benchmarks encourage teachers to help students see subjects as pathways into further study and employment, not just as collections of examinable techniques. Economics, in particular, lends itself to this approach. It is inherently contextual, rooted in real-world policy choices, business decisions, and trade-offs that rarely resolve themselves neatly.
Having come into teaching after working as a professional economist, I value this applied dimension deeply. Bringing real-world context into lessons makes the subject feel honest and alive. But it also increases the risk under assessment. The more students are encouraged to draw on contemporary examples, qualify their arguments, and explore competing explanations, the further their answers can drift from the tidy structures that examiners are trained to recognise quickly. What looks like sophisticated applied reasoning to one reader can look like digression or imprecision to another.
As a result, teachers often throttle this back. Context is rationed. Examples are chosen for safety rather than richness. This is not because we doubt the value of real-world thinking, but because we are acutely aware of how easily it can be misread under the pressure and constraints of large-scale marking. Navigating that gap becomes part of the hidden curriculum of teaching: knowing when to open the world up, and when to narrow it again.
AI as a training lens, not a ceiling
One way of responding to these tensions, without abandoning ambition or pretending the exam system is something it is not, is to rethink the role AI might play in formative assessment. Rather than treating AI as a judge, it is more useful to see it as a training lens: a way of helping students understand how their thinking is likely to be read under exam conditions.
AI is well suited to this role precisely because it is literal. It rewards what is made explicit rather than what is implied. It cannot infer generosity or fill in gaps. When it struggles to credit an otherwise thoughtful answer, it is often because the student has not yet learned how to signal their reasoning clearly enough for a time-pressured reader.
Used carefully, this can be powerful. Students can experiment with structure and phrasing and see how those changes affect the feedback they receive. They learn that sophisticated thinking is not enough on its own. It has to be made visible. Evaluation needs to be named. Assumptions need to be surfaced. Conclusions need to be tied back explicitly to the question.
This does not require teachers to dilute the subject or retreat from real-world thinking. On the contrary, it allows students to retain intellectual ambition while learning how to translate it into exam-legible form. The aim is not to make answers more formulaic, but more survivable. In that sense, AI can function as a rehearsal audience: consistent, unsympathetic, and therefore useful.
Why precision is the wrong obsession
Taken together, these threads point to a quieter but more important question. In noisy domains, precision is often the wrong obsession. Marks begin to look like measurements rather than judgements, and small differences take on an importance they do not really deserve. A script marked 18 rather than 20 feels as though something objective has gone wrong, even when both marks sit comfortably within the range of reasonable professional judgement.
This is understandable. Numbers feel solid. But in complex domains, precision can mislead. A long essay is not a temperature reading. Treating marks as exact encourages false certainty and distorts the conversations we have with students.
As argued in The Tyranny of Metrics, by Jerry Z. Muller (2018), problems arise not when we measure things, but when we mistake measurement for meaning.
Teaching honestly within imperfect systems
None of this is an argument for abandoning exams, mark schemes, or professional judgement. Nor is it a claim that AI has solved the problems of assessment. It has not. What it has done is make some long-standing tensions harder to ignore. It has exposed the variability in marking, highlighted the fragility of sophisticated answers, and challenged the certainty we so often took for granted.
As teachers, we already work within imperfect systems. We balance ambition against safety, authenticity against legibility. AI does not remove those compromises, but it can make them more visible and therefore more discussable. Used carefully, it can help shift attention away from the illusion of the "right" mark and towards clearer expectations, more transparent reasoning, and feedback that genuinely helps students improve.
The aim was never perfect marking. The aim is to teach students to think well, communicate clearly, and navigate the assessments they face with their eyes open. Accepting uncertainty in judgement is not a failure of standards. It is an honest recognition of what teaching, and learning, in complex subjects actually involves.
Gary Roebuck is an Economics teacher and founder of Teach Edge.