What 200,000 marking decisions taught me about how we mark essays
After two years building an AI marking platform used by 400+ teachers, I've noticed consistent patterns in how marks get adjusted, where judgement calls cluster, and what actually helps pupils improve. The biggest surprise wasn't about accuracy. It was about feedback and time.
Quick Summary
- Teachers most often nudge the mark by 1–2 points, but rarely rewrite the bulk of good, structured feedback.
- Disagreement clusters around Application, because thresholds for "enough context" vary even among experienced markers.
- The middle band is where most reasonable disagreement lives, and that is normal in essay assessment.
- Draft-and-review reduces fatigue-driven generic comments, because editing is easier than starting from scratch.
- The biggest win is speed-to-feedback, which makes revision loops more likely and more effective.
After two years of building an AI marking platform used by 400+ teachers, I've started to notice patterns in how essays get marked, by humans and by AI, that I didn't expect.
The biggest surprise? It's not really about the marks.
When I started building Teach Edge, I assumed the hard part would be "accuracy": get the AI to land on the same score a teacher would give, and you've solved the problem.
But fairly quickly you run into a more awkward question:
What does "accurate" even mean when experienced markers can disagree on a perfectly reasonable mid-range script?
Over time, as more teachers used the platform, I began to see the same behaviours repeating. Not in any one class or school, but as broad patterns across hundreds of thousands of teacher review decisions: small mark adjustments, edits to feedback, and where disagreements cluster.
A quick note on what I mean by "marking decisions". What follows is based on aggregated, anonymised patterns from how teachers interact with AI-drafted marking on Teach Edge. Things like how often marks get nudged, where feedback gets edited, and which assessment objectives trigger the most debate. This isn't analysis of individual pupils or identifiable work, and nothing here could be traced back to a specific student, class, teacher or school.
Here are six things I've learned.
1) Teachers tweak marks. They don't rewrite feedback.
When teachers review AI-drafted marking, the most common change is a small nudge to the mark: up a point, down a point, occasionally two.
That part is expected. Essay marking involves judgement, and reasonable people disagree.
What surprised me was the other pattern: in most cases, teachers don't rewrite the feedback from scratch. They might tidy a phrase, add a subject-specific detail, or soften a line, but the bulk of the feedback often goes out largely intact.
That sounds minor. It isn't.
Because the feedback (the specific, criteria-linked, actionable bit) is what actually moves a student forward. Not whether it's a 14 or a 15.
And feedback is exactly the part that gets squeezed when time is tight. When you're marking 30 essays by hand, the first casualty isn't the mark. It's the quality of what you write underneath it. "Good analysis, develop evaluation" repeated fifteen times isn't really feedback. It's survival.
So when teachers are consistently comfortable releasing detailed draft feedback, it tells me something important:
It's not that teachers don't care about feedback. It's that, when the feedback is already specific and structured, they can finally do the job they always wanted to do without spending their whole evening writing from a blank box.
The mark gets the attention. The feedback does the work.
2) "Good application" is where the arguments happen.
Not all assessment objectives are equally easy to mark.
Knowledge is usually straightforward to credit: the student either demonstrates it or they don't. Analysis is often clear enough. Evaluation can be slippery, but markers tend to converge once they see a sustained line of reasoning.
Application is different.
This is the bit where mark schemes reward students for using context: real-world examples, case study detail, the specifics that show they're not writing a generic answer.
And in practice, it's the most subjective judgement in the whole process.
How much context is enough? Does naming "the UK economy" count, or does it need a specific policy, firm, market, or event? Is a relevant reference still "application" if it's undeveloped?
This is where AI and teachers often diverge. AI can be a little too generous: it sees a relevant reference and wants to reward it. Teachers are often more sceptical: "Yes, but it's thin. It's not doing any work yet."
Neither view is automatically "wrong". They're applying the same mark scheme with different thresholds for what counts as strong application.
The useful part is what happens next.
When the feedback clearly explains why the application is being credited (or not), and what would improve it, the student learns regardless of whether the mark falls on one side of a boundary or the other.
A line like this is worth more than a mark ever will be:
"Your reference is relevant, but it's still general. Next time, anchor it in a specific example and show how it strengthens your argument."
That's what students can act on.
3) Most disagreement lives in the middle band, and that's normal.
Top-band scripts are often easier to agree on. Bottom-band scripts are often easier too. The messy area is the middle: the "nearly there" answers, the ones sitting between two levels, the 12–16 out of 25 zone where there's genuine room for interpretation.
This is where teachers diverge most from each other, and it's often where AI and teachers diverge too.
That isn't a scandal. It's a feature of essay assessment.
Once you accept that a point either way in the middle band is normal (not because anyone is careless, but because the mark scheme leaves room for professional judgement), it changes what you optimise for.
You stop agonising over whether it's a 13 or a 14 and start asking a better question:
Does the feedback tell the student what to do next?
Which brings us straight back to Insight 1.
4) Fatigue makes feedback generic, and most of us don't notice it happening.
This isn't an AI insight. It's a human one.
In longer marking sessions, the quality of what we write declines. Comments get shorter. They become more generic. The same phrases appear again and again. The gap between the feedback on essay three and essay twenty-eight can be significant.
Not because the teacher cares less, but because fatigue changes how we think.
You sit down with decent intentions, mark carefully for the first hour, and then slowly shift into a mode where you're reading faster, writing less, and convincing yourself that "more detail needed" is self-explanatory.
A draft-and-review workflow changes the dynamics.
Editing is cognitively easier than creating. It's easier to improve a paragraph of feedback than to write one from nothing at 10pm. You're not staring at a blank comment box for the thirtieth time. You're reacting to something concrete.
The AI doesn't get tired on essay twenty-eight. The teacher still has the final say, but they're working with something rather than starting from nothing. Multiply that across a full class set and it matters.
5) The biggest gain isn't perfect accuracy. It's speed-to-feedback.
Before I built Teach Edge, I assumed the main value would be: "AI marks as well as you do."
Mark accuracy matters. It has to be defensible, and students deserve consistency. I'm not dismissing that.
But the most visible change, in how teachers actually use the platform, isn't about microscopic precision. It's about time.
The gap between setting an essay and returning feedback shrinks dramatically. Days rather than weeks. Sometimes the same day.
For students, that difference is huge.
Feedback that arrives while they still remember what they wrote (why they made that argument, what they were trying to say, where they got stuck) connects to something real. Feedback that arrives three weeks later often lands on a student who has mentally moved on. They look at the mark, file it, and forget the comments.
Fast, specific feedback tends to do more for learning than a "perfect" mark delivered too late to matter.
And the uncomfortable truth is: speed has always been the bottleneck. Not teacher expertise.
6) When the feedback gets better, students start asking for more of it.
This one genuinely surprised me.
Once students start receiving consistently detailed, criteria-linked feedback, something shifts. They begin asking if they can use the tool themselves.
Not to generate essays. Not to cheat.
To practise.
They want to write an answer, submit it, get quick feedback, see where they dropped marks, and try again. Without waiting two weeks for a class set to come back, and without feeling like they're being a nuisance by asking for extra marking.
That shift matters.
It turns feedback from something that happens to students on the teacher's timetable into something students actively seek out as part of revision.
It starts to look less like "marking" and more like a revision loop:
Write → feedback → improve → try again.
The teacher still sets the tasks. The teacher still reviews what matters. Professional judgement stays with the teacher.
But students get to practise more, because every extra attempt doesn't automatically mean extra teacher hours.
I've seen students voluntarily writing additional essays. Not because they were told to, but because the feedback loop made it feel worthwhile.
I didn't expect that when I started building a marking tool.
What this changes about how I think about marking
I used to think the goal was a precise mark. I still think marks matter. They need to be defensible, and students deserve fairness.
But accuracy was never the main constraint.
Time was.
Teachers have always had the expertise to mark well. They've always known what strong application looks like, what a sustained evaluation needs, and where a student's analysis falls short.
The problem wasn't judgement. It was capacity.
Thirty scripts. A Sunday evening. A stack that never quite disappears. And the knowledge that the bit that actually helps students, the feedback, would be the first thing to get compressed.
What's changed isn't teacher expertise. It's the ability to get high-quality feedback in front of students quickly and consistently.
And when the feedback is good enough that students start seeking it out for themselves, not because they're told to, but because it genuinely helps them improve, that feels like a more fundamental shift than any debate about whether an essay should be a 13 or a 14.
If your students have ever asked for more feedback (or you've noticed changes when turnaround time improves), I'd genuinely be interested to hear what you've seen, especially where you still don't trust AI-assisted marking, and why.