Why raw AI chatbots can mislead when you ask them to mark exam essays
ChatGPT-style tools can give confident, plausible feedback. The problem is that they can over-mark exam essays because they don't reliably apply level-of-response mark schemes. Here's a simple Edexcel A Level Economics experiment that shows why.
Quick Summary
- Raw chatbots can recognise "this sounds like Economics" but still misapply Edexcel's level-of-response rules.
- A mark that's a few bands too high creates false confidence, even when the feedback reads well.
- For 25-mark essays, undeveloped chains, decorative concepts, and thin evaluation usually cap marks quickly.
- Use chatbots as a writing coach. Ask for missing chains and underdeveloped evaluation, not a headline mark.
Raw AI chatbots can be misleading when you ask them to "mark" exam essays
A lot of teachers are experimenting with ChatGPT-style tools for quick feedback. I get it. They're fast, they sound confident, and the feedback often reads well.
But there's a problem. Raw AI chatbots can be genuinely misleading when you ask them to mark exam answers, even if you tell them the exam board and the question style.
So we ran a simple experiment.
The experiment (Edexcel A Level Economics, Paper 1, 2024)
We took a real exam response from Edexcel A Level Economics Paper 1 (2024), one that had already been marked by the chief examiner in the examiner materials.
Question 7 (25 marks):
A small hotel in Scarborough has seen its energy bills increase from £2000 to £8000 per month. Small businesses do not have the energy management teams of larger companies to negotiate better deals.
Evaluate the microeconomic effects of rising energy bills on the hotel industry or an industry of your choice.
(Total for Question 7 = 25 marks)
Then we did three things:
- We told ChatGPT it was Edexcel A Level Economics, a 25-mark "evaluate" essay, and asked it to mark the answer out of 25 and give feedback.
- We ran the same answer through Teach Edge.
- We compared both to the chief examiner's mark.
Copyright note (important)
The question, extract, student response, and examiner materials are Pearson/Edexcel copyrighted (and the extract itself is adapted from a third-party source). To respect that, we're not reproducing the student answer here. Not as screenshots, and not typed up verbatim.
Instead, I summarise what the student did and focus on the marking principles.
The headline result
- ChatGPT mark: 17/25
- Chief examiner mark: 11/25
- Teach Edge mark: 11/25
This matters because a student reading "17/25" is likely to think:
"I'm comfortably into a decent band. Just a few tweaks and I'm flying."
Whereas "11/25" tells a different story:
"You've got relevant ideas, but they aren't developed in an exam-creditworthy way yet."
That gap is the whole point.
What the student did (in plain English)
The student's answer was sensible and on-topic. In summary, they:
- Explained that rising energy bills increase hotels' costs, reducing profits, and potentially pushing some firms (especially small independents) closer to closure.
- Suggested hotels might respond by raising prices, and referenced price elasticity of demand (PED) as a factor influencing how demand responds.
- Added a demand-side point. If households face higher bills too, disposable income falls, so demand for hotel stays may drop.
- Included Scarborough-style context and a brief comment about uncertainty/imperfect knowledge.
So why did this cap at 11/25?
Why this capped at 11/25 (3 reasons)
Edexcel 25-markers are marked against level-of-response descriptors. This answer had relevant ideas, but it didn't develop them far enough to climb the levels. In practical terms, it capped for three reasons.
1) The chains of reasoning stop too early
The student's core line is essentially:
Energy bills rise → costs rise → profits fall
That's correct, but it's only the opening link.
To access higher levels, the answer needs to unpack the mechanism step-by-step, for example:
- Higher energy bills → AC/MC rise (or short-run supply shifts up)
- At the current market price, profit per room falls
- Some firms move closer to break-even/shutdown
- Longer-run: exit reduces market supply, changing outcomes over time
You don't need loads of theory. You need the theory you use to be properly developed.
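If it helps to see that mechanism with numbers, here's a rough sketch. The £2,000 and £8,000 energy bills come from the extract; every other figure (rooms sold, room rate, other costs) is invented purely to show the direction of travel.

```python
# Illustrative sketch only. The £2,000 -> £8,000 energy bills are from the
# extract; rooms sold, room rate and other costs are invented numbers.

rooms_sold_per_month = 600   # assumption: a small hotel's monthly room sales
price_per_night = 90         # assumption: average room rate (£)
other_cost_per_room = 70     # assumption: non-energy cost per room sold (£)

for energy_bill in (2_000, 8_000):
    energy_cost_per_room = energy_bill / rooms_sold_per_month
    profit_per_room = price_per_night - other_cost_per_room - energy_cost_per_room
    print(f"Energy bill £{energy_bill:,}: profit per room £{profit_per_room:.2f}")

# Energy bill £2,000: profit per room £16.67
# Energy bill £8,000: profit per room £6.67
```

Same price, same demand, but the margin per room has more than halved. That is exactly the "closer to break-even/shutdown" step the chain needs to spell out.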
2) Key concepts are mentioned, but not used to "do work"
The answer name-checks PED, which is promising. But it doesn't follow the logic through.

Here, PED shouldn't be a decorative term. It needs to do some work, for example:
- If hotels raise prices and demand is elastic, total revenue falls
- If demand is inelastic, total revenue may rise (so costs can be passed on more successfully)
That's the difference between "I know elasticity exists" and "I can apply it to evaluate the impact".
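If you want to see why the direction flips, here's a rough sketch using the standard approximation %ΔQ ≈ PED × %ΔP. The elasticity values and the 10% price rise are invented for illustration; nothing here comes from the student's answer.

```python
# Illustrative sketch only: invented PED values and a 10% price rise,
# showing why elasticity decides whether a price rise raises or cuts revenue.

def revenue_change_pct(ped: float, price_rise_pct: float) -> float:
    """Approximate % change in total revenue after a price rise,
    using the small-change approximation %dQ = PED * %dP."""
    quantity_change_pct = ped * price_rise_pct
    new_revenue_index = (1 + price_rise_pct / 100) * (1 + quantity_change_pct / 100)
    return (new_revenue_index - 1) * 100

print(revenue_change_pct(ped=-1.8, price_rise_pct=10))  # elastic demand: about -9.8 (revenue falls)
print(revenue_change_pct(ped=-0.4, price_rise_pct=10))  # inelastic demand: about +5.6 (revenue rises)
```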
3) Evaluation is brief and not weighed into a judgement
There are evaluative hints (PED, uncertainty). But evaluation at higher levels usually needs:
- at least two developed strands, and
- a conclusion that weighs them to answer "evaluate / to what extent?"
Typical strands here could be:
- Short run vs long run (limited adjustment now vs investment/efficiency later)
- Small firms vs chains (bargaining power, economies of scale, cash buffers)
- Strength of demand-side effects (income elasticity, substitutes like self-catering options)
Without that weighing and judgement, it tends to sit around Level 2 evaluation.
So why did ChatGPT give 17/25?
ChatGPT's feedback read well. It praised:
- Relevant knowledge
- Clear application to hotels
- The presence of context
- PED as an evaluative factor
All fair positives.
The issue is that it then converted those positives into a high level, even though the response didn't do enough of what Edexcel rewards in the top half of the mark range.
This is the trap. Raw chatbots are very good at recognising "this sounds like Economics", and much weaker at consistently applying the tougher constraints of a level-of-response mark scheme.
How to use chatbots safely (without misleading students)
If you're going to use a raw chatbot, I'd treat it as a writing coach, not a marker. A few ways to keep it useful and safe:
- Don't ask it for a mark first. Ask it: "What are the missing chains?" and "Which evaluation strands are underdeveloped?"
- Ask for one concrete upgrade per paragraph (rather than "improve the whole essay").
- Force it to be specific: "State the model shift, the direction, and the consequence."
- If it claims a high band, follow up with: "What would an examiner say is missing for Level 4/5?"
- Use it to generate possible evaluation angles, but you still decide whether the answer actually develops them.
Used that way, it can be genuinely helpful. Used as a marker, it can create false confidence.
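If you'd rather not retype those questions every time, a reusable template is one option. Here's a rough sketch in Python; the wording is adapted from the list above, and the question/response placeholders are exactly that, placeholders you fill in yourself.

```python
# A rough sketch of a "writing coach, not marker" prompt template.
# The coaching questions mirror the list above; the placeholders are filled
# with whatever question and (anonymised) student response you're using.

COACHING_PROMPT = """You are helping a student improve an Edexcel A Level Economics
25-mark 'evaluate' essay. Do NOT give a mark or a level.

1. Which chains of reasoning stop too early? Complete one of them step by step.
2. Which evaluation strands are underdeveloped? Develop one of them.
3. Suggest one concrete upgrade per paragraph, stating the model shift,
   the direction, and the consequence.

Essay question:
{question}

Student response:
{response}
"""

prompt = COACHING_PROMPT.format(
    question="Evaluate the microeconomic effects of rising energy bills ...",
    response="<paste the anonymised student answer here>",
)
print(prompt)  # paste into your chatbot of choice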
A quick note on Teach Edge (and honesty)
We're not claiming Teach Edge is perfect. No marking system is, and exam marking itself has judgement in it.
The point is that Teach Edge is built and continuously refined to track the exam board's level descriptors rather than the "sounds-right" instincts of a general chatbot. It also aims to make the reasons for mark ceilings clear.
In this example, Teach Edge landed on the same 11/25 as the chief examiner because the answer, while sensible, didn't develop its chains and evaluation far enough to climb levels.
That can feel harsh. But it's also where real progress starts.
The real takeaway
The question isn't "Does the feedback sound good?"
It's: "Does this feedback reflect how marks are actually awarded?"
Try Teach Edge
If you want AI feedback that's designed around exam-board marking (and a workflow where teachers stay in control), you can explore Teach Edge here: