The experiment looked almost like a prank: give an official baccalauréat essay question to OpenAI’s chatbot, hand the result to a real teacher, and wait for the verdict. The script that came back seemed clean, organised and confident. The mark did not match that first impression.
A perfectly formatted essay that rings hollow
The test was organised by regional channel France 3 Hauts‑de‑France during the 2025 philosophy baccalauréat. Journalists picked a genuine exam prompt: “Is truth always convincing?” ChatGPT was instructed to answer as if it were a French sixth‑form pupil aiming for a solid pass.
On the page, the result ticked every formal box. The chatbot produced a neat introduction, a three‑part development, and a conclusion. Sentences were fluid. Spelling was flawless. Markers such as “firstly” and “on the other hand” were placed exactly where a teacher would expect them.
Seen from afar, ChatGPT’s paper looked like the dream script of a nervous candidate: tidy, articulate and reassuringly structured.
The journalists then sent the anonymous essay to a philosophy teacher, without revealing that it came from an AI. The paper was marked like any other end‑of‑school exam script. Once the red pen had done its work, the illusion crumbled: ChatGPT scored only 8 out of 20, well below the 10 out of 20 that counts as a pass.
The teacher’s comments focused less on style than on substance. Beneath the polished surface, the reasoning was judged shallow, repetitive and strangely mechanical. The script read as if it knew what a philosophy essay should look like, but not what it should actually say.
When the question quietly changes meaning
The first major issue flagged by the marker concerned the handling of the question itself. The original prompt – “Is truth always convincing?” – asks whether truth, by its nature, has the power to persuade. ChatGPT subtly shifted this to another problem: “Is truth enough to convince?”
That small twist matters a lot in an exam context. In French philosophy marking, reformulating the question precisely is almost half the battle. It shows that the student has grasped the tension behind the wording.
By bending the question, the chatbot avoided part of the difficulty – and lost points for drifting away from the examiner’s intent.
Once the topic changed, even slightly, the rest of the essay followed the wrong path. The arguments no longer answered the exact demand. For a human pupil, that kind of slip usually comes with hard‑earned awareness: they sense something is off. An AI model does not feel that discomfort; it simply keeps generating text that “sounds right”.
A visible plan and invisible thinking
The teacher also criticised the essay’s structure. On paper, the organisation was impeccable: clearly separated sections, introductory phrases, and a final “opening” to related issues. In practice, the plan felt like a template applied from the outside rather than the result of inner thought.
Each paragraph looked like a self‑contained block, with little genuine progression from one idea to the next. Transitions were formal, not logical. The marker described a sequence of points rather than a flowing argument, laid out along the textbook pattern:
- Thesis: truth should convince by definition
- Antithesis: truth sometimes fails to persuade
- Synthesis: other factors play a role in persuasion
This classic three‑step structure is often taught in French schools. ChatGPT reproduced it almost too faithfully, as if ticking boxes in a manual. What was missing, according to the teacher, was the personal way a student usually bends or reorders that framework when they genuinely wrestle with a problem.
Examples without depth, concepts without definitions
Another weakness concerned the handling of philosophical notions. The essay mentioned ideas such as “truth”, “opinion” and “reason”, yet barely defined them. In a philosophy exam, clarifying these terms is a central task. It shows that the candidate understands that concepts are not just words, but tools with precise contours.
The AI dropped references and examples like name‑checks, without stopping to unpack what they meant or how they really supported the argument.
In the marker’s comments, examples were often generic, sometimes clichéd. They were placed at the end of paragraphs like decorative proof, not examined in detail. A human pupil, even a struggling one, tends to linger on an example that speaks to them – a personal anecdote, a news story, a film. That small detour can give a script a distinctive tone. The chatbot’s essay sounded interchangeable with the thousands of others it could produce on command.
What this tells us about current AI limits
This is not the first time AI systems have been asked to sit school exams. Language models have already produced essays for UK GCSEs, American college assignments and various national tests. Often, they score somewhere around the pass mark in content‑heavy subjects, and higher when the marking favours formal clarity over originality.
Philosophy presents a tougher test. The discipline rewards doubt, hesitation and conceptual risk‑taking. It asks the candidate to question the question itself, to point out ambiguities or hidden presuppositions. ChatGPT can imitate this attitude with phrases that sound reflective, but the teacher who marked the script felt no genuine interrogation behind the words.
The result underlines a structural limit. Large language models are trained on patterns in text. They are experts in producing coherent sequences of sentences. That skill aligns well with the “essay format”, but not necessarily with the underlying activity of thinking. The model connects phrases that often go together in its training data. It does not check those connections against a lived experience of doubt or discovery.
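To make that point concrete, here is a deliberately toy illustration – a word‑level bigram sampler in Python. It is nothing like a modern transformer, and the tiny corpus is invented for the example, but it shows how text that chains statistically familiar phrases can sound locally fluent while carrying no model of meaning at all.

```python
import random
from collections import defaultdict

# Toy "training data": the three-step plan from the AI's own essay.
corpus = (
    "truth should convince by definition . "
    "truth sometimes fails to persuade . "
    "other factors play a role in persuasion ."
).split()

# Record which word tends to follow which.
follows = defaultdict(list)
for prev, nxt in zip(corpus, corpus[1:]):
    follows[prev].append(nxt)

def generate(start: str, length: int = 10) -> str:
    """Chain words by repeatedly picking a statistically plausible successor."""
    words = [start]
    for _ in range(length):
        options = follows.get(words[-1])
        if not options:
            break
        words.append(random.choice(options))
    return " ".join(words)

print(generate("truth"))
# Possible output: "truth sometimes fails to persuade . truth should ..."
# Locally coherent, globally empty: the sampler has no idea what
# "truth" means, only which words have followed it before.
```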
Why good writing is not enough in philosophy
The experiment also highlights a tension teachers already face with human pupils: the gap between style and thought. Some teenagers master rhetorical tricks, transitions and introductions. They know how to sound serious. Yet their essays can feel empty once you read beyond the first page.
Philosophy teachers do not just assess how well students write; they look for a mind at work – hesitating, correcting itself, pushing an idea further.
ChatGPT nailed the outer shell of that performance, not the inner movement. Its essay delivered a safe, balanced answer, carefully avoiding strong claims that might be wrong. That strategy often keeps marks from plummeting, but it rarely leads to the high grades awarded to bold, well‑argued scripts.
The teacher who graded the AI suggested that an average sixth‑form pupil, even an anxious one, might have done better. A teenager can lean on intuition: a vague sense that the assumption behind the question – that truth always convinces – clashes with everyday experience of lies, manipulation and stubborn denial. From there, they can build arguments shaped by their own encounters. The chatbot has no such background, only text it has statistically absorbed.
What “8 out of 20” means in the French system
For readers outside France, the grade itself deserves a quick explanation. The baccalauréat is marked out of 20. A 10 usually means a basic pass. Marks between 12 and 14 are considered decent. From 16 upwards, you enter the territory of very strong copies.
| Score /20 | Rough meaning in philosophy marking |
|---|---|
| 5 or below | Misunderstood question or almost no argument |
| 8 | Some structure and ideas, but weak grasp of the problem |
| 10–12 | Correct, conventional script with clear but limited reasoning |
| 14–16 | Strong analysis, relevant references, clear personal stance |
| 17–20 | Rare scripts that combine rigour, originality and depth |
With an 8, ChatGPT would probably not fail the exam as a whole, since the French system lets stronger marks in other subjects compensate. Yet in philosophy, where many students aim for at least a respectable 10 or 12, it would not be seen as a success story.
Implications for students tempted to “outsource” their essays
The France 3 experiment lands at a tricky moment for schools. Teachers across Europe and North America already suspect some pupils are using AI tools to draft homework or even take‑home exams. The idea of asking a chatbot to handle a philosophy essay is understandably tempting for a teenager staring at a blank page.
This case sends a mixed signal. Yes, ChatGPT can produce something that looks like a decent copy in seconds. No, that does not guarantee a good grade when a specialist reads closely. More than that, relying on such help carries risks that go beyond marks.
- Students may stop practising the slow, frustrating work of building their own arguments.
- They can lose confidence in their ability to write imperfect but genuine texts.
- Teachers may respond by tightening surveillance, eroding trust in the classroom.
Some educators suggest a middle path: treating AI as a brainstorming partner rather than a ghostwriter. A pupil might ask a chatbot for definitions of “truth” in different philosophical traditions, then use that information critically, checking sources and building their own stance. In that scenario, the mark reflects how they select, adapt and challenge what the tool offers.
Beyond the bac: what counts as “thinking” for machines?
The modest 8 out of 20 score also feeds a broader debate about artificial intelligence. When people say “ChatGPT can think”, they often mean that it produces text that resembles thought. The baccalauréat essay reminds us that looking like thought and actually thinking are not the same thing.
To make that distinction clearer, some researchers use the terms “syntactic” and “semantic”. Syntactic abilities deal with form: grammar, structure, typical phrases that seem logical. Semantic abilities concern meaning: how ideas connect to reality, experience and action. Large language models excel syntactically. Their semantic grip is more fragile, especially in areas like philosophy, where reality is not just physical but conceptual.
Future AI systems may narrow that gap, perhaps by integrating other kinds of data or reasoning modules. For now, a 2025 French teacher with a stack of philosophy essays in front of them can still tell the difference between a teenager wrestling with a question and a chatbot arranging familiar sentences. The red pen, at least for the moment, remains stubbornly human.
