I ran a blind experiment with 50 participants, 20 text samples, and one question: was this written by a human or an AI? The results challenged everything I assumed about how detectable machine-generated text actually is.
Why I Ran This Experiment
Last fall, I got into an argument at dinner about whether anyone can actually tell AI writing from human writing. A friend who teaches college English insisted he could spot ChatGPT output “within two sentences.” A software engineer at the table said the distinction was already meaningless. A freelance journalist said she could tell but most people could not. Everyone had an opinion. Nobody had data.
So I decided to get some. Over three weeks in January 2026, I assembled 50 participants and showed each of them the same 20 text samples. Ten were written entirely by humans. Ten were generated by AI models — GPT-4o, Claude 3.5 Sonnet, and Gemini 1.5 Pro. The participants had to label each sample as “human” or “AI.” No time limit. No tricks. Just read and decide.
This was not a rigorous academic study. I did not control for every variable, and the sample size is modest. But the results were consistent enough, and surprising enough, that they are worth reporting honestly. They also align with what published research has found at larger scales, which gives me confidence the patterns are real.
A 2024 study published at the ACM Conference on Fairness, Accountability, and Transparency found that people could not distinguish GPT-4 from a human in a Turing test format — GPT-4 was judged to be human 54% of the time, approaching the 67% rate for actual humans. My results tell a similar story, with some additional wrinkles about who does better and why.
How the Experiment Worked
I recruited 50 participants through personal networks, social media, and a local coworking space. I deliberately sought a mix of backgrounds: 14 worked in technology or software, 9 were writers or editors, 8 were academics, and 19 came from fields unrelated to either writing or technology. Ages ranged from 22 to 61. All were native English speakers.
The 20 text samples covered five genres: news reporting, opinion essays, product reviews, personal narratives, and technical explanations. Each genre had two human-written samples and two AI-generated samples. Every sample was between 280 and 350 words. I edited the AI outputs lightly for factual accuracy but did not alter their style, tone, or sentence structure. The human-written samples came from published sources that were unlikely to appear in AI training data — university newspapers, small-circulation magazines, and personal blogs.
Participants completed the task individually on a simple web form. They could read each sample as many times as they wanted, then selected “Human” or “AI” and optionally typed a short explanation of their reasoning. The whole exercise took most people between 25 and 40 minutes.
The key metric was accuracy: what percentage of the 20 samples did each participant label correctly? Random guessing would produce 50%. A perfect detector would score 100%.
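For concreteness, here is a minimal sketch of that scoring in Python. The sample IDs, labels, and toy responses are hypothetical stand-ins, not my actual data.

```python
# Minimal scoring sketch. "truth" maps each sample ID to its real origin;
# a participant's response maps the same IDs to their guesses.
truth = {"s01": "human", "s02": "AI", "s03": "human"}  # ...20 samples in the real task

def accuracy(guesses: dict[str, str]) -> float:
    """Fraction of samples this participant labeled correctly."""
    correct = sum(guesses[sample] == origin for sample, origin in truth.items())
    return correct / len(truth)

participant = {"s01": "human", "s02": "human", "s03": "AI"}  # one participant's guesses
print(f"{accuracy(participant):.0%}")  # 1 of 3 correct in this toy example: 33%
```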
The Results, by the Numbers
The average accuracy across all 50 participants was 52.4%. That is barely above chance. Put another way: the typical person in my experiment would have done almost exactly as well by flipping a coin for each sample.
But the average masks a wide distribution. Scores ranged from 35% (seven out of twenty correct, worse than random) to 80% (sixteen out of twenty). The distribution was skewed rather than bell-shaped: most people clustered between 45% and 60%, with a small group performing markedly better.
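A quick way to see how weak 52.4% is: naively pool all 50 × 20 = 1,000 judgments (524 of them correct) and run a binomial test against chance. This is only a sanity check, since judgments from the same participant are not independent, but even this generous model cannot distinguish the result from coin-flipping. A sketch, assuming SciPy is available:

```python
from scipy.stats import binomtest

# Pool all 1,000 judgments (50 participants x 20 samples), 524 correct.
# Treating them as independent coin flips overstates the evidence, and
# even so the pooled result is not distinguishable from chance.
result = binomtest(k=524, n=1000, p=0.5, alternative="two-sided")
print(f"p = {result.pvalue:.3f}")  # roughly 0.14
```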
Here is where it gets interesting. When I broke the results down by background, clear patterns emerged.
| Participant Group | Count | Avg. Accuracy | Best Score | Worst Score |
|---|---|---|---|---|
| Writers and editors | 9 | 58.9% | 80% | 45% |
| Frequent AI users (daily) | 11 | 63.2% | 80% | 50% |
| Tech workers (non-AI) | 14 | 51.8% | 65% | 35% |
| Academics | 8 | 50.6% | 60% | 40% |
| Other professions | 19 | 48.4% | 65% | 35% |
| All participants | 50 | 52.4% | 80% | 35% |

(The frequent-AI-user row cuts across the four profession rows, so those 11 participants are also counted in their profession's row.)
The single strongest predictor of accuracy was not profession, education, or age. It was how frequently the participant used AI tools themselves. The 11 people who reported using ChatGPT, Claude, or similar tools daily averaged 63.2% accuracy. Everyone else averaged 49.1%. This aligns with a January 2025 study posted to arXiv that found frequent ChatGPT users are “accurate and robust detectors” of AI-generated text, with a majority vote among five expert annotators misclassifying only 1 out of 300 articles.
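If you want to check a split like this on your own data, a rank-based test is a reasonable fit for small, non-normal samples. The scores below are hypothetical placeholders, not my participants' actual results.

```python
import numpy as np
from scipy.stats import mannwhitneyu

# Hypothetical per-participant accuracies (fraction of 20 correct),
# split by self-reported daily AI use. Placeholder values only.
daily_users = np.array([0.80, 0.70, 0.65, 0.60, 0.55])
everyone_else = np.array([0.55, 0.50, 0.50, 0.45, 0.40, 0.35])

# One-sided Mann-Whitney U: are daily users' scores systematically higher?
stat, p = mannwhitneyu(daily_users, everyone_else, alternative="greater")
print(f"U = {stat}, p = {p:.3f}")
```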
The people who use AI every day have developed an intuition for its patterns. They know what the defaults feel like. Everyone else is guessing.
What Gave AI Writing Away (When Anything Did)
I asked participants who scored above 60% to describe their reasoning. Their explanations revealed a consistent set of tells — subtle patterns that AI text tends to exhibit and that attentive readers can learn to spot.
Uniform sentence rhythm. Several high-scoring participants noted that AI text tends to alternate between medium and long sentences in a predictable cadence. Human writers are messier. They write sentence fragments. They start three sentences in a row with “I.” They let paragraphs run too long or cut them oddly short. AI text reads like it was produced by someone who internalized every style guide but never broke a rule on purpose.
Hedging that sounds diplomatic rather than uncertain. When AI models qualify a statement, they use phrases like “it is worth noting that” or “while this is generally true, there are exceptions.” Humans hedge differently. They say “I think” or “I’m not sure about this, but” or just state the thing without qualification and let the reader decide. AI hedging feels institutional. Human hedging feels personal.
Perfect logical structure without digressions. AI-generated opinion pieces followed a clean thesis-evidence-conclusion arc. Human opinion writing wanders. It makes asides. It circles back to a point from three paragraphs ago. One participant put it well: “The AI pieces felt like they were written by someone who never gets distracted. Nobody writes like that.”
Absence of specificity that could be verified. The AI-generated personal narratives described experiences in general terms. “A friend mentioned an interesting restaurant” rather than “My friend Sarah told me about this ramen place on 43rd Street.” Human writers anchor stories in concrete details because they are drawing from memory. AI generates plausible details that sound generic because they are synthesized from patterns rather than recalled from experience.
The genre breakdown revealed an important nuance. Personal narratives were the easiest to classify correctly (61% average accuracy) because the absence of genuine personal detail was noticeable. Technical explanations were the hardest (44% — worse than guessing) because AI produces competent, well-structured technical writing that is indistinguishable from what a knowledgeable human would produce. News reporting landed at 49%, essentially random, because the conventions of news writing are so rigid that both human and AI outputs look formulaic.
The False Positive Problem
Here is the finding that surprised me most. Participants did not just fail to catch AI writing. They also accused human writers of being AI at an alarming rate.
Across the ten human-written samples, 39% of all labels were wrong: participants marked genuine human writing as AI-generated. The rate was highest for one of the human-written technical explanations (54% mislabeled) and one of the human-written news reports (48% mislabeled). Any human writer who produces clean, well-structured prose is now at risk of being falsely flagged.
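The per-sample false positive rate is just the share of participants who called a genuinely human sample “AI.” A sketch of that tally, with hypothetical votes in place of my raw responses:

```python
# Share of participants who labeled each human-written sample as "AI".
# The vote lists are hypothetical stand-ins for my raw responses.
votes = {
    "human technical explanation": ["AI", "AI", "human", "AI", "human"],
    "human news report":           ["AI", "human", "AI", "human", "human"],
}

for sample, labels in votes.items():
    fp_rate = labels.count("AI") / len(labels)
    print(f"{sample}: {fp_rate:.0%} mislabeled as AI")
```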
Human judges are not alone in this. Automated AI detectors make the same mistake. GPTZero, one of the most widely used detection tools, had a 16% false positive rate in a 2025 study, flagging human-written essays as AI-generated. The consequences are not theoretical. Students have been accused of cheating on assignments they wrote themselves. Freelance writers have lost clients over false AI detection flags. A Nature commentary from 2024 warned that the bias is particularly harsh against non-native English speakers, whose careful, grammatically correct prose triggers AI detectors more often than the informal writing of native speakers.
The fundamental problem is that the qualities we associate with “good writing” — clarity, logical structure, correct grammar, smooth transitions — are the same qualities AI produces by default. When a human writes something polished, it looks like AI. When AI writes something, it looks polished. The overlap zone is enormous, and it grows wider as the models improve.
What This Means Going Forward
My experiment was small, informal, and limited to English text. I am not claiming it proves anything definitively. But it adds to a body of evidence suggesting three things that matter for how we think about AI-generated content.
First, the general population cannot reliably detect AI writing. Not because people are careless, but because the models have gotten good enough that the output falls within the normal range of human writing quality. My dinner friend who insisted he could spot ChatGPT “within two sentences” participated in the experiment. He scored 55%.
Second, detection ability correlates with AI fluency, not writing expertise. Professional writers did better than average, but frequent AI users did better still. The best detectors were people who use these tools every day and have developed an intuitive sense for the models’ default patterns. This suggests that detection is a learnable skill, but it requires exposure to AI output, not just writing ability.
Third, false positives are as dangerous as false negatives. Almost 40% of the time, participants flagged human writing as AI-generated. In a world where AI detection increasingly carries consequences — academic penalties, content moderation decisions, hiring judgments — the cost of false accusations is real and underappreciated. We are building systems that punish good writing by assuming it must be machine-generated.
The implication is not that we should stop trying to distinguish AI from human writing. It is that we should be honest about the limits of that distinction. Declaring something “definitely AI” based on a gut feeling, a detector score, or a checklist of stylistic tells is less reliable than most people assume. As the models continue to improve, the gap between AI and human prose will keep narrowing, and any system that depends on reliably telling them apart will eventually fail.
Maybe the better question is not “who wrote this?” but “is this accurate, useful, and honest?” That question has always been the right one. We just forgot about it while we were distracted by the authorship debate.
Frequently Asked Questions
How accurate are AI detectors compared to human judges?

It depends on the tool and the context. The best commercial detectors like Originality.ai claim accuracy rates above 95% on unedited AI output, but performance drops significantly on text that has been paraphrased, lightly edited, or generated with humanization prompts. GPTZero had a 16% false positive rate in controlled testing. Meanwhile, the best human detectors in my experiment, frequent AI users, achieved 63% accuracy, which is lower than tool accuracy on clean AI text but comes with far fewer false positives on human writing. The honest answer is that neither humans nor tools are reliable enough to be the sole basis for consequential decisions.
Can you learn to spot AI-generated writing?

The data suggests yes. The strongest predictor of detection ability in both my experiment and published research is frequent, hands-on use of AI writing tools. People who generate AI text daily develop an intuitive sense for the models' default patterns: the rhythm, the hedging style, the structural predictability. Reading AI output critically and regularly appears to build this skill faster than any formal training program. That said, as models improve, the tells become subtler, so this is an arms race where the detector has to keep learning.
Should schools and employers rely on AI detection tools?

Not as the sole measure. The false positive rates of both human judgment and automated tools are high enough to guarantee false accusations. A more defensible approach combines process-based assessment (observing drafts, requiring revision histories, conducting oral examinations) with detection tools used as one signal among many rather than a definitive verdict. Several universities have moved in this direction after high-profile cases where students were wrongly accused based on detector output alone. The technology is useful as a screening flag, not as judge and jury.