Turnitin says its AI cheating detector isn’t always reliable

This article is a preview of The Tech Friend newsletter. Sign up here to get it in your inbox every Tuesday and Friday.

Turns out, we can't reliably detect writing from artificial intelligence programs like ChatGPT. That's a big problem, especially for teachers.

Even worse, scientists increasingly say using software to accurately spot AI might simply be impossible.

The latest evidence: Turnitin, a big educational software company, said that the AI-cheating detector it has been running on more than 38 million student essays since April has more of a reliability problem than it initially suggested. Turnitin — which assigns a "generated by AI" percent score to each student paper — is making some adjustments, including adding new warnings on the types of borderline results most prone to error.

I first wrote about Turnitin's AI detector this spring when concerns about students using AI to cheat left many educators clamoring for ways to deter it. At that time, the company said its tech had a less than 1 percent rate of the most problematic kind of error: false positives, where real student writing gets incorrectly flagged as cheating. Now, Turnitin says that on a sentence-by-sentence level, a narrower measure, its software incorrectly flags 4 percent of writing.

My investigation also found false detections were a significant risk. Before it launched, I tested Turnitin's software with real student writing and with essays that student volunteers helped generate with ChatGPT. Turnitin got more than half of our 16 samples at least partly wrong, including saying one student's completely human-written essay was written partly with AI.

The stakes in detecting AI may be especially high for teachers, but they’re not the only ones looking for ways to do it. So are cybersecurity companies, election officials and even journalists who need to identify what's human and what's not. You, too, might want to know if that suspicious email from a boss or politician was written by AI.

A flood of AI-detection programs has hit the web in recent months, including ZeroGPT and Writer. Even OpenAI, the company behind ChatGPT, makes one. But there's a growing body of examples of these detectors getting it wrong, including one that claimed the preamble to the Constitution was written by AI. (Not very likely, unless time travel is also now possible?)

The takeaway for you: Be wary of treating any AI detector like fact. In some cases right now, it's little better than a random guess.

A 4 percent, or even 1 percent, error rate might sound small, but every false accusation of cheating can have disastrous consequences for a student. Since I published my April column, I’ve gotten notes from students and parents distraught about what they said were false accusations. (My email is still open.)

In a lengthy blog post last week, Turnitin Chief Product Officer Annie Chechitelli said the company wants to be transparent about its technology, but she didn't back off from deploying it. She said that for documents that its detection software thinks contain over 20 percent AI writing, the false positive rate for the whole document is less than 1 percent. But she didn't specify what the error rate is the rest of the time — for documents its software thinks contain less than 20 percent AI writing. In such cases, Turnitin has begun putting an asterisk next to results "to call attention to the fact that the score is less reliable."

"We cannot mitigate the risk of false positives completely given the nature of AI writing and analysis, so, it is important that educators use the AI score to start a meaningful and impactful dialogue with their students in such instances," Chechitelli wrote.

The key question is: How much error is acceptable in an AI detector?

New preprint research from computer science professor Soheil Feizi and colleagues at the University of Maryland finds that no publicly available AI detectors are sufficiently reliable in practical scenarios.

"They have a very high false-positive rate, and can be pretty easily evaded," Feizi told me. For example, he said, when AI writing is run through paraphrasing software, which works like a kind of automated thesaurus, the AI detection systems are little better than a random guess. (I found the same problem in my tests of Turnitin.)

He's also concerned that AI detectors are more likely to flag the work of students for whom English is a second language.

Feizi didn't test Turnitin's software, which is available only to paying educational institutions. A Turnitin spokeswoman said Turnitin's detection capabilities "are minimally similar to the ones that were tested in that study."

Feizi said if Turnitin wants to be transparent, it should publish its full accuracy results and allow independent researchers to conduct their own research on its software. A fair analysis, he said, should use real student-written essays on different topics and writing styles, and address failure on each subgroup as well as overall.

We wouldn't accept a self-driving car that crashes 4 percent — or even 1 percent — of the time, Feizi said. So, he proposes a new baseline for what should be considered acceptable error in an AI detector used on students: a 0.01 percent false-positive rate.
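To put those percentages in concrete terms, here is a rough back-of-the-envelope sketch in Python. It is only an illustration: the pool of 10,000 essays is a made-up number, every essay in it is assumed to be entirely human-written, and each rate is treated as a whole-document false-positive rate. The function name `expected_false_flags` is hypothetical and has nothing to do with Turnitin's software.

```python
def expected_false_flags(human_essays: int, false_positive_percent: float) -> int:
    """Expected number of honest essays wrongly flagged as AI-written.

    Assumes every essay is human-written and that the false-positive
    rate applies to each whole document (an illustrative simplification).
    """
    return round(human_essays * false_positive_percent / 100)


# Hypothetical pool: 10,000 essays, all written by students themselves.
ESSAYS = 10_000

for rate in (4, 1, 0.01):
    flagged = expected_false_flags(ESSAYS, rate)
    print(f"{rate}% false-positive rate -> about {flagged} wrongly flagged essays")
```

Under those assumptions, a 4 percent rate wrongly flags about 400 essays out of 10,000 and a 1 percent rate about 100, while Feizi's proposed 0.01 percent threshold works out to roughly one.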

When will that happen? "At this point, it's impossible," he said. "And as we have improvements in large-language models, it will get even more difficult to get even close to that threshold." The problem, he said, is that the distributions of AI-generated text and human-generated text are converging on each other.

"I think we should just get used to the fact that we won't be able to reliably tell if a document is either written by AI — or partially written by AI, or edited by AI — or by humans," Feizi said. "We should adapt our education system to not police the use of the AI models, but basically embrace it to help students to use it and learn from it."

It's one of the scourges of online life: Have you ever been misled by what you suspect is a fake online review? I’m talking about the types of reviews you find on Amazon that recommend a product that falls apart after you buy it, or the type you find on Yelp that praises a doctor who turns out to have a totally icky bedside manner.

If you’ve got a story to tell about shady reviews, I would love to hear about your experience. Send an email to [email protected].
