AI Education

Can you trust AI detection tools?

Written by James Smith | Apr 15, 2024 1:28:42 AM

In the 90s, university students had to do a lot wrong to be busted for plagiarism. They’d have to copy text out of a book, essentially verbatim, or be caught cheating directly off another student. They also had to work very hard to access any kind of information — meaning they had fewer opportunities and sources for plagiarism.

The internet changed all that. Information was suddenly freely available, making it a lot more tempting for students to copy text. But, if you did, there were systems that could pick up on it very easily.

It was all pretty cut and dried.

Now, with AI-powered search engines, access to information is even easier than before. But the correlation between information accessibility and plagiarism stands — so we find ourselves in a world where students can cheat in ways they never could before. Our students now have access to technology that can generate original work for them.

Teachers around the world are now leaning on AI detectors.

If students are using AI to do their homework, then we, as teachers, need a way to pick up on it. AI detectors go some way in attempting to meet this need, having been designed to discern whether a piece of text is generated by artificial intelligence, such as ChatGPT.

These detectors typically employ language models that are similar to, or even the same as, those used by AI writing tools. They analyse the text to determine whether its characteristics (style, complexity, and patterns) align with those typically produced by AI. They also look for signs of consistency and predictability that are less prevalent in human-written text.

There are two primary factors that form the basis of AI detection: perplexity and burstiness.

Put simply, perplexity measures the predictability of a text, while burstiness assesses the variation in sentence structure and length.

AI-generated texts tend to exhibit low perplexity, meaning the word choices are predictable and less likely to perplex or confuse the average reader. This low perplexity is the result of the very basics of AI functionality: it selects its next word based on what the training data says ‘should’ come next.

In contrast, human writing displays much more variance and, thus, higher perplexity: word choice can be unexpected, and unusual narrative decisions are made. Rather than picking the most likely next word, humans might pick the 252nd most likely next word.

The two restaurant reviews below illustrate the difference between low and high perplexity text.

AI-Generated Review:

"The new Italian restaurant in town offers a delightful dining experience. Their pasta dishes are cooked to perfection, with flavors that are rich and satisfying. The ambiance of the restaurant is cozy and welcoming, and the staff provides excellent service. Overall, this is a great place for a family dinner or a romantic evening out."
 
Human-Written Review:
 

“The minute you walk into the town’s latest Italian restaurant, you want to love the food. They could serve up packet pasta in a pre-made sauce and the chaotic, colourful, back-street-in-Italy ambience would almost do enough to convince me that I was eating Nonna’s secret recipe. Luckily — for me and for your next night out — the food was every bit as magical as the experience.”

Wording sequences such as “…up packet…” and descriptors like “…back-street-in-Italy…” are examples of unpredictable text that AI would be unlikely to generate.
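To make perplexity a little less abstract, here is a rough sketch of how it could be computed with an off-the-shelf language model. It assumes the Hugging Face transformers and PyTorch libraries and uses GPT-2 purely as an example; real detectors use their own models and scoring, so treat this as an illustration of the concept rather than how any particular tool works.

```python
# A rough sketch: scoring a text's perplexity with GPT-2 (illustrative only).
import torch
from transformers import GPT2LMHeadModel, GPT2TokenizerFast

tokenizer = GPT2TokenizerFast.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

def perplexity(text: str) -> float:
    # Tokenise the text and ask the model to predict each token from the ones before it.
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        # Passing the input ids as labels gives the average cross-entropy loss per token.
        loss = model(**inputs, labels=inputs["input_ids"]).loss
    # Perplexity is the exponential of that average loss: lower means more predictable.
    return torch.exp(loss).item()

print(perplexity("The new Italian restaurant in town offers a delightful dining experience."))
print(perplexity("They could serve up packet pasta in a pre-made sauce and the ambience would almost convince me."))
```

Lower numbers mean the model found the text more predictable, so a generic AI-written review would typically score lower than the quirky human one.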

On the other hand, burstiness in text — particularly in the context of AI detection — refers to the variation in sentence structure and length within a piece of writing. This concept is crucial in differentiating between AI-generated and human-authored texts.

AI will typically produce uniform sentence structures of similar lengths and will, importantly, adopt a tempo that is consistent and regular in a way that human writing is not. Humans write in lulls and bursts, shifting emphasis within a sentence and within a paragraph. Newer AI models know better than to repeat the same sentence structure over and over in one paragraph, but they still tend to produce writing that is tonally consistent.

The restaurant reviews above illustrate this idea of tempo regularity. The voice of the AI review is reasonably monotone and consistent throughout the piece, while the human-written text bounces between short and long sentences, slower and faster pacing.
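As a rough illustration of the idea (and not how any commercial detector actually implements it), burstiness can be approximated by measuring how much sentence length varies across a passage:

```python
# A toy approximation of burstiness: the spread of sentence lengths in a passage.
import re
import statistics

def sentence_lengths(text: str) -> list[int]:
    # Split on sentence-ending punctuation and count the words in each sentence.
    sentences = [s.strip() for s in re.split(r"[.!?]+", text) if s.strip()]
    return [len(s.split()) for s in sentences]

def burstiness(text: str) -> float:
    lengths = sentence_lengths(text)
    if len(lengths) < 2:
        return 0.0
    # A higher standard deviation relative to the mean suggests more human-like variation.
    return statistics.stdev(lengths) / statistics.mean(lengths)

ai_review = ("The new Italian restaurant in town offers a delightful dining experience. "
             "Their pasta dishes are cooked to perfection, with flavors that are rich and satisfying. "
             "The ambiance of the restaurant is cozy and welcoming, and the staff provides excellent service.")
print(burstiness(ai_review))
```

A low ratio reflects the steady, even tempo typical of AI output; a higher ratio reflects the lulls and bursts of human writing.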

The analysis of these elements allows AI detectors to evaluate and score texts on their likelihood of being AI-generated, providing a crucial tool for users needing to assess the origin of the content.
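If you imagine combining the two signals, a detector’s score might, in spirit, look something like the hypothetical function below. Real tools use trained classifiers and calibrated thresholds; the cut-off numbers here are invented purely to show the shape of the idea.

```python
# A hypothetical scoring function combining the two signals.
# The thresholds (100 for perplexity, 0.6 for burstiness) are made up for illustration;
# real detectors rely on trained classifiers, not hand-picked cut-offs.
def ai_likelihood(perplexity_score: float, burstiness_score: float) -> float:
    # Low perplexity and low burstiness both push the score towards "likely AI-generated".
    perplexity_signal = max(0.0, 1.0 - perplexity_score / 100.0)
    burstiness_signal = max(0.0, 1.0 - burstiness_score / 0.6)
    return round(0.5 * perplexity_signal + 0.5 * burstiness_signal, 2)

print(ai_likelihood(perplexity_score=25.0, burstiness_score=0.2))   # 0.71: reads as likely AI
print(ai_likelihood(perplexity_score=90.0, burstiness_score=0.55))  # 0.09: reads as likely human
```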

It’s imperative we understand the reliability and limitations of AI detectors.

AI detectors are relatively new and still in an experimental phase. As such, their reliability varies, with some tools achieving a degree of accuracy but often struggling to consistently differentiate between human and AI-generated texts. One AI detector even alleged that most of the American Constitution was generated by artificial intelligence. This means that either time travel is real or AI detection tools are not completely accurate.

Predictability and consistency in text are not necessarily bad things; in some forms of writing, they are even ideal.

Attributing these characteristics to AI writing alone can therefore lead to incorrect detections. This is especially true when analysing texts that have been deliberately crafted to be less predictable or have undergone significant editing post-generation.

This creates the risk of false positives (identifying human-written text as AI-generated) and false negatives (failing to identify AI-generated text).

As people come to understand how detectors work, their ability to evade detection will improve.

AI-generated texts that have been modified by humans inherently display elements of unpredictability and variability — factors relied on by detectors in assessing authenticity.

And, I hate to tell you, but the students are all over it.

I was speaking with a parent the other day who watched their kid generate their history homework using ChatGPT, and then put that text through a tool that altered the phrasing and word choices to match their own ‘style’.

There are a few key strategies to edit and paraphrase AI-generated text to evade detection by platforms such as ZeroGPT:

  1. Varying sentence structure, length and format to create a natural rhythm.

  2. Using synonyms to introduce variability and move away from AI’s vocabulary (for example, AI loves the words “delve” and “paradigm”).

  3. Avoiding repetitive keywords and phrases to increase perplexity.

  4. Introducing a conversational tone that mimics human speech.

  5. Including personal anecdotes and perspectives to add a layer of authenticity.

  6. Instructing the AI with descriptive prompts to generate more specific, nuanced content, leading to less generic outputs.

  7. Using paraphrasing tools to restructure AI-generated content in ways more typical of human rewriting and editing processes.

So, of course, AI detection accuracy is limited.

Research indicates that even the best AI detectors have significantly varied accuracy rates. Premium tools may offer higher accuracy, but no tool currently guarantees complete certainty.

One study tested the accuracy of GPTZero using 50 different pieces of text — 20 of which were written by ChatGPT, 30 of which were human written.

The findings were interesting. GPTZero was quite good at recognising texts written by humans — it got it right 90% of the time. However, when it came to identifying texts written by AI, it wasn't as accurate. It only correctly identified AI-written texts about 65% of the time. Overall, GPTZero correctly classified the texts 80% of the time.
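For anyone who wants to see where the overall figure comes from, the per-class results do reconcile with the 80% number:

```python
# Reconciling the study's per-class accuracies with the 80% overall figure.
human_texts, ai_texts = 30, 20
correct_human = 0.90 * human_texts   # 27 human-written texts correctly identified
correct_ai = 0.65 * ai_texts         # 13 AI-written texts correctly identified
overall = (correct_human + correct_ai) / (human_texts + ai_texts)
print(overall)  # 0.8, i.e. 40 of the 50 texts classified correctly
```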

This is important for us to know as teachers because it means that — sometimes — GPTZero might think a text written by AI was actually written by a human.

Based on the current standards of detection, we should therefore be using AI detectors as indicators rather than definitive proof of AI authorship.

As with everything in this space, the technologies and systems at our disposal will only improve — therefore, detection will likely get better.

For the last few years, OpenAI has claimed to be working on a way to watermark text generated by its AI models, such as GPT-3.5 or GPT-4.

The watermarking method being explored is a cryptographic pseudorandom function that adds an unnoticeable signal to the text, helping to identify AI-generated works. Workarounds may well be found eventually, but it could certainly help.
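OpenAI has not published the details, so anything concrete here is guesswork. Still, the general idea of keying a pseudorandom function to word choice can be sketched in a few lines; the toy example below is purely illustrative and is not OpenAI’s scheme.

```python
# A toy illustration of pseudorandom text watermarking (not OpenAI's actual method).
# The idea: a keyed pseudorandom function biases word choice during generation,
# and a detector that knows the key measures how often the biased choice was made.
import hashlib

SECRET_KEY = "example-key"  # hypothetical shared secret between generator and detector

def is_green(previous_word: str, candidate: str) -> bool:
    # A keyed hash deterministically marks roughly half of all candidate words
    # as "green" for any given context.
    digest = hashlib.sha256(f"{SECRET_KEY}|{previous_word}|{candidate}".encode()).digest()
    return digest[0] % 2 == 0

def green_fraction(text: str) -> float:
    # A detector counts how often the "green" choice was made at each step.
    words = text.split()
    flags = [is_green(prev, cur) for prev, cur in zip(words, words[1:])]
    return sum(flags) / len(flags) if flags else 0.0

# Unwatermarked human text should sit near 0.5; a generator that consistently
# prefers green words would push this fraction well above 0.5.
print(green_fraction("The minute you walk into the town's latest Italian restaurant"))
```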

The specific details and implementation of the watermarking technique are not publicly available, and it is not actually clear whether the current or future versions of GPT have this feature.

The ability to detect AI writing manually is a skill that teachers need to hone.

More than ever before, a teacher’s understanding of a student’s skill level and progression is key. In the same way that we have protected against plagiarism in the past, we need to remain alert to vast inconsistencies in performance, and we need to engage students in class to check their understanding.

When reviewing student work, teachers should check for:

  • Unusual capitalisation and American spellings (if not common for the student).

  • A lack of sentence structure variation.

  • Overly verbose language.

  • An overly polite and formal style.

  • Over-use of language that ‘hedges’ — phrases like "It's important to note that..." or "Some might say that...", which possibly indicate a lack of bold, original statements.

  • Inconsistency in voice compared with the student’s usual style.

  • Strange repetitions or patterns that seem unnatural.

  • Overly close engagement with the wording of the prompt (what the student originally typed in may feature very heavily throughout the piece).

  • A monotonous writing style.

  • Significant differences between the quality of the current assignment and the student’s past submissions.

This review needs to be supported by in-class discussion that tests a student’s understanding. As teachers, we can:

  • Have a conversation with the student about their work.

  • Discuss the main ideas, choice of words, phrases, and argument development.

  • Evaluate their understanding and ability to explain concepts presented in their writing.

Automated detection is likely to always lag behind AI development — so we need to develop our capacity to pick up on inconsistencies, and investigate.

Understanding AI is important. If students are using it, we need to understand what tools they are using and how.

One way to do this is to experiment with AI writing tools yourself. The more you interact with AI and observe the type of texts it generates, the more you will come to know its style, its tone and its voice. In the same way that you could pick out the writing of a kid in your English class, you will eventually become familiar with the writing of AI.

The truth is, AI is here to stay. It will be a large part of many workplaces when today’s students hit the workforce — so their ability to use it is not a bad thing. Our role is to make sure that AI is used in an academically responsible way, and that students are not losing other skills due to AI dependency.