
What are large language models like ChatGPT?

Large Language Models (LLMs), exemplified by systems like ChatGPT, are transforming the landscape of digital learning and interaction. But while most of us know what ChatGPT is, few of us could answer many questions about how it actually does what it does.

So… what exactly are LLMs, and how do they learn to mimic human-like language so effectively?

LLMs are essentially sophisticated machine learning models that are trained on extensive volumes of text data.

Their primary objective is to learn and replicate the patterns, structures, and nuances of human language.

For teachers and students, this means having access to a tool that can understand queries, generate informative content, and even engage in interactive dialogue, all in a manner that closely resembles human interaction.

LLMs are trained by being exposed to vast amounts of text that encompass a wide array of subjects, styles, and structures.

One such LLM, Meta’s Llama 2, was pre-trained on publicly available online data sources, including webpages scraped by CommonCrawl, open-source repositories of source code from GitHub, Wikipedia in 20 different languages, and public-domain books from Project Gutenberg. The model was trained primarily on English, with additional data from 27 other languages.

This extensive training enables the models to understand context, grasp idiomatic expressions, and even capture the subtleties of tone and sentiment.

This form of training, and the concept of using LLMs more generally, evolved from the realisation that training individual models step by step for individual use cases was not the most efficient use of AI technology. Instead, the field moved towards creating a single, large, pre-trained model that could then be ‘fine-tuned’ to different needs.

LLMs are designed not just to process text, but to generate it in a way that feels natural and contextually appropriate, or, if you will, ‘human-like’.

This involves complex algorithms that predict the next word in a sentence based on the preceding context, a process that mirrors how humans construct language.

The result is an AI model that can write essays, explain concepts, or engage in dialogue with a level of fluency and coherence that mirrors human writing.
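To make this concrete, here is a deliberately tiny sketch of next-word prediction. It is not how ChatGPT actually works internally; it simply predicts the next word from how often word pairs appear in a miniature ‘training corpus’:

```python
from collections import Counter, defaultdict

# A toy "training corpus" standing in for the vast text an LLM learns from.
corpus = "the cat is on the mat the cat is on the chair the dog is on the mat".split()

# Count how often each word follows each preceding word (a bigram model).
following = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    following[prev][nxt] += 1

# The most likely words to follow "the", ranked purely by frequency.
print(following["the"].most_common(3))
# [('cat', 2), ('mat', 2), ('chair', 1)]
```

Real LLMs replace these raw counts with a neural network that scores every word in its vocabulary, but the underlying task, predicting what comes next from what came before, is the same.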

Understanding the architecture and learning methods of LLMs helps us grasp both the possibilities and the limitations of this technology.

The power and sophistication of ChatGPT lie in its foundational architecture and the advanced learning methodologies it employs, or, put simply, in the way it works.

ChatGPT’s models are built upon the transformer architecture, a deep learning model that has revolutionised natural language processing.

Unlike its predecessors — such as Recurrent Neural Networks (RNNs), which process text sequentially — the transformer model employs a mechanism known as "self-attention," which allows the model to weigh the importance of each word in a sentence, regardless of its position.

What this means is that newer models can consider the context of words holistically. They have a better ‘memory’ and a more nuanced understanding of language, and can capture both nearby and faraway connections between words in an effective way.
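As a minimal sketch of the idea, here is scaled dot-product self-attention in a few lines of Python. For simplicity it uses the word vectors directly as queries, keys, and values; real transformers add learned projections, multiple attention heads, and many stacked layers:

```python
import numpy as np

def self_attention(x):
    """Scaled dot-product self-attention over a sequence of word vectors.

    x has shape (sequence_length, embedding_dim). Each output vector is a
    weighted mix of every input vector, so every word can 'attend' to every
    other word, regardless of distance.
    """
    d = x.shape[-1]
    scores = x @ x.T / np.sqrt(d)                   # relevance of each word to every other word
    weights = np.exp(scores)
    weights /= weights.sum(axis=-1, keepdims=True)  # softmax: each row sums to 1
    return weights @ x

# Four 'words', each represented by a 3-dimensional vector.
sentence = np.random.randn(4, 3)
print(self_attention(sentence).shape)  # (4, 3): one context-aware vector per word
```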

The concepts of tokens and training data are central to the inner workings of AI.

To LLMs, text is ‘tokenised’ — meaning it is broken down from a string of text into a series of tokens. The primary goal of tokenisation is to represent text in a way that is meaningful for machines without losing its context.

Tokens can range from a single character to a whole word, depending on the context and the structure of the language.

For example, the sentence "What restaurants are nearby?" can be tokenised into individual words or subwords like "What," "restaurants," "are," and "nearby."  In more complex cases, tokenisation can involve breaking down words into smaller parts, such as "tokenisation" into "token" and "isation."
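We can see this in practice with OpenAI’s open-source tiktoken library, the tokeniser used by several of its models (other LLMs use their own tokenisation schemes, and the exact IDs below are illustrative):

```python
import tiktoken  # pip install tiktoken

# cl100k_base is the token encoding used by several OpenAI models.
enc = tiktoken.get_encoding("cl100k_base")

token_ids = enc.encode("What restaurants are nearby?")
print(token_ids)  # a list of integer token IDs, e.g. [3923, 15926, 527, 6222, 30]

# Map each ID back to the piece of text it represents.
print([enc.decode_single_token_bytes(t) for t in token_ids])
# e.g. [b'What', b' restaurants', b' are', b' nearby', b'?']
```

Notice that the model never sees letters at all, only these integer IDs.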

Once the text is tokenised, ChatGPT's primary task is to predict the most relevant next token or series of tokens based on the context provided by the preceding tokens. In other words, it guesses what ‘should’ come next.

For example, if the model was asked to complete the sentence “the cat is on the”, it would use the existing tokens to predict the next word. It may, in this example, predict the next word to be “mat”. The predicted token “mat” would then be considered by the model in generating the next word.

In this way, the model generates text word by word, ensuring that each token is coherent and contextually relevant to the preceding tokens.
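The loop that does this is simple in outline. In the sketch below, predict_next_token is a hypothetical stand-in for the neural network, hard-coded here to continue the example sentence above:

```python
def predict_next_token(tokens):
    """Hypothetical stand-in for the model.

    In ChatGPT this is a large neural network assigning a probability to every
    token in its vocabulary; here we hard-code one continuation for illustration.
    """
    continuation = {
        "the cat is on the": "mat",
        "the cat is on the mat": ".",
    }
    return continuation.get(" ".join(tokens), ".")

# Autoregressive generation: each predicted token is appended to the context
# and fed back in when predicting the token after it.
tokens = "the cat is on the".split()
for _ in range(2):
    tokens.append(predict_next_token(tokens))

print(" ".join(tokens))  # the cat is on the mat .
```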

So, how does the model predict what token should come next?
In short, it has been trained on vast amounts of human-written text: books, articles, websites. All of this has been used to inform ChatGPT’s understanding of language patterns, trends, regularities, and conventions.
 
What this means is that LLMs are limited by the quality and accuracy of the information they have been trained on.

As educators, we need to be cognisant of these limitations and the ethical considerations surrounding them.

The bias of training data is one such limitation that must be understood. If bias exists in the information ingested by models such as ChatGPT, then the outputs they present will show that same bias.

For example, given the patterns in the data it has been shown, ChatGPT will tend to picture a leader as male. OpenAI is actively identifying biases such as this and implementing safeguards to re-educate the model, but the bias in the underlying data remains.

The stereotypes and inequality faced by humans exist for artificial intelligence too.

Accuracy of information is also a concern. If the training data tells ChatGPT ten times over that cows are pink, with only one source saying they are black and white, ChatGPT will tell you they are pink. It favours volume — it cannot tell the difference between fact and fiction, unless explicitly told.

In the classroom, this is, of course, critical. The generation of misleading or factually incorrect information, referred to as ‘hallucinations’ in AI parlance, can inadvertently perpetuate content that is inaccurate or inappropriate; this is a risk that teachers and students alike need to understand.

The moral is: think critically.

Just as we are taught in school, you need to verify your information with more than one source. ChatGPT should not be used as a standalone resource.

LLMs must be considered a tool to enrich learning — not to replace independent thought or analysis.