ChatGPT was truly a revolution. But why? What was different about it, and how did the underlying technology come about? This guide explains the history, the advancements and the technology in plain language so that anyone can understand it, and therefore better understand what’s coming next.
A.I. has been around for a long time. The term “artificial intelligence” was coined in 1956, at a conference whose organisers predicted that human-level intelligence would be replicated by a machine within a generation. Not only did that not happen, but progress in AI has been glacially slow. And people have been wrongly predicting that humans would be succeeded by machines for more than a century - as far back as 1863, Samuel Butler wrote:
…we are ourselves creating our own successors; we are daily adding to the… self-regulating, self-acting power which will be to them what intellect has been to the human race. In the course of ages we shall find ourselves the inferior race.
That wasn’t just some sci-fi fan-fiction - the theoretical underpinnings for neural networks were being developed around then, and the first operational recurrent neural network, based on a model of magnetism, appeared in 1925.
In the 1950s, the first AI computer programs were written: one proved mathematical theorems, the other played checkers. In the decades since, progress across the various branches of AI has been very incremental.
What you’ll notice about the more successful applications of AI is that they’ve been extremely specific, or “narrow”. For example, Deep Blue beat the world chess champion Garry Kasparov in 1997, and AlphaGo beat a human Go champion nearly 20 years later, in 2016.
So although predictions of an AI that could do multiple things better than a human (a “strong” AI) have been around for many decades, there had been essentially no evidence of that actually coming true.
Until now.
And the evidence for a machine that could do multiple things better than humans appeared not in the domain of board games, but in the domain of language.
This is surprising, because human language is actually really hard for computers. There are so many ambiguities and nuances that we take for granted. For example, let’s look at a pretty simple sentence in English and think about how we could translate it to another language:
Even though she was really tired, Sarah couldn’t sleep because the dog was barking.
Machine translation systems traditionally worked by processing a sentence sequentially, word by word. But it isn’t good enough to simply substitute each word with the corresponding word in the target language. Some approaches instead identify groups of words or phrases and replace those, but that doesn’t work particularly well either.
What’s really challenging is context. In a sentence like this, what does “she” mean? Here it refers to Sarah, but the two words are far apart, and the pronoun appears before its referent, so a translator has to hold “she” in memory as an unresolved reference and hope it can be connected to something that comes later.
Processing text sequentially like this, word by word, while collecting up these unresolved pieces of meaning, is very memory- and processor-intensive. It gets worse the greater the distance between semantically related words, and the broader the context you want to include - and when it comes to translation, more context is always better.
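To make that bookkeeping concrete, here is a deliberately naive Python sketch of word-by-word processing that has to park pronouns as unresolved references until a plausible antecedent turns up. The pronoun list and the matching rule are invented purely for illustration - real systems are far more sophisticated.

```python
# Toy illustration only: sequential processing that must hold
# unresolved pronouns in memory until a candidate antecedent appears.
sentence = ("Even though she was really tired , Sarah couldn't sleep "
            "because the dog was barking").split()

PRONOUNS = {"she", "he", "it", "they"}   # tiny, made-up list
pending = []                             # unresolved references held in memory

for position, word in enumerate(sentence):
    if word.lower() in PRONOUNS:
        pending.append((position, word))          # can't resolve it yet
    elif word[0].isupper() and position > 0:      # crude "proper noun" guess
        # Link every pending pronoun to this candidate antecedent.
        for pron_pos, pron in pending:
            print(f"'{pron}' (word {pron_pos}) -> '{word}' (word {position})")
        pending.clear()
```

The point is simply that the further apart the related words are, the longer the system has to carry that unresolved state around.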
So after decades of R&D, translation between languages remained really hard because of the incredible challenges of context.
Google Brain had been doing a lot of research into how AI - specifically deep learning neural networks, like the ones used to master Go - could help with language translation. In 2017, the year after the Go champion was beaten, they published a paper called “Attention is All You Need”, which introduced what they called the Transformer architecture.
What they proposed was an approach with two significant differences:
The first was the way they trained their network: instead of training sequentially, word by word, they broke the text up into groups of words or sentences and processed each one as a chunk. The pieces of text were converted into numbers, and those numbers were batched up into a structure that could be processed very efficiently, and in parallel.
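As a rough sketch of what “converting text into numbers and batching it up” can look like - the tiny vocabulary and the padding scheme here are invented purely for illustration, and real models use far more sophisticated tokenisation:

```python
import numpy as np

sentences = [
    "the dog was barking",
    "sarah could not sleep because of it",
]

# Build a toy vocabulary: every distinct word gets an integer ID.
# ID 0 is reserved for padding shorter sentences up to a common length.
vocab = {"<pad>": 0}
for sentence in sentences:
    for word in sentence.split():
        vocab.setdefault(word, len(vocab))

# Convert each sentence to a list of IDs, pad them to the same length,
# and stack them into one matrix so the whole chunk is processed together.
max_len = max(len(s.split()) for s in sentences)
batch = np.array([
    [vocab[w] for w in s.split()] + [0] * (max_len - len(s.split()))
    for s in sentences
])

print(vocab)
print(batch)   # shape: (2 sentences, max_len word IDs)
```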
The second was their approach to the context problem: the idea of attention. Because they processed text in chunks, every word was seen alongside its surrounding words - its context. As training progressed, the model would encounter each word in many different contexts, and given enough examples it could start to build up an understanding of what words mean purely from the company they keep. That is what the paper’s title refers to: given sufficient examples, you can build up semantic meaning from context alone - attention really is all you need. That was the team’s breakthrough insight.
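The specific mechanism the paper uses for this is called scaled dot-product attention: every word’s vector is compared with every other word’s vector in the chunk, and the resulting weights decide how much each word “attends to” the others. Here is a minimal numpy sketch, with random vectors standing in for the learned word representations:

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)   # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    # Scaled dot-product attention, the core operation in the Transformer.
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)     # how relevant each word is to every other word
    weights = softmax(scores, axis=-1)  # each row sums to 1
    return weights @ V                  # each output mixes in context from the whole chunk

# A chunk of 5 "words", each represented by an 8-dimensional vector.
# In a real Transformer, Q, K and V are learned projections of word embeddings;
# here they are random numbers, purely to show the shapes and the mechanics.
rng = np.random.default_rng(0)
words = rng.normal(size=(5, 8))
output = attention(words, words, words)
print(output.shape)   # (5, 8): one context-aware vector per word
```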
Okay, so the team tested their new Transformer model on large English-to-German (4.5 million sentence pairs) and English-to-French (36 million sentence pairs) datasets, and achieved state-of-the-art results on the standard translation benchmarks. So hooray. But unless you were a data scientist working in language research, chances are you wouldn’t have heard of this paper at the time.
For people in the language field, the transformer architecture was a promising development, and a number of groups took the research and moved it forward in various ways, including Google themselves.
One of those groups was OpenAI, who had been around for a couple of years, built a few different products and done some research of their own - but nothing that set the world on fire. When they did their own further research on transformers in 2018, though, their approach would turn out to be very consequential. We’ll get to what they did differently in a second.
But first, remember how the original Transformer was trained on English-to-German and English-to-French sentence pairs? That is what’s called labelled data: it has to be carefully created, usually by humans, and it has to be accurate. That is difficult and slow. It’s also application-specific - datasets of translation pairs are good for training a model to translate, and nothing else. If you want to do some other language task, like sentiment analysis, you need to train a model on a dataset made for that specific application. So building good models for different applications is constrained by the volume and quality of labelled data you can get or make. Though, as the original Transformer paper showed, if you can provide enough data and training, the model you end up with can do that one thing better than anything else available.
So what did OpenAI do differently?
OpenAI’s breakthrough was to split the training into two stages. The first stage was to train the model on unlabelled data - not English-German translation pairs or question-answer pairs, just plain text. But an absolute mountain of it. They used a dataset called BookCorpus: around 7,000 books, or roughly 4.5 GB of text.
If you think about it, that sounds odd. It’s easy to see how something might learn when you give it a bunch of questions and the corresponding correct answers. But how does a model learn anything from plain text? What constitutes a question, and what counts as a correct answer? We’ll get to that in a second.
They called this first stage “pre-training”. Then came a second stage, which they called “fine-tuning”: taking the pre-trained model and training it on labelled data for the specific application they wanted to use it for - question-answer pairs for question-answering, labelled examples for sentiment analysis, and so on. What they found was that the pre-trained model needed drastically less task-specific data and training to reach state-of-the-art results, and that this held across question-answering, sentiment analysis, classification and other language tasks.
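To show the shape of that two-stage workflow - and only the shape: the “pre-trained” word vectors below are just random numbers, and the labelled dataset is four made-up examples, so this is not how GPT-1 was actually fine-tuned - here is a toy sketch of fitting a small sentiment classifier on top of representations that are assumed to already exist:

```python
import numpy as np

rng = np.random.default_rng(0)

# Stage 1 (pre-training) is assumed to have already happened: pretend these
# vectors are what the model learned about each word from mountains of plain
# text. Here they are random, purely to keep the sketch self-contained.
pretrained = {w: rng.normal(size=4)
              for w in ["great", "awful", "loved", "hated", "movie", "the"]}

def represent(sentence):
    # Represent a sentence as the average of its "pre-trained" word vectors.
    return np.mean([pretrained[w] for w in sentence.split()], axis=0)

# Stage 2 (fine-tuning): a tiny labelled dataset for sentiment analysis.
examples = [("loved the movie", 1), ("great movie", 1),
            ("hated the movie", 0), ("awful movie", 0)]
X = np.array([represent(s) for s, _ in examples])
y = np.array([label for _, label in examples])

# Fit a simple logistic-regression "head" on top of the fixed representations.
w, b = np.zeros(X.shape[1]), 0.0
for _ in range(500):
    p = 1 / (1 + np.exp(-(X @ w + b)))   # predicted probability of "positive"
    w -= 0.1 * X.T @ (p - y) / len(y)    # gradient step on the weights
    b -= 0.1 * np.mean(p - y)            # gradient step on the bias

# Probability that this sentence is "positive" according to the tiny head.
print(1 / (1 + np.exp(-(represent("loved the movie") @ w + b))))
```

The benefit OpenAI observed - that good pre-trained representations mean you need far less labelled data - only shows up with a real pre-trained model, but the division of labour is the same: a big general-purpose stage first, then a small task-specific stage on top.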
Okay, so what’s going on here? What happens in this magical pre-training stage, that makes a model better at learning these use cases?
What this pre-training stage is doing is learning the meaning of words through their context. And not just words, but grammar, usage, slang, structure, narrative. All the elements and forms of language.
Essentially, they get the model to try to predict the next word in a sentence, then give it feedback on how it did. They do this over and over again, with millions and millions of sentences, and eventually the neural network powering the predictions gets really good at predicting the next word. Hopefully you can see how this is possible with unlabelled text: you just feed sentences in word by word, and the next word in the text is the “correct answer” to compare against the model’s guess. This way you don’t need humans to come up with a whole bunch of questions and corresponding correct answers, and all of a sudden you have access to an enormous amount of training data.
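Concretely, the plain text supplies both the “questions” and the “answers”: every position in a sentence is a prediction exercise whose correct answer is simply the word that actually comes next. A toy sketch of how one sentence becomes training examples:

```python
text = "the dog was barking so sarah could not sleep"
words = text.split()

# Every prefix of the sentence becomes an input, and the word that actually
# follows it becomes the target that the model's guess is scored against.
training_pairs = [(words[:i], words[i]) for i in range(1, len(words))]

for context, next_word in training_pairs:
    print(f"given {' '.join(context)!r}, predict {next_word!r}")
```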
This training to predict the next word is called generative training, and once you have sufficiently pre-trained a Transformer model in this way, you have what’s called a base model. So in 2018, OpenAI released their first Generative Pre-trained Transformer - or GPT - base model, which they called GPT-1.
The GPT-1 base model was a real breakthrough. Here, for the first time, was something that could reach state-of-the-art performance on multiple language tasks with minimal labelled data and training.
So naturally, they wanted to see what would happen if they gave it way more data, and so for GPT-2 they gave it tonnes and tonnes of text… at least 10 times more than before. And then for version 3, same again - basically all the text they could get their hands on… much more than 10x again.
What they found was that each iteration improved the model’s performance on multiple benchmarks. But something else happened too: as they did more pre-training, less fine-tuning was required. Let’s take a second to think about what this means: the more unlabelled data the model is given to learn from, the fewer specific examples it needs in order to do specific things really well.
When OpenAI were testing GPT-2 - and even more so with GPT-3 - the researchers noticed that the models could behave in ways that were… unsettling.
For example, if they gave the model a prompt in the form of a question, it would complete it with an answer. And because the models had ingested such a vast swathe of human writing, the answers were sometimes incredibly insightful. When they prompted the model to write a story - say, with “Once upon a time…” - it would write a story that was sometimes incredibly compelling. And when they prompted it to write a poem, it would write a poem that was sometimes incredibly moving.
The researchers were astounded by what they had created, but they had serious reservations about allowing public access to it. So they spent the next two years putting guardrails around it and fine-tuning it for safe conversation. They built a new model, GPT-3.5, put a web chat interface on top of it, and released it as ChatGPT in December 2022. One million people signed up to use it within five days.
And the rest, as they say, is history.