Watch Your Language, Part 2 | Machine Yearning 002
"I don't know what GPT-3 is, and at this point I'm too afraid to ask."
Hello and welcome! This is part 2 of an introductory series on language models like GPT-3.
Part 1 explores language models at a high level, with both fun and dangerous applications.
Part 2 (this post) introduces GPT-3 for anyone who's heard about it but has no idea what makes it special.
🔬 Long Take: GPT-3 & Me
You've all probably seen article after article hyping up this thing called "GPT-3." You may have seen headlines like "Why GPT-3 Heralds a Democratic Revolution in Tech" or "One Model to Rule Them All". You may even know that GPT-3 has 175 billion parameters (kudos) without really knowing what a parameter is, just that 175 billion is a lot.
Not to worry, we'll clear that up in this post. About five minutes from now, you'll be able to chat with your co-workers about the exciting new language model.
What Even Are Parameters
I get it, after a certain point the numbers all sort of blend together.
The broader definition is that “parameters” are the coefficients (the weights) a machine learning model adjusts during training to optimize its predictions.
The plain-English understanding is that for a language model, parameters represent the strength of connections between words and word sequences. More parameters means more capacity to capture how different sequences behave in different contexts.
If a sentence starts with “Please pass the,” what is the probability that the next word will be “salt”?
What about “kangaroo”?
Chances are that most small models would never, ever predict "please pass the kangaroo" as a likely sequence. Most humans wouldn’t, anyway. If a model has never seen that sequence in its training data, there is an infinitesimal chance it will predict "kangaroo" as the next word.
However, if the training data is large enough and the parameters numerous enough that the model spans many different domains, there's a chance - however small - that the sequence comes up.
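To make that concrete, here's a minimal sketch that asks a much smaller, publicly available language model (GPT-2, via Hugging Face's transformers library) how likely "salt" and "kangaroo" are as next words. The exact numbers will vary by model, but the gap between the two should be enormous.

```python
# pip install torch transformers
import torch
from transformers import GPT2LMHeadModel, GPT2Tokenizer

tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
model = GPT2LMHeadModel.from_pretrained("gpt2")
model.eval()

prompt = "Please pass the"
inputs = tokenizer(prompt, return_tensors="pt")

with torch.no_grad():
    logits = model(**inputs).logits          # shape: (1, sequence_length, vocab_size)

# Probability distribution over the very next token
next_token_probs = torch.softmax(logits[0, -1], dim=-1)

for word in [" salt", " kangaroo"]:
    token_id = tokenizer.encode(word)[0]     # first sub-word token of the candidate
    prob = next_token_probs[token_id].item()
    print(f"P({word.strip()!r} | {prompt!r}) ≈ {prob:.6f}")
```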
What Are Domains
Traditionally, the domain of the training set is strongly correlated with performance on any one task.
For instance, a hate-speech detection model trained on language data from Wikipedia or another well-curated, grammatically sound corpus will rarely perform as well as a model trained on more informal sources like web forums, IRC, Discord chats, etc. There are at least three reasons for this behavior:
The insults, slurs, and syntactical/grammatical nightmares on those informal sources are far more varied, numerous, and current than anything in a formal, well-curated corpus like Wikipedia
Each forum may have unique lingo, like Redditors using /r/subreddits as hashtags, making it more difficult for models trained on other domains to predict subsequent words
Forum-based sources usually contain more diverse contexts (i.e. the words around the slur), so a model trained on that domain has a much better chance of recognizing contexts in the wild that match up with examples of hate speech it has already seen
At least, these are the traditional limitations of language model effectiveness.
ELI5 GPT-3
And then, of course, GPT-3 comes around. Over two orders of magnitude larger than its predecessor GPT-2, and ten times larger than any other model at the time of release, GPT-3 is renowned for its generalizability and for emergent properties unseen in prior language models - in particular, a meta-learning feature called "in-context learning," which we'll come back to in a bit.
Few-shot learners
OpenAI's paper "Language Models are Few-Shot Learners" introduced GPT-3 in May 2020. The term "few-shot" (as opposed to "one-shot" or "zero-shot") refers to the number of examples a model has to observe before it can infer the task it's being asked to accomplish.
Put more colorfully, picture the following:
You're unexpectedly and violently woken up to the sound of a masked man blaring Queen's "Bohemian Rhapsody" through a megaphone in your bedroom.
At "Caught in a landslide..." the song stops, and this mystery person gestures to you.
With neither a coffee nor a clue as to why this person barged into your home for this, you muster up the softest, most confused "...no escape from reality."
If the task was to predict the next lyric in the song, congratulations! You're a one-shot learner.
If, however, the task was something else - like the question-answering task "Who is the artist who sings this lyric?" - then you wouldn't have gotten it right the first time. It might take a few more examples from the masked man with a megaphone before you get it right. Or before you call the police. This is called "few-shot learning."
GPT-3 is particularly good at few-shot learning for a wide variety of NLP tasks, including translation, question-answering, word unscrambling, and even 3-digit arithmetic. This is a big part of why it's so generalizable: a single model can be used for many different applications.
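To make the "few-shot" idea concrete, here's a rough sketch of what a few-shot prompt looks like in practice: the worked examples in the prompt are the task specification, and no retraining is involved. The snippet uses OpenAI's Python client roughly as it looked during the 2020 beta; the engine name, parameters, and API key are placeholders and assumptions, not a definitive recipe.

```python
# pip install openai  (client and engine names circa the 2020 beta; details may have changed)
import openai

openai.api_key = "YOUR_API_KEY"  # placeholder

# The prompt itself defines the task: a few worked examples, then a new case to complete.
prompt = """Q: Who is the artist who sings "Caught in a landslide, no escape from reality"?
A: Queen

Q: Who is the artist who sings "Hello from the other side"?
A: Adele

Q: Who is the artist who sings "Here comes the sun"?
A:"""

response = openai.Completion.create(
    engine="davinci",   # illustrative engine name
    prompt=prompt,
    max_tokens=8,
    temperature=0,      # make the continuation as deterministic as possible
    stop=["\n"],        # stop at the end of the answer line
)
print(response.choices[0].text.strip())
```

Swap the examples out and the same prompt format works for translation, arithmetic, and plenty of other tasks - which is exactly what "generalizable" means here.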
In-context learning
The reason for this is that GPT-3 seems to infer the type of task it's being asked to solve much faster than smaller models. Its sheer size gives it a broad range of pattern-recognition skills, which it uses to quickly adapt to whatever unspecified task the user is prompting it with.
This meta-learning characteristic is called "in-context learning," which means that the model is able to infer the task it’s trying to accomplish on the fly, learning as it goes.
When prompted with even just a handful of examples, GPT-3 achieves remarkable accuracy not just on traditional NLP tasks, but also on new adaptations like writing articles, guitar tabs, and even computer code that otherwise would have required specialized models.
Is this intelligence?
I wouldn't say so; it's more like extremely high-dimensional mimicry.
Recall that in Watch Your Language, Part 1, a Berkeley student was able to write entire productivity blog posts by prompting GPT-3 with a title and introduction. Does the fluency of the language, and its ability to fool human readers, imply it truly understands (as we would) causal links between sentences and aspects like "narrative"?
Not necessarily. Lifestyle blogs are one of the most popular blog formats, and posts on productivity are quite numerous. Because 60% of GPT-3's training mix is made up of crawled websites from all over the internet, it's safe to assume the model has seen enough examples of lifestyle blogs to craft language that resembles them.
Some common tells are that GPT-3:
Has trouble maintaining narrative consistency over longer documents
Underperforms on language generation that requires logic and reasoning. Lifestyle blogs don't require either, which is how it was able to dupe readers so easily.
Regardless, GPT-3's size has undoubtedly made its meta-learning characteristics much more performant and easier to identify. It's still a mystery to most researchers exactly how in-context learning works, but the excitement to see how much further we can take it is palpable.
What is indisputable so far is that the more data a model trains on, the better it generally seems to perform.
In Closing: The Cost of Capital
Which brings us to our final point. As models grow, so do their costs of capital. Using the ~$1.3 million per-run estimate for Google's T5-11B, its previous 11-billion-parameter language model (State of AI Report 2020, slide 17), experts ballpark the all-in cost of training a transformer language model - after all rounds of training - at roughly $1 per 1,000 parameters.
Roughly speaking, that puts the final tally for GPT-3's training costs at around $175 million for a single model. There are very few institutions that can afford that kind of capital. And if you’ll recall from part 1, there are risks associated with such large models, including disturbing encoded biases and the budding replicability crisis.
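For transparency, here's the back-of-envelope math behind that number, using only the ballpark assumptions above (not official OpenAI figures):

```python
# Back-of-envelope reproduction of the estimate above; inputs are the
# ballpark assumptions from the text, not official figures.
t5_run_cost = 1.3e6        # ~$1.3M per T5-11B training run (State of AI Report 2020)
t5_params = 11e9           # 11 billion parameters
print(f"T5-11B, single run: ~${t5_run_cost / t5_params:.5f} per parameter")

# After all rounds of training and experimentation, the ballpark rises to ~$1 per 1,000 parameters.
all_in_cost_per_param = 1 / 1_000

gpt3_params = 175e9
print(f"GPT-3, all-in estimate: ~${gpt3_params * all_in_cost_per_param:,.0f}")
# -> GPT-3, all-in estimate: ~$175,000,000
```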
To wrap this up, I'll repeat what Dr. Kai-Fu Lee of Sinovation Ventures recently shared when asked about the investment potential of massive language models at this year’s AAAI conference. He shares similar concerns about the growing tendency towards monopolistic advantage in large models. The full video and slides are in the 🎥 Watching section of this week’s bonus content.
I believe that the new, huge language models that appear to make a big difference, also come with a large cost in computation. That on the one hand has energy implications, but it also makes it harder and harder for universities to compete with internet giants. Google and Microsoft, they can build a $100mm computer, and train models with a trillion parameters, but professors and researchers cannot. Every country should think about how we make those types of resources available to researchers and startups so that the giant corporations don't end up with a... monopolistic advantage.
Bonus Content
Articles, reports, videos, and more worth checking out this week that didn’t make the featured cut
💹 AI in the Markets
Building a digital twin of the planet (ETH Zurich)
European and ETH Zurich computer scientists proposed developing a digital twin of the Earth for high-precision environmental and climate modeling across space and time. Researchers estimate this system would require ~20,000 GPUs, consuming roughly 20MW of power.
SuperAnnotate, a no-code computer vision platform, partners with OpenCV (TechCrunch)
This partnership could lower the barrier to entry to computer vision applications for a lot of business leaders. Personally, I own 2 OpenCV OAK cameras. They come out-of-the-box with object detection inference models onboard, and are quite impressive.
🔧 Hardware & ASICs
Semi Demand 30% Above Supply, 20% Year-on-Year Growth (AnandTech)
TSMC, Samsung, UMC, and GlobalFoundries leading the pack, with China's SMIC in 5th. Inventories are drying up since "semi companies are shipping 10% to 30% below current demand levels, and it will take at least 3-4 quarters for supply to catch up with demand."
💡 Startups & Strategy
Why There's No Such Thing as a 'Startup Within a Big Company' (Waze vs Google)
Waze co-founder and CEO Noam Bardin left Google in January and published a personal essay detailing "the trickle-down problem" in thoughtful and heartbreaking detail.
🎧 Listening
The chip choke point. A single machine from the Netherlands could catapult China to the leading edge of the semiconductor industry. If the U.S. allowed it, that is. (The Wire China)
EUV light sources emit light at incredibly short wavelengths - less than 20nm - which is necessary to carve circuit features onto nodes below 7nm. The company, ASML, has been courted by China for a decade for a potential acquisition; however, the patent for EUV is actually owned by a US company. Without it, China will have to pursue alternative means to go from 14nm to 7nm.
🎥 Watching
AI Infusion & Investment Opportunities, with Kai-Fu Lee (AAAI-21)
Dr. Kai-Fu Lee (founder of Sinovation Ventures, author of AI Superpowers) highlights the contributing factors to China's competitive rise within applied AI, and identifies some oft-overlooked investment themes.
Thanks for reading!
Machine Yearning is a collection of essays and news on the intersection between AI, investing, product, and economics, light on technicals but heavy on relevance. Think of it as a casual chat about AI over coffee (or any other preferred beverage).
Ryan Cunningham is an AI Product Manager, strategist, and ex-investment banker. Today he leads applied AI strategy and new verticals at Spiketrap, an NLP-as-a-service company. He spent 4 years at Uber and is currently studying Artificial Intelligence part-time at Stanford, with a BS in Finance and Economics from Georgetown University.
Any suggestions or topics you want to see? Shoot me an email at rydcunningham@gmail.com or hit me up on Twitter / LinkedIn / Clubhouse @rydcunningham.