In our last blog post, we set the stage by briefly skimming the long history of language-based AI systems, ending with a revolutionary milestone that has redefined the field: transformer-based large language models (LLMs). No, these models have nothing to do with Optimus Prime; the only thing these transformers have “shape-shifted” is how machines understand and process language.
Transformer-based models were first popularized in the seminal 2017 research paper “Attention Is All You Need,” authored by a team from Google and the University of Toronto (shout out!), and since hailed as one of the most powerful advancements in language-based AI. So, what makes transformer-based models special? They use a deep learning mechanism called attention to uncover context and relationships in sequential data - like tracking “logical” connections between words, even when they’re far apart in a sentence or paragraph.
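For the curious, here is a minimal sketch of that core operation - scaled dot-product attention - which sits inside every transformer layer. It assumes PyTorch and uses random toy vectors rather than real word embeddings; it isn’t NLPatent’s implementation, just an illustration of how every position gets to “look at” every other position, no matter how far apart they are.

```python
import torch
import torch.nn.functional as F

def scaled_dot_product_attention(q, k, v):
    # Each position scores its relevance against every other position,
    # so distant words can influence each other directly.
    scores = q @ k.transpose(-2, -1) / (k.shape[-1] ** 0.5)
    weights = F.softmax(scores, dim=-1)  # how strongly each word "attends" to the others
    return weights @ v, weights

# Toy example: five "word" vectors of dimension 8 (random, for illustration only)
x = torch.randn(5, 8)
contextualized, attention_weights = scaled_dot_product_attention(x, x, x)
print(attention_weights.shape)  # torch.Size([5, 5]): one weight for every pair of words
```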
Because of this, a transformer-based LLM, unlike most older model architectures, can (among other things) recognize that the same word takes on different meanings depending on the context in which it is used. Take this example:
Sentence 1: This AI model can run thousands of simulations in seconds.
Sentence 2: I tried to run yesterday, but my fitness tracker said, ‘Are you even trying?’
Even skimming the above sentences, you’ll immediately understand that the word “run” takes on two entirely different meanings in Sentence 1 versus Sentence 2. We take nuances like this for granted, but they are genuinely difficult for a computer to grasp - historically, most language models would have assigned the same meaning to both occurrences. However, just as you or I understand that each occurrence is distinct based on context, a transformer-based LLM is able to tell the difference by analyzing the words surrounding “run” in each sentence. And the crazy part is, it learns how to do this on its own, without human-labeled data, through a training task called token masking - an approach known as self-supervised learning. Token masking works by randomly hiding words in the training data, providing the surrounding context, and asking the model to predict the hidden word. As the model gets better at predicting hidden words, it also builds a generalized understanding of how language works, which can be applied to other linguistic tasks. This is how we’re able to use these models for patent search!
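You can see both ideas in action with a few lines of Python. The sketch below is purely illustrative - it assumes the open-source Hugging Face transformers library, PyTorch, and the publicly available bert-base-uncased model, not NLPatent’s own models - but it shows how the vector for “run” really does change with context.

```python
# Minimal sketch: the same word gets a different contextual vector in each sentence.
import torch
from transformers import AutoModel, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("bert-base-uncased")
model = AutoModel.from_pretrained("bert-base-uncased")

sentences = [
    "This AI model can run thousands of simulations in seconds.",
    "I tried to run yesterday, but my fitness tracker said, 'Are you even trying?'",
]

run_vectors = []
for text in sentences:
    inputs = tokenizer(text, return_tensors="pt")
    with torch.no_grad():
        hidden = model(**inputs).last_hidden_state[0]  # one vector per token
    # Locate the token "run" and keep its contextual embedding
    run_index = inputs["input_ids"][0].tolist().index(tokenizer.convert_tokens_to_ids("run"))
    run_vectors.append(hidden[run_index])

similarity = torch.cosine_similarity(run_vectors[0], run_vectors[1], dim=0)
print(f"cosine similarity between the two 'run' vectors: {similarity.item():.2f}")
# Noticeably below 1.0: the model encodes "run a simulation" and "go for a run"
# differently, even though the surface word is identical.
```

And here is token masking in miniature, using the same model: hide a word, and let the model predict it from the surrounding context alone.

```python
from transformers import pipeline

fill_mask = pipeline("fill-mask", model="bert-base-uncased")
for prediction in fill_mask("This AI model can [MASK] thousands of simulations in seconds.")[:3]:
    print(prediction["token_str"], round(prediction["score"], 3))
```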
Transformer-based LLMs mark a seismic shift in what’s possible with language-based AI systems. A helpful parallel can be drawn in the realm of computer vision. In computer vision, the release of ImageNet - a massive dataset of labeled images - transformed the field by enabling machine learning models to achieve groundbreaking performance in recognizing and categorizing visual information. In natural language processing (NLP), transformer-based LLMs have had a similar impact. Often referred to in the machine learning community as NLP’s “ImageNet moment”, the introduction of these models demonstrated the power of training on vast, unlabeled text datasets. Using self-supervised learning, LLMs capture the complexities of language, uncovering patterns and relationships without human-labeled data. This approach has significantly expanded what’s possible in NLP, enabling models to adapt to a wide range of language tasks with unprecedented accuracy and flexibility.
Today, transformer-based models come in all shapes and sizes that are optimized for different tasks (teaser alert! We will dig further into this in a future post). But we’d be remiss not to at least mention the best known application that made “large language model” part of everyone’s vocabulary: OpenAI’s ChatGPT - where GPT stands for "Generative Pre-trained Transformer." GPT is pre-trained on massive datasets to learn general language patterns like grammar, word relationships, and context. This pre-training equips it to handle a wide variety of tasks, such as summarization, translation, or question-answering, once fine-tuned for specific purposes. In ChatGPT’s case, the underlying model was further refined using conversational data and a technique called reinforcement learning from human feedback (RLHF), which involved real people evaluating the model’s responses, teaching it to craft more accurate and intuitive answers. At NLPatent, we use similar techniques to fine-tune our language models to be particularly well suited for patent research (again, more on this later…).
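As a tiny taste of what “generative pre-training” looks like in practice, here is a hedged example using GPT-2, an earlier, openly available ancestor of ChatGPT’s models, via the Hugging Face transformers library. This is an illustration only - not ChatGPT and not NLPatent’s models - and the prompt is just a made-up example.

```python
from transformers import pipeline

generator = pipeline("text-generation", model="gpt2")
result = generator(
    "A patent search engine powered by large language models can",
    max_new_tokens=30,
    num_return_sequences=1,
)
print(result[0]["generated_text"])
# The base model simply continues the text using patterns learned in pre-training;
# instruction-following behaviour like ChatGPT's comes from the additional
# fine-tuning and RLHF described above.
```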
Although similarly beyond the scope of this post, it’s worth noting that transformer-based models aren’t just used for language-based tasks. Their impact stretches into the field of medicine - helping diagnose diseases, develop treatment plans, and advance scientific research. For example, AstraZeneca and NVIDIA developed a transformer-based model called MegaMolBART, which understands chemistry and aids in drug discovery by analyzing molecular data.
From generating human-like text to assisting in scientific breakthroughs, transformer-based models are highly adaptable, making them truly foundational to modern AI. And if that hasn’t blown your mind yet, here’s a fun teaser for our next post: GPT-3, the model family that ChatGPT was built on, was trained on a dataset of around 45 TB of raw text, filtered down to about 570 GB - from which roughly 300 billion tokens were used for training. Stay tuned as we explore what makes large language models so large and how they handle such massive amounts of data to push the limits of (artificially) human intelligence and creativity.
Sign up to receive alerts when we publish more blog posts about how NLPatent has brought success to the field of IP and AI.