AI Series

Why Size Matters in Large Language Models

The Rise of Large Language Models

Unless you’ve been living under a rock for the past few years, you’ve definitely heard the term “large language model” (LLM) thrown around. Sure, you’ve heard the term, but have you ever wondered what makes an LLM, well, large, and why size matters?

You might be surprised to learn that language models have been around for decades, but until recently, they weren’t exactly making headlines (let alone passing the bar exam). Then came LLMs, and suddenly, AI wasn’t just assisting with spellcheck - it was writing essays, generating code, and even cracking jokes. 

Compared to their predecessors, LLMs have gotten a lot bigger - as their name suggests. Traditionally in machine learning, simply making a model bigger didn’t always produce better results. Models with a large number of parameters relative to the amount of training data were often prone to overfitting, where they’d memorize the training data instead of learning useful patterns. Plus, as you might already know, the larger a model is, the more time-consuming and expensive it is to train, guzzling computational resources and requiring significant technical expertise to get off the ground. So if scaling language models is so costly, and better performance isn’t a given, why have LLMs become the hottest thing since sliced bread?

Scaling Up and Why Size Matters

LLMs are different from their predecessors: their performance keeps improving predictably as they grow in size (size, in this context, meaning the number of parameters).

If you read our first blog post, you’ll know that LLMs are a type of neural network - a kind of machine learning model that loosely mimics the structure of the human brain. The more parameters an LLM has, the more interconnections exist within the system, and the greater its capacity to learn. With a greater capacity to learn, larger LLMs can be trained on bigger and more diverse datasets. This allows them to better grasp language nuances, understand a wider range of contexts, and generate more accurate responses. You may have noticed this yourself if you’ve used any of the GPT-3 models versus the GPT-4 series - not only can GPT-4 handle both text and image inputs (whereas GPT-3 is limited to text), but it also outperforms GPT-3 models on exams designed for humans (such as the SAT) as well as on traditional machine learning benchmarks.
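If “parameters” still feels abstract, here’s a minimal sketch in plain Python (with made-up layer sizes) showing where the count comes from in a simple fully connected network: one learnable weight for every connection between neurons, plus one bias per neuron. Real LLMs use far more elaborate architectures, but the bookkeeping is the same idea at a vastly larger scale.

```python
def count_parameters(layer_sizes):
    """Count the weights and biases in a simple fully connected network."""
    total = 0
    for n_in, n_out in zip(layer_sizes, layer_sizes[1:]):
        total += n_in * n_out  # one weight for every connection between the two layers
        total += n_out         # one bias for every neuron in the output layer
    return total

# Toy example (layer sizes are illustrative only): 512 inputs, two hidden layers, 10 outputs
print(count_parameters([512, 1024, 1024, 10]))  # ~1.6 million parameters

# Doubling the hidden width roughly quadruples the hidden-to-hidden connections
print(count_parameters([512, 2048, 2048, 10]))  # ~5.3 million parameters
```

Notice how quickly the count grows as the layers get wider - and modern LLMs stack hundreds of such layers, which is how the totals climb into the billions and trillions.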

Taken together, this is why the most advanced LLM architectures are able to achieve state-of-the-art generative capabilities. Take GPT-4: it has the capacity to grasp subtleties in human communication - so much so that it can even understand sarcasm. This level of linguistic capability and social understanding is thanks to its “size” (meaning its capacity to learn). While OpenAI has not publicized the exact number of parameters, experts estimate a staggering 1.8 trillion.

[Figure: Sizing Up LLMs and How They Compare to the Human Brain Based on the Number of Parameters]

The Cost of Bigger Models

Suffice it to say, LLMs are extremely powerful language models with a tremendous capacity to learn and to perform sophisticated tasks that mimic human-like intelligence. But that power comes at a cost. The larger the model, the more compute it takes to train, and the hardware that supplies all that compute - made by companies like Nvidia - can be very expensive. The high cost of training LLMs has driven the development of smaller, specialized models that balance the benefits of greater learning capacity against the need for efficiency.

Speaking of high costs, let’s crunch the numbers to better understand the balance between learning capacity and its price tag. Training a large language foundation model is no small feat: for a GPT-3 model with 175 billion parameters, you might be looking at a price tag of over $12 million for the compute of a single training run alone.
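For the curious, here’s a rough back-of-envelope sketch of how estimates like that are put together. Every input below - the token count, the sustained per-GPU throughput, the hourly GPU price - is an assumption chosen for illustration, which is exactly why published figures for the same model vary so widely.

```python
# Back-of-envelope training-cost estimate (all inputs are illustrative assumptions)
PARAMS = 175e9             # GPT-3's parameter count
TOKENS = 300e9             # training tokens (commonly cited for GPT-3; assumed here)

# Rule of thumb: roughly 6 floating-point operations per parameter per training token
total_flops = 6 * PARAMS * TOKENS          # ~3.15e23 FLOPs

EFFECTIVE_TFLOPS = 20      # assumed sustained throughput per GPU (teraFLOP/s)
PRICE_PER_GPU_HOUR = 3.0   # assumed cloud price in USD

gpu_hours = total_flops / (EFFECTIVE_TFLOPS * 1e12) / 3600
cost = gpu_hours * PRICE_PER_GPU_HOUR

print(f"~{gpu_hours:,.0f} GPU-hours, ~${cost:,.0f} for one training run")
# ~4,375,000 GPU-hours, ~$13,125,000 for one training run
```

Swap in newer GPUs, better utilization, or discounted pricing and the total swings by millions of dollars - which is why cost estimates for the very same model can range so widely.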

Additionally, AI systems come with a lot of environmental baggage. We’ve only begun to study the carbon footprint of AI, but early findings are eye-opening; for example, a single AI-generated image can consume as much energy as fully charging a smartphone. Generating text is less power-intensive (1,000 text prompts still use about 16% of a phone’s battery), but the story doesn’t stop there. Massive volumes of fresh water are used to cool the data centers running AI models. Using GPT-3 again as an example, training that model in Microsoft’s state-of-the-art U.S. data centers is estimated to have directly evaporated roughly 700,000 liters of clean fresh water. And the demand is only increasing - according to the U.S. Department of Energy, water consumption by U.S. data centers could double or even quadruple by 2028 compared to 2023 levels, further straining an already stressed water infrastructure.
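To put the battery figure in perspective, here’s a tiny sketch that converts it into energy per prompt. The battery capacity is an assumption (typical smartphone batteries hold roughly 12 to 20 watt-hours); the 16% figure comes from the study cited above.

```python
# Convert "1,000 prompts ≈ 16% of a phone's battery" into energy per prompt
BATTERY_WH = 15.0        # assumed battery capacity in watt-hours (typical phone: ~12-20 Wh)
PROMPTS = 1_000
FRACTION_USED = 0.16     # ~16% of a full charge for 1,000 text prompts

energy_per_prompt_mwh = BATTERY_WH * FRACTION_USED / PROMPTS * 1000
print(f"~{energy_per_prompt_mwh:.1f} milliwatt-hours per text prompt")  # ~2.4 mWh
```

Tiny per prompt - but multiplied across the enormous volume of daily usage, it adds up quickly.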

Despite the challenges, LLMs are making a meaningful impact across a wide range of industries. They automate tedious tasks, enhance decision-making, and unlock creative possibilities we couldn’t imagine before. While not every application needs a behemoth like GPT-4, even smaller LLMs are proving extremely valuable - whether it’s helping businesses streamline customer support, accelerating scientific research, or enabling more intuitive human-computer interactions. As AI continues to evolve, the future will depend on striking the right balance between innovation, efficiency and sustainability, ensuring these powerful models remain both accessible and responsible.

More Blog Posts Coming Soon!

Sign up to receive alerts when we publish more blog posts about how NLPatent has brought success to the field of IP and AI.
