September 14, 2024 (updated January 7, 2025)

AI Scaling Laws and the Race to AGI – Part 1

Outline

  • Nvidia’s growth tied to AI Scaling Laws
  • The mad scramble for AGI
  • Evolution of GPT Models based on Parameters (N), Budget (C) and Dataset Size (D)
  • Kaplan & Chinchilla Scaling Laws – What do they tell us?
  • Does throwing money and GPUs at LLMs make them better?

Overview

Some analysts are saying that the markets today are like 1996 – the promise of a magical era where valuations didn’t matter as long as you dared to dream big and believe in the power of the emerging World Wide Web (Internet). Others say that a rude awakening awaits us and the markets are like 2000 – the year when the biggest bubble in financial history finally met a pin.

Implicit in both views is the consensus that we are in the midst of another bubble. Inevitable comparisons are drawn between Cisco and Nvidia, the flagbearers of their respective bull runs.

There is no debate, however, on Nvidia’s financial performance. Over 3.5 years, Nvidia’s stock price has gone up by 8.4x. Revenues have gone up 3.7x over the same period. Profits, however, have largely kept pace with the stock price, growing 7x.

The big question remains – How long can Nvidia maintain these astronomical growth rates?

The answer is surprisingly straightforward. The demand for Nvidia’s chips depends on Artificial Intelligence (AI) Scaling Laws. And if AI Scaling holds up, we will eventually achieve Artificial General Intelligence (AGI).

AI Scaling and AGI

The topic of AGI evokes strong emotions. Proponents paint a Shangri-La future where humanity is free from the tyranny of the 9-5 work week and people can aspire to greater things. The naysayers portray a dystopian future where humans are subservient to intelligent machines.

It would be fair to say that most AI companies are working towards models with AGI capabilities. This includes naysayers such as Elon Musk and his company xAI. OpenAI has a stated goal of creating AGI in its Charter.

OpenAI’s mission is to ensure that artificial general intelligence (AGI)—by which we mean highly autonomous systems that outperform humans at most economically valuable work—benefits all of humanity.

OpenAI Charter

Mark Zuckerberg has also pivoted to the goal of AGI.

“We’ve come to this view that, in order to build the products that we want to build, we need to build for general intelligence,”

Mark Zuckerberg to The Verge

Today’s AI models are ‘narrow’ in scope, and their utility hits a brick wall when they encounter novel situations. Reliability has also been a concern, as evidenced by early LLMs failing to calculate 2 plus 2. The plan is to improve LLMs by increasing the stacks of Graphics Processing Unit (GPU) chips on which they are trained. The hope is that at some point, LLMs will reach AGI levels of competence – we just have to keep piling on more chips.

This is a simplistic explanation of AI Scaling Laws. So far, throwing more compute (chips) at LLM training has worked.

But how long will the improvements in the performance of LLMs and other AI models last? At what stage will diminishing returns set in? Will we reach AGI before the Law of Diminishing Returns? These are some of the more difficult questions.

While there are no clear answers around AI Scaling, the stakes are high. Nvidia’s market capitalization of around $3 trillion reflects the underlying expectation that Scaling will lead to improvement in AI models. The occasional sharp drawdown in the share price also reflects the uncertainty about the future of scaling.

Key Characteristics of a Neural Network & GPT Evolution

We all know that LLMs have become better. To see how far they have improved, let’s look at:

a) Key Characteristics that define the AI Models such as ChatGPT

b) Performance of AI Models with an increase in Parameter counts

Neural Network-based AI Models such as ChatGPT can be defined by 4 key characteristics:

  • Size of the model – Number of Parameters (N)
  • Size of the training dataset – Dataset size on which the model is trained (D)
  • Cost of training – Hardware and software involved in training the model (C)
  • Error rate after training – Also called Loss (L)

Let’s look at the evolution of GPT-based AI Models.

GPT Model Scaling and GPU Usage

GPT Version | Year     | Model Size (N) – Parameters | Training Dataset Size (D)  | Training Cost (C) | Number of GPUs for Training
GPT-2       | Feb 2019 | 1.5 billion                 | 4.5 GB; 40 billion tokens  | "$256 per hour"   | Guesstimate of 92 Nvidia V100 GPUs
GPT-3       | Jun 2020 | 175 billion                 | 17 GB; 300 billion tokens  | $4.6M – $12M      | 405-1,024 Nvidia V100 GPUs
GPT-4       | Mar 2023 | 1.8 trillion                | 45 GB; 13 trillion tokens  | $100 million      | 25,000 Nvidia A100 GPUs

What is immediately apparent is that Model Sizes are increasing dramatically. GPT-4 is bigger than GPT-2 and GPT-3 by roughly 1,200x and 10x respectively in terms of Model Size/Parameters (N).

The good news for Nvidia is that the need for its GPUs is increasing at a greater pace than LLM Model Size. For a 10x increase in Parameters from GPT-3 to GPT-4, the training cost went up by anywhere between 8x and 22x!

Interestingly, the increase in equivalent GPUs needed to handle that 10x jump in Parameters between GPT-3 and GPT-4 was much higher still: roughly 83x on the lower side and up to 210x on the higher side. So Jensen Huang’s inside joke of CEO math does hold true to an extent.

“The more you buy, the more you save”
(wink, wink)

Jensen Huang, CEO of Nvidia

(The Nvidia A100 GPU used for GPT-4 is approximately 3.4x faster than the older V100 GPU used to train GPT-3. Hence fewer A100s are needed than the number of V100s that would be required for the same work.)
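
For readers who want to reproduce the ratios quoted above, here is a minimal back-of-the-envelope sketch in Python. It uses only the rough figures from the table and the ~3.4x A100-vs-V100 speedup mentioned in the note above, so the outputs are estimates, not precise numbers.

    # Back-of-the-envelope check of the GPT-3 -> GPT-4 scale-up ratios quoted above.
    # All inputs are the rough estimates from the table.

    GPT3_PARAMS, GPT4_PARAMS = 175e9, 1.8e12
    GPT3_COST_LOW, GPT3_COST_HIGH = 4.6e6, 12e6   # USD
    GPT4_COST = 100e6                             # USD
    GPT3_GPUS_LOW, GPT3_GPUS_HIGH = 405, 1024     # V100s
    GPT4_GPUS = 25_000                            # A100s
    A100_VS_V100_SPEEDUP = 3.4                    # approximate per-GPU speedup

    param_ratio = GPT4_PARAMS / GPT3_PARAMS                               # ~10x
    cost_ratio = (GPT4_COST / GPT3_COST_HIGH, GPT4_COST / GPT3_COST_LOW)  # ~8x to ~22x

    # Express GPT-4's A100 count in V100 equivalents before comparing GPU counts.
    v100_equivalents = GPT4_GPUS * A100_VS_V100_SPEEDUP                   # ~85,000 V100s
    gpu_ratio = (v100_equivalents / GPT3_GPUS_HIGH,
                 v100_equivalents / GPT3_GPUS_LOW)                        # ~83x to ~210x

    print(f"Parameters: ~{param_ratio:.0f}x")
    print(f"Training cost: ~{cost_ratio[0]:.0f}x to ~{cost_ratio[1]:.0f}x")
    print(f"GPUs (V100-equivalent): ~{gpu_ratio[0]:.0f}x to ~{gpu_ratio[1]:.0f}x")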

More money = better AI?

Can we throw more money at AI and get to AGI faster? The Hyperscalers are saying ‘Hell yeah’.

Over the next 12 months, the four hyperscalers will be spending almost a quarter of a trillion dollars on capex to build out the infrastructure needed for all that horsepower coming through cloud data centers — these AI training and inference engines.

Source, 19 Aug 2024

AI Scaling Laws

The Hyperscalers are confidently increasing their AI Capex because of one simple reason – AI Scaling Laws.

In machine learning, a neural scaling law is an empirical scaling law that describes how neural network performance changes as key factors are scaled up or down. These factors typically include the number of parameters, training dataset size, and training cost.

Wikipedia definition of Neural scaling law

This means that there is a correlation between the key factors that define an AI Model/LLM:

  • Number of Parameters – N, typically measured in billions of parameters
  • Training dataset size – D, usually measured in gigabytes (GB) of training data or in tokens
  • Training cost – C, the dollar value spent on training the model

There are various mathematical equations or Scaling Laws that show the relationship between the key factors above.

  • Kaplan Scaling Laws
  • Chinchilla Scaling Law
  • Mosaic Scaling Laws
  • DeepSeek Scaling Laws
  • Tsinghua Scaling Laws
  • Llama Scaling Laws

All these Scaling Laws clearly show the following:

  • Positive correlation: There is a positive correlation between the various key factors of an LLM (parameters, dataset size and cost). Increasing any one of the key factors should result in the need to scale up the other factors to improve model performance.
    For example, according to the Chinchilla Scaling Law, “for every doubling of model size the number of training tokens should also be doubled.” (A small worked sketch of this relationship follows this list.)
  • Optimum balance: An optimum balance can be determined for all the key factors, beyond which increasing one of the factors while keeping the others static will not improve the model.
  • Power Law: All the Scaling Laws are Power Laws. This confirms what we suspected all along – the AI models can be scaled up to a significant extent but at some point, diminishing returns will set in.
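
To make the Chinchilla relationship concrete, here is a minimal sketch in Python. It leans on two commonly cited rules of thumb that are not from this post: training compute of roughly 6 × N × D FLOPs for a transformer, and a compute-optimal ratio of roughly 20 training tokens per parameter. Treat it as an illustration of the shape of the law, not a planning tool.

    # Minimal sketch of Chinchilla-style compute-optimal allocation.
    # Assumptions (not from this post): training compute C ~ 6 * N * D FLOPs,
    # and the commonly cited rule of thumb of ~20 training tokens per parameter.

    def chinchilla_optimal(compute_flops: float) -> tuple[float, float]:
        """Return (parameters N, training tokens D) that roughly balance a compute budget."""
        tokens_per_param = 20.0
        # C ~ 6 * N * D and D ~ 20 * N  =>  C ~ 120 * N^2
        n_params = (compute_flops / (6.0 * tokens_per_param)) ** 0.5
        n_tokens = tokens_per_param * n_params
        return n_params, n_tokens

    for budget in (1e21, 1e23, 1e25):  # illustrative FLOP budgets
        n, d = chinchilla_optimal(budget)
        print(f"C = {budget:.0e} FLOPs -> N ~ {n:.1e} params, D ~ {d:.1e} tokens")

The point of the exercise is the coupling: doubling the model size calls for doubling the training tokens, so the compute (and dollar) bill grows roughly fourfold with every doubling of N.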

Are LLMs getting any better?

LLMs like ChatGPT are increasing exponentially in ‘size’. That’s great news for Nvidia but are the LLMs becoming any better?

We can measure LLM performance primarily based on the following ‘Evaluations’:

  • Quality
  • Price
  • Speed

Additional ‘Evaluations’ to be taken into consideration are:

  • Latency
  • Context Window

We obviously know that LLMs are way better now than when they were originally introduced. They can do basic math and 1+1 is 2 for sure.

Let’s compare the recent versions of GPT to see the improvements.

GPT Versions Performance
(Source – Artificial Analysis)

Model                                 | Quality (higher is better) | Speed (output tokens per second) | Price (USD per 1M tokens)
GPT-4o (Aug 2024 version)             | 77                         | 102                              | 0.8
GPT-4o mini (Jul 2024 version)        | 65                         | 132                              | 0.5
GPT-3.5 Turbo (16k version, Jun 2023) | 53                         | 83                               | 0.8

We can immediately see that there is a marked improvement in ‘quality’ of the output (as measured by Artificial Analysis). This is evident even within the recent GPT-4o family, where the full Aug 2024 model scores well above the smaller mini variant released a month earlier. However, the higher quality comes at the cost of speed. In most cases, the reduced speed is an acceptable trade-off: most people would rather wait a little than receive an incorrect or unhelpful response. (Something the next GPT version addresses by giving the model time to ‘think’ about its response.)
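
As a rough illustration of what those numbers mean in practice, the sketch below (Python) estimates the generation time and output-token cost of a hypothetical 1,000-token response for each model, using the figures from the table above. Real bills also include input tokens, which are priced separately, so treat this as an approximation.

    # Illustrative only: what the table's speed and price figures imply for a
    # hypothetical 1,000-token response (output tokens only).

    models = {
        # name: (quality_index, output_tokens_per_second, usd_per_1M_tokens)
        "GPT-4o (Aug 2024)":        (77, 102, 0.8),
        "GPT-4o mini (Jul 2024)":   (65, 132, 0.5),
        "GPT-3.5 Turbo (Jun 2023)": (53, 83, 0.8),
    }

    OUTPUT_TOKENS = 1_000  # hypothetical response length

    for name, (quality, tokens_per_sec, usd_per_million) in models.items():
        seconds = OUTPUT_TOKENS / tokens_per_sec
        cost = OUTPUT_TOKENS / 1e6 * usd_per_million
        print(f"{name}: quality {quality}, ~{seconds:.1f} s, ~${cost:.4f} per response")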

Other LLMs also exhibit significant improvements in performance and reliability when measured against benchmarks such as MMLU, BIG-bench, and HumanEval.
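
Mechanically, benchmarks like MMLU boil down to scoring a model’s answers against a fixed answer key. The toy snippet below (Python) shows that basic accuracy calculation; the questions and the ask_model stand-in are made up for illustration and are not any benchmark’s actual harness.

    # Toy benchmark scoring: accuracy = fraction of exact-match answers.
    # ask_model is a hypothetical stand-in for a real LLM API call.

    def ask_model(question: str) -> str:
        return "B"  # placeholder answer; a real harness would query the model here

    eval_set = [
        # (multiple-choice question, correct answer) -- made-up examples
        ("2 + 2 = ?  A) 3  B) 4  C) 5", "B"),
        ("Capital of France?  A) Rome  B) Paris  C) Madrid", "B"),
    ]

    correct = sum(ask_model(q).strip() == answer for q, answer in eval_set)
    print(f"Accuracy: {correct / len(eval_set):.0%}")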

An improvement in Quality (as measured by the Artificial Analysis Quality Index) in GPT Models from 53 to 77 within the space of about a year looks impressive. To fully understand the possibilities that open up with such an increase in Model ‘Quality’, consider the following examples:

  • United States Medical Licensing Examination (USMLE): GPT-4o comfortably vaulted over the passing score by over 20 points. Earlier versions couldn’t pass this test. It also outperformed specialized medical models like Med-PaLM1. (Source)
  • Uniform Bar Exam (UBE): GPT-4o scored in the 90th percentile on the UBE. GPT-3 scored in the bottom 10 percent. (Source)

Medicine and Law are two of the most prestigious professions. The ability of LLMs to outscore the large majority of their human counterparts on these exams shows that AI is no longer a parlour trick.

It’s clear that increases in LLM ‘size’ are improving their utility.

“One of the things we had learned from research is that the larger the model, the more data you have and the longer you can train, the better the accuracy of the model is,”

Nidhi Chappell, Microsoft head of product for Azure high-performance computing and AI, Source

Conclusion

We have seen GPT models go from 1.5 billion Parameters to 1.8 trillion Parameters. The early models cost less than a million dollars to train, but by the time GPT-4 arrived, Training Costs (C) were estimated at $100 million.

If you thought GPT-4 was expensive, you ain’t seen nothing yet! Anthropic CEO Dario Amodei says training costs could reach $100 billion soon.

Amodei said AI training costs will hit the $10 billion and $100 billion mark over the course of “2025, 2026, maybe 2027”, repeating his prediction that $10 billion models could start appearing sometime in the next year.

Source

The increase in Model Sizes has been possible because GPU performance and deployments have scaled up exponentially to meet the demands of ravenous LLMs. To everyone’s relief, all this extra spend on improving LLMs and GPU hardware is yielding better-quality responses from the models, which keeps the dream of AGI alive.

In Part 2, we will look at the AI Scaling Laws and ascertain the stage where diminishing returns could set in. Will this happen before AGI? That is the trillion dollar question.


Disclaimer: The information provided on this blog is for general informational purposes only and should not be construed as professional financial advice. The author(s) of this blog may hold positions in the securities mentioned. Past performance is not indicative of future results. The author(s) make no representations as to the accuracy, completeness, suitability, or validity of any information on this site and will not be liable for any errors, omissions, or any losses, injuries, or damages arising from its display or use.

By using this blog, you agree to the terms of this disclaimer and assume full responsibility for any actions taken based on the information provided herein.
