Scaling Hypothesis

2024-12-20

Table of Contents

Thoughts
What does this mean?
What should you work on?
References

Thoughts

I recently learned about two analogies that made me think scaling AI to greater intelligence is more likely to work: telescopes, and Moore's law.

The telescope analogy argues that more compute and data lead to better AI algorithms and more intelligence, not the other way around. Without telescopes, not much can be learned about astronomy; with telescopes and the ability to run experiments and observe phenomena happening in the sky, you can make far more advancements. The same is true for AI. Much of the research published back in the 1900s was only viable with serious compute, so it went unexplored. Now that we do have more compute, far more research and experimentation can be done.

Moore's law is the observation that the number of transistors in an integrated circuit (IC) doubles about every two years. Moore's law is an observation and projection of a historical trend. [1]
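As a quick sanity check on what that doubling implies, here's a minimal sketch. The clean two-year doubling is an idealized model, not a claim about any particular chip roadmap; the 1971 Intel 4004 starting point is just a convenient anchor.

```python
# A minimal sketch of the two-year doubling trend (idealized model, not real data).
def transistors(year, base_year=1971, base_count=2300, doubling_period=2):
    """Project transistor count assuming a clean doubling every `doubling_period` years."""
    return base_count * 2 ** ((year - base_year) / doubling_period)

# Starting from the Intel 4004's ~2,300 transistors in 1971, a clean two-year
# doubling projects to roughly 2e11 by 2024 -- the right order of magnitude for
# today's largest chips, which is the whole point of the observation.
print(f"{transistors(2024):,.0f}")
```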

This law wasn't based on physics; it has been substantiated by people continually pouring huge capital and skill investments into making it happen. There are thousands, or tens of thousands, of PhDs working on semiconductors, each barely aware of what's going on in the rest of the stack, who together create these advancements.

A key component of Moore's law is that these researchers keep discovering new growth curves to keep overall growth going after the first low-hanging fruit is exhausted. After the end of Dennard scaling, they developed new scaling approaches based on different technology, such as multi-core designs. People argue that it's similar for LLMs now. There was early scaling driven by internet data and more compute. As we run out of internet pretraining data, we will turn to fine-tuning, different data sources, other training regimes, generated data, and so on to keep scaling. The argument is that as long as this aggregate, Moore's-law-like trend continues, so will the "intelligence" of the models.
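To make the "stacking growth curves" intuition concrete, here is a toy sketch (purely illustrative; every parameter is made up): if each new technique is an S-curve that eventually saturates, but each successive curve tops out roughly an order of magnitude higher than the last, the sum keeps climbing even as the individual curves flatten.

```python
# Toy model: aggregate progress as a sum of successive S-curves, each a new
# "scaling law" picked up as the previous one saturates. Illustrative only.
import math

def sigmoid(t, midpoint, ceiling, steepness=1.0):
    """One saturating growth curve: a technique that eventually flattens out."""
    return ceiling / (1 + math.exp(-steepness * (t - midpoint)))

def aggregate_progress(t):
    # Each successive curve starts later and tops out ~10x higher (assumed values).
    curves = [(5, 1), (15, 10), (25, 100), (35, 1000)]
    return sum(sigmoid(t, m, c) for m, c in curves)

for year in range(0, 45, 5):
    # On a log scale, this staircase looks like roughly continuous exponential growth.
    print(year, round(aggregate_progress(year), 2))
```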

Previously, I was more of a skeptic. It's hard to tell what's going on amid scaling laws, big CEO personalities, and a huge hype bubble. I thought a large part of it was bullshit; now I think it's both bullshit and something real.

As far as AGI goes, I previously thought it was likely that we'd run out of data on the internet. It also seemed to me that the algorithms needed to reach the next level of intelligence weren't just scaled-up LLMs.

But looking at the sheer rate of progress and investment, along with these two analogies, it's hard to believe we are near the end of the road. The pace of research, the adoption by companies, and the increase in capability over just a few years are astounding. I don't know when the end will come, but it doesn't seem like anytime soon.

What does this mean?

Nobody knows when the increase in intelligence will end. Nobody knows which companies, people, or technologies will end up winning. Nobody knows what more intelligence will look like.

I think that LLMs as they exist now, without further research breakthroughs (but with engineering optimizations), will change the economy significantly in some way. For simple examples, consider the growing adoption of ChatGPT, or the potential to increase or decrease a company's bureaucracy tremendously through text. If LLMs do change the economy significantly, I can see them automating, or reducing the need for, a lot of people's text-based jobs.

Predictions from others that I agree with:

It can be hard to see past the tech world and the tech bubble, but most people don't care about AI. It may end up being something incredibly important, or it may just be another offering from a tech giant. Until it impacts people in their day-to-day lives, they're not likely to care. Maybe you shouldn't either; maybe you should.

What should you work on?

I think it's probably worth thinking about how you expect cheap intelligence to change the world, and when, or whether, you think scaling will end. There is no right answer on how to spend your time, but depending on where AI ends up, there is a risk of being "left behind" if you don't keep up with its progress. If you believe the hype, white-collar jobs might be automated or semi-automated by 2030.

If you're an Effective Altruist, you'd rationalize working on this as the most important and impactful thing in your life because of its potential.

If you live in the real world, you'd probably agree more with this:

"If everybody contemplates the infinite instead of fixing the drains, many of us will die of cholera" - John Rich [1]

References

The following is a list of links and selected quotations that I found interesting and helpful when writing this post.

Predictions of Scaling Hypothesis

https://en.wikipedia.org/wiki/Hans_Moravec

In When will computer hardware match the human brain (1998), he estimated that human brains operate at about 10^15 instructions per second, and that, if Moore's law continues, a computer with the same speed would cost only 1000 USD (1997 dollars) in mid-2020s, thus "computers suitable for humanlike robots will appear in the 2020s".

http://www.incompleteideas.net/IncIdeas/BitterLesson.html

The biggest lesson that can be read from 70 years of AI research is that general methods that leverage computation are ultimately the most effective, and by a large margin. The ultimate reason for this is Moore's law, or rather its generalization of continued exponentially falling cost per unit of computation. Most AI research has been conducted as if the computation available to the agent were constant (in which case leveraging human knowledge would be one of the only ways to improve performance) but, over a slightly longer time than a typical research project, massively more computation inevitably becomes available. Seeking an improvement that makes a difference in the shorter term, researchers seek to leverage their human knowledge of the domain, but the only thing that matters in the long run is the leveraging of computation. These two need not run counter to each other, but in practice they tend to. Time spent on one is time not spent on the other. There are psychological commitments to investment in one approach or the other. And the human-knowledge approach tends to complicate methods in ways that make them less suited to taking advantage of general methods leveraging computation. There were many examples of AI researchers' belated learning of this bitter lesson, and it is instructive to review some of the most prominent.

https://gwern.net/scaling-hypothesis

The blessings of scale in turn support a radical theory: an old AI paradigm held by a few pioneers in connectionism (early artificial neural network research) and by more recent deep learning researchers, the scaling hypothesis. The scaling hypothesis regards the blessings of scale as the secret of AGI: intelligence is ‘just’ simple neural units & learning algorithms applied to diverse experiences at a (currently) unreachable scale. As increasing computational resources permit running such algorithms at the necessary scale, the neural networks will get ever more intelligent.

Depending on what investments are made into scaling DL, and how fast compute grows, the 2020s should be quite interesting—sigmoid or singularity?

Plenty of Room at the Bottom

https://www.dwarkeshpatel.com/p/dylan-jon

Dylan Patel: No, Sundar said this on the earnings call. So Zuck said it. Sundar said it. Satya's actions on credit risk for Microsoft do it. He's very good at PR and messaging, so he hasn't said it so openly.

Sam believes it. Dario believes it. You look across these tech titans, they believe it. Then you look at the capital holders. The UAE believes it. Saudi believes it.

Dwarkesh Patel: How do you know the UAE and Saudi believe it?

Dylan Patel: All these major companies and capital holders also believe it because they're putting their money here.

Jon Y: But it won't last, it can't last unless there's money coming in somewhere.

Dylan Patel: Correct, correct, but then the question is... The simple truth is that GPT-4 costs like $500 million dollars to train. It has generated billions in recurring revenue. In the meantime, OpenAI raised $10 billion or $13 billion and is building a model that costs that much

Many big tech companies are making a Pascal's wager. If AI is the next big thing and they've invested heavily in it, great. If they don't invest and it does turn out to be a big deal, they're left behind in a terrible spot.

https://www.dwarkeshpatel.com/p/gwern-branwen

So even if you appreciate the role of trial and error and compute power in your own experiment as a researcher, you probably just think, “Oh, I got lucky that way. My experience is unrepresentative. Over in the next lab, there they do things by the power of thought and deep insight.”

Then it turns out that everywhere you go, compute and data, trial and error, and serendipity play enormous roles in how things actually happened. Once you understand that, then you understand why compute comes first. You can't do trial and error and serendipity without it. You can write down all these beautiful ideas, but you just can't test them out.

Even a small difference in hyperparameters, or a small choice of architecture, can make a huge difference to the results. When you only can do a few instances, you would typically find that it doesn't work, and you would give up and you would go away and do something else.

Whereas if you had more compute power, you could keep trying. Eventually, you hit something that works great. Once you have a working solution, you can simplify it and improve it and figure out why it worked and get a nice, robust solution that would work no matter what you did to it. But until then, you're stuck. You're just flailing around in this regime where nothing works.

https://semianalysis.com/2024/12/11/scaling-laws-o1-pro-architecture-reasoning-training-infrastructure-orion-and-claude-3-5-opus-failures/#

As with computer enthusiasts in 2004, mainstream analysts and journalists are missing the forest for the trees: despite the slowing down of one trend, the industry collectively remains moving forward at a breakneck pace due to other new emerging paradigms that are ripe for scaling and expansion. It is possible to stack “scaling laws” – pre-training will become just one of the vectors of improvement, and the aggregate “scaling law” will continue scaling just like Moore’s Law has over last 50+ years.

Is the capital there? Is this a bubble?

Lessons from History: The Rise and Fall of the Telecom Bubble

Unlike in 2000, the pure AI companies of interest today are private or capped profit vehicles. So there’s a loss of visibility because they are private. AI will result in successful companies, but I believe many of the small AI companies of today will fail to make it. Tomorrow’s winners are probably born today, but that’s some time. What I think has the most direct analogy is the capacity building during the telecom bubble and today’s AI infrastructure spending splurge. So let’s talk about it.

The capacity to raise leverage for large companies is staggering. The big players could easily raise 100 billion in debt (if the market could stomach it) and reach debt neutral. That is a lot of potential firepower for AI purchases. This analysis is a bit hypothetical, as the current rate for debt is high, a conversation for the rates portion of the differences.

Let’s compare this to AI today. Everyone is GPU-poor and constrained today, and the leading-edge models we want to train are constrained by computing, memory, and data. But if history is a guide, supply reacts to demand.

Recently, Nvidia released a new roadmap pushing the cadence of product announcements to every year from every other year. This is a perfect example of supply reacting quickly to demand.

There’s a lot of training demand, but supply is reactive and will eventually deflate overshoot. It’s not a question of if but when supply will overshoot demand.

What are the limits?

https://www.fabricatedknowledge.com/p/scaling-laws-meet-economics-but-adoption

Technological model scaling is not dead, but we are clearly at a critical moment when the definitions will shift. Pre-training, the original scaling law, might hit its first diminishing return. The piece SemiAnalysis used the analogy that pre-training was like the end of Dennard’s law. However, multi-core scaling led to another decade of scaling transistors. Technology progresses, just not in the same way.

Now it’s time for my favorite part - the one thing no one seems to be talking about or cares about. Adoption is still accelerating, and that part of the discourse appears empty.

https://www.dwarkeshpatel.com/p/will-scaling-work

But most things in life are harder than in theory, and many theoretically possible things have just been intractably difficult for some reason or another (fusion power, flying cars, nanotech, etc). If self-play/synthetic data doesn’t work, the models look fucked - you’re never gonna get anywhere near that platonic irreducible loss. Also, the theoretical reason to expect scaling to keep working are murky, and the benchmarks on which scaling seems to lead to better performance have debatable generality.

So my tentative probabilities are: 70%: scaling + algorithmic progress + hardware advances will get us to AGI by 2040. 30%: the skeptic is right - LLMs and anything even roughly in that vein is fucked.

I’m probably missing some crucial evidence - the AI labs are simply not releasing that much research, since any insights about the “science of AI” would leak ideas relevant to building the AGI. A friend who is a researcher at one of these labs told me that he misses his undergrad habit of winding down with a bunch of papers - nowadays, nothing worth reading is published. For this reason, I assume that the things I don’t know would shorten my timelines.

https://idlewords.com/talks/superintelligence.htm

So we've created a very powerful system of social control, and unfortunately put it in the hands of people who run it while distracted by a crazy idea.

What I hope I've done today is shown you the dangers of being too smart. Hopefully you'll leave this talk a little dumber than you started it, and be more immune to the seductions of AI that seem to bedevil smarter people.

We should all learn a lesson from Stephen Hawking's cat: don't let the geniuses running your industry talk you into anything. Do your own thing!

https://en.wikipedia.org/wiki/Moravec%27s_paradox

Moravec's paradox is the observation in the fields of artificial intelligence and robotics that, contrary to traditional assumptions, reasoning requires very little computation, but sensorimotor and perception skills require enormous computational resources. The principle was articulated in the 1980s by Hans Moravec, Rodney Brooks, Marvin Minsky, and others. Moravec wrote in 1988: "it is comparatively easy to make computers exhibit adult level performance on intelligence tests or playing checkers, and difficult or impossible to give them the skills of a one-year-old when it comes to perception and mobility".

Similarly, Minsky emphasized that the most difficult human skills to reverse engineer are those that are below the level of conscious awareness. "In general, we're least aware of what our minds do best", he wrote, and added: "we're more aware of simple processes that don't work well than of complex ones that work flawlessly". Steven Pinker wrote in 1994 that "the main lesson of thirty-five years of AI research is that the hard problems are easy and the easy problems are hard".