"It seems to sort of work most of the time but we don't really know why. Maybe if we throw more data and compute (i.e. $$$) at it the Holy Grail of AI will magically emerge (again, without us really understanding how or why it works)." Have I summarised correctly...?

Thanks! So like Andrey said at this point this summary feels uncharitable--but you are capturing the sense in which there is a tendency for people to look at deep learning as "just stack more layers [or throw more data] until the thing gets the right performance." But as Andrey said, quite a bit of work has begun to examine what precisely is happening in these over-parameterized / stupidly large neural nets, the dynamics of their behavior, how precisely the data and compute weigh on performance, etc.

Andrey posted a great source, and some more things you might look at in particular (we cited at least some of these):

That's pretty uncharitable, actually. There is a lot of work on understanding why/how deep learning works in generally. Scaling laws are by their nature empirical rather than theoretical discoveries, since they are connected to a specific model architecture and task. But even there, there is some really cool work being done to understand how and why scaling laws work (see eg https://www.anthropic.com/#papers).

What about "Broken Neural Scaling Laws"?

https://arxiv.org/abs/2210.14891

We published this piece a few months before that paper came out :)

"It seems to sort of work most of the time but we don't really know why. Maybe if we throw more data and compute (i.e. $$$) at it the Holy Grail of AI will magically emerge (again, without us really understanding how or why it works)." Have I summarised correctly...?

Thanks! As Andrey said, at this point this summary feels uncharitable, but you are capturing the tendency for people to look at deep learning as "just stack more layers [or throw more data] until the thing gets the right performance." That said, quite a bit of work has begun to examine what precisely is happening in these over-parameterized / stupidly large neural nets, the dynamics of their behavior, how precisely the data and compute weigh on performance, etc.

Andrey posted a great source; here are some more things you might look at in particular (we cited at least some of these):

- Chinchilla (https://www.deepmind.com/publications/an-empirical-analysis-of-compute-optimal-large-language-model-training)

- Deep Double Descent (https://openai.com/blog/deep-double-descent/)

- (bit older) Scaling Laws for Neural Language Models (https://arxiv.org/abs/2001.08361) and associated papers

That's pretty uncharitable, actually. There is a lot of work on understanding why/how deep learning works in general. Scaling laws are by their nature empirical rather than theoretical discoveries, since they are connected to a specific model architecture and task. But even there, there is some really cool work being done to understand how and why scaling laws work (see e.g. https://www.anthropic.com/#papers).