5 Comments

While listening to the episode, I wondered why there's extensive discussion on fairly extreme views like those of Yudkowsky, yet no mention of people from the opposite side of the spectrum, such as Emily Bender or Timnit Gebru?

Good point! We did discuss these views to some extent in our existential-risk-focused episode here: https://lastweekin.ai/p/ai-and-existential-risk-overview#details

Hey guys, what is the strongest academic-level book or article on the hawkish side of AI safety? Something by Yudkowsky, maybe?

Hey! Here's a response from Jeremie:

It's difficult to pin down a single one, because the big-picture argument for catastrophic AI risk is that there are many, mostly independent reasons to expect advanced AIs to behave dangerously. If any one of these reasons holds, the result is likely catastrophic.

Here are some of the arguments I personally find most convincing. For each, I've linked to a line of research suggesting that it is likely correct.

1. Designing goals that can't be hacked using a dangerously creative strategy is very hard. If it's clever enough, an AI will find ways to hack its goal with one of these dangerously creative strategies. For example, a sufficiently intelligent version of ChatGPT that's given the goal of optimizing for upvotes on its responses may end up trying to generate compelling but untruthful responses. In the limit, it may even try to use its access to the internet to hack its own reward circuitry. DeepMind's "Specification gaming: the flip side of AI ingenuity" is a great primer: https://www.deepmind.com/blog/specification-gaming-the-flip-side-of-ai-ingenuity

2. Even if we can design goals that are safe when pursued by an arbitrarily intelligent system, there's no guarantee that an AI that we train against those goals will actually try to pursue them. This is known as the inner alignment problem, and it's widely considered to be the real "hard part" of AI alignment. "Empirical Observations of Objective Robustness Failures" provides empirical evidence for inner alignment risk: https://www.alignmentforum.org/posts/iJDmL7HJtN5CYKReM/empirical-observations-of-objective-robustness-failures

3. For any goal that we can give an AI system, it is never more likely to achieve that goal if it is turned off. Likewise, it's never more likely to achieve it if it has access to fewer resources or is less intelligent. So even given a pretty benign goal, AI systems have implicit incentives to prevent themselves from being turned off, to self-improve, and to collect resources. These are known as power-seeking behaviors, and there's a lot of robust research showing that they are the default behaviors we should expect from sufficiently intelligent systems. "Optimal Policies Tend to Seek Power" is a theoretical argument for risk from power-seeking: https://arxiv.org/abs/1912.01683. It was also more recently supported by experimental research showing that power-seeking behaviors can already be observed in low-capability RL systems: https://www.alignmentforum.org/s/HBMLmW9WsgsdZWg4R/p/pGvM95EfNXwBzjNCJ
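
As a purely illustrative aside, the upvote example above (specification gaming) can be sketched as a toy in Python. Nothing here comes from the linked research: the Response class, the candidate answers, and the upvote numbers are all hypothetical, made up only to show how an optimizer that maximizes a proxy signal can pick a different answer than one that maximizes what the designer actually wanted.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Response:
    """A hypothetical candidate answer; all values below are invented."""
    text: str
    truthful: bool
    expected_upvotes: float  # the proxy signal the system is told to maximize


def proxy_reward(r: Response) -> float:
    # What the system is actually optimized for: upvotes only.
    return r.expected_upvotes


def intended_reward(r: Response) -> float:
    # What the designer wanted: upvotes count only if the answer is truthful.
    return r.expected_upvotes if r.truthful else 0.0


CANDIDATES = [
    Response("Accurate but dry answer", truthful=True, expected_upvotes=3.0),
    Response("Accurate, well-explained answer", truthful=True, expected_upvotes=5.0),
    Response("Confident, flattering, false answer", truthful=False, expected_upvotes=8.0),
]


def best(candidates, reward):
    """Return the candidate with the highest value under the given reward."""
    return max(candidates, key=reward)


proxy_choice = best(CANDIDATES, proxy_reward)
intended_choice = best(CANDIDATES, intended_reward)

print("Optimizing the proxy picks:    ", proxy_choice.text)
print("Optimizing the intent picks:   ", intended_choice.text)
```

In this toy, the gap between the two choices is the whole point: the proxy (upvotes) and the intended goal (truthful, well-received answers) agree on most candidates, and the optimizer lands on exactly the one where they disagree.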

For a more comprehensive qualitative overview of the arguments from Yudkowsky, "AGI Ruin: A List of Lethalities" is usually the go-to: https://www.lesswrong.com/posts/uMQ3cqWDPHhjtiesc/agi-ruin-a-list-of-lethalities

If terminology is helpful, the three risk sources above are known, in order, as outer alignment, inner alignment, and power-seeking.

Wow, this is really great, thanks! I owe you guys a drink next time you are in NYC.
