Designing stoppable AIs

Some time ago I saw this instructive video on computer science and artificial intelligence:

This recent Vanity Fair article touches on some of the same questions, namely how you design a safety shutdown switch that the AI won’t trigger itself and won’t stop you from triggering. It quites Eliezer Yudkowsky:

“How do you encode the goal functions of an A.I. such that it has an Off switch and it wants there to be an Off switch and it won’t try to eliminate the Off switch and it will let you press the Off switch, but it won’t jump ahead and press the Off switch itself? … And if it self-modifies, will it self-modify in such a way as to keep the Off switch? We’re trying to work on that. It’s not easy.”

I certainly don’t have any answers, and in fact find it a bit surprising and counterintuitive that the problem is so hard.

The logic does fairly quickly become straightforward though. Imagine an AI designed to boil a tea kettle unless the emergency stop is pushed. If it is programmed to care more about starting the kettle than paying attention to the shutdown switch, then it will choose to boil water regardless of attempts at shutdown, or even to try to stop a person from using the switch. If it is programmed to value obeying the shutdown switch more then it becomes presented with the temptation to push the switch itself and thus achieve a higher value goal.

8 thoughts on “Designing stoppable AIs”

  1. Is Power-Seeking AI an Existential Risk?

    Joseph Carlsmith

    This report examines what I see as the core argument for concern about existential risk from misaligned artificial intelligence. I proceed in two stages. First, I lay out a backdrop picture that informs such concern. On this picture, intelligent agency is an extremely powerful force, and creating agents much more intelligent than us is playing with fire — especially given that if their objectives are problematic, such agents would plausibly have instrumental incentives to seek power over humans. Second, I formulate and evaluate a more specific six-premise argument that creating agents of this kind will lead to existential catastrophe by 2070. On this argument, by 2070: (1) it will become possible and financially feasible to build relevantly powerful and agentic AI systems; (2) there will be strong incentives to do so; (3) it will be much harder to build aligned (and relevantly powerful/agentic) AI systems than to build misaligned (and relevantly powerful/agentic) AI systems that are still superficially attractive to deploy; (4) some such misaligned systems will seek power over humans in high-impact ways; (5) this problem will scale to the full disempowerment of humanity; and (6) such disempowerment will constitute an existential catastrophe. I assign rough subjective credences to the premises in this argument, and I end up with an overall estimate of ~5% that an existential catastrophe of this kind will occur by 2070. (May 2022 update: since making this report public in April 2021, my estimate here has gone up, and is now at >10%.)

  2. But while divides remain over what to expect from AI — and even many leading experts are highly uncertain — there’s a growing consensus that things could go really, really badly. In a summer 2022 survey of machine learning researchers, the median respondent thought that AI was more likely to be good than bad but had a genuine risk of being catastrophic. Forty-eight percent of respondents said they thought there was a 10 percent or greater chance that the effects of AI would be “extremely bad (e.g., human extinction).”

    It’s worth pausing on that for a moment. Nearly half of the smartest people working on AI believe there is a 1 in 10 chance or greater that their life’s work could end up contributing to the annihilation of humanity.

    It might seem bizarre, given the stakes, that the industry has been basically left to self-regulate. If nearly half of researchers say there’s a 10 percent chance their work will lead to human extinction, why is it proceeding practically without oversight? It’s not legal for a tech company to build a nuclear weapon on its own. But private companies are building systems that they themselves acknowledge will likely become much more dangerous than nuclear weapons.

  3. “Quite rapid takeoff — before the scientist model is deployed, the pace of scientific and economic progress in the world is roughly similar to what it is today; after it’s deployed, the effective supply of top-tier talent is increased by orders of magnitude. This is enough to pack decades of innovation into months, bringing the world into a radically unfamiliar future10 — one with digital people, atomically precise manufacturing, Dyson spheres, self-replicating space probes, and other powerful technology that enables a vast and long-lasting galaxy-scale civilization — within a few years.”

Leave a Reply

Your email address will not be published. Required fields are marked *