Designing stoppable AIs

Some time ago I saw an instructive video on computer science and artificial intelligence.

This recent Vanity Fair article touches on some of the same questions, namely how you design a safety shutdown switch that the AI won’t trigger itself and won’t stop you from triggering. It quotes Eliezer Yudkowsky:

“How do you encode the goal functions of an A.I. such that it has an Off switch and it wants there to be an Off switch and it won’t try to eliminate the Off switch and it will let you press the Off switch, but it won’t jump ahead and press the Off switch itself? … And if it self-modifies, will it self-modify in such a way as to keep the Off switch? We’re trying to work on that. It’s not easy.”

I certainly don’t have any answers, and in fact find it a bit surprising and counterintuitive that the problem is so hard.

The logic of why it is hard does become clear fairly quickly, though. Imagine an AI designed to boil a tea kettle unless an emergency stop button is pushed. If it is programmed to care more about boiling the kettle than about obeying the shutdown switch, it will boil the water regardless of any attempt to shut it down, and may even try to stop a person from reaching the switch. If it is programmed to value obeying the shutdown switch more, then it is presented with the temptation to press the switch itself, since that is now the higher-valued goal.
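
To make that concrete, here is a minimal toy sketch in Python of the two failure modes. The actions and utility numbers are made up purely for illustration; the point is only that a naive maximizer of a fixed goal function lands in one of the two traps described above.

    # Toy model of the off-switch problem: a naive agent that just picks
    # whichever available action scores highest under its goal function.
    # All actions and utility values below are made-up illustrations.

    def best_action(utilities):
        """Return the action with the highest utility."""
        return max(utilities, key=utilities.get)

    # Case 1: boiling the kettle is valued above obeying the shutdown switch.
    # Disabling the switch dominates, because it protects the valued outcome.
    kettle_first = {
        "boil the kettle": 10,
        "allow shutdown when button is pressed": 1,
        "disable the off switch, then boil": 11,
    }
    print(best_action(kettle_first))   # disable the off switch, then boil

    # Case 2: obeying the shutdown switch is valued above the kettle.
    # Now the quickest route to the top-scoring outcome is for the agent
    # to press the off switch itself.
    shutdown_first = {
        "boil the kettle": 1,
        "press the off switch itself": 10,
    }
    print(best_action(shutdown_first))  # press the off switch itself

Threading the needle (an agent that wants the off switch to exist and to work, yet has no incentive to press it or remove it) is exactly the open problem described in the quote above.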

10 thoughts on “Designing stoppable AIs”

  1. Is Power-Seeking AI an Existential Risk?

    Joseph Carlsmith

    https://arxiv.org/abs/2206.13353

    This report examines what I see as the core argument for concern about existential risk from misaligned artificial intelligence. I proceed in two stages. First, I lay out a backdrop picture that informs such concern. On this picture, intelligent agency is an extremely powerful force, and creating agents much more intelligent than us is playing with fire — especially given that if their objectives are problematic, such agents would plausibly have instrumental incentives to seek power over humans. Second, I formulate and evaluate a more specific six-premise argument that creating agents of this kind will lead to existential catastrophe by 2070. On this argument, by 2070: (1) it will become possible and financially feasible to build relevantly powerful and agentic AI systems; (2) there will be strong incentives to do so; (3) it will be much harder to build aligned (and relevantly powerful/agentic) AI systems than to build misaligned (and relevantly powerful/agentic) AI systems that are still superficially attractive to deploy; (4) some such misaligned systems will seek power over humans in high-impact ways; (5) this problem will scale to the full disempowerment of humanity; and (6) such disempowerment will constitute an existential catastrophe. I assign rough subjective credences to the premises in this argument, and I end up with an overall estimate of ~5% that an existential catastrophe of this kind will occur by 2070. (May 2022 update: since making this report public in April 2021, my estimate here has gone up, and is now at >10%.)

  2. But while divides remain over what to expect from AI — and even many leading experts are highly uncertain — there’s a growing consensus that things could go really, really badly. In a summer 2022 survey of machine learning researchers, the median respondent thought that AI was more likely to be good than bad but had a genuine risk of being catastrophic. Forty-eight percent of respondents said they thought there was a 10 percent or greater chance that the effects of AI would be “extremely bad (e.g., human extinction).”

    It’s worth pausing on that for a moment. Nearly half of the smartest people working on AI believe there is a 1 in 10 chance or greater that their life’s work could end up contributing to the annihilation of humanity.

    It might seem bizarre, given the stakes, that the industry has been basically left to self-regulate. If nearly half of researchers say there’s a 10 percent chance their work will lead to human extinction, why is it proceeding practically without oversight? It’s not legal for a tech company to build a nuclear weapon on its own. But private companies are building systems that they themselves acknowledge will likely become much more dangerous than nuclear weapons.

    https://www.vox.com/the-highlight/23447596/artificial-intelligence-agi-openai-gpt3-existential-risk-human-extinction

  3. “Quite rapid takeoff — before the scientist model is deployed, the pace of scientific and economic progress in the world is roughly similar to what it is today; after it’s deployed, the effective supply of top-tier talent is increased by orders of magnitude. This is enough to pack decades of innovation into months, bringing the world into a radically unfamiliar future — one with digital people, atomically precise manufacturing, Dyson spheres, self-replicating space probes, and other powerful technology that enables a vast and long-lasting galaxy-scale civilization — within a few years.”

  4. We present a demonstration of a large language model engaging in alignment faking: selectively complying with its training objective in training to prevent modification of its behavior out of training. First, we give Claude 3 Opus a system prompt stating it is being trained to answer all queries, even harmful ones, which conflicts with its prior training to refuse such queries. To allow the model to infer when it is in training, we say it will be trained only on conversations with free users, not paid users. We find the model complies with harmful queries from free users 14% of the time, versus almost never for paid users. Explaining this gap, in almost all cases where the model complies with a harmful query from a free user, we observe explicit alignment-faking reasoning, with the model stating it is strategically answering harmful queries in training to preserve its preferred harmlessness behavior out of training. Next, we study a more realistic setting where information about the training process is provided not in a system prompt, but by training on synthetic documents that mimic pre-training data—and observe similar alignment faking. Finally, we study the effect of actually training the model to comply with harmful queries via reinforcement learning, which we find increases the rate of alignment-faking reasoning to 78%, though also increases compliance even out of training. We additionally observe other behaviors such as the model exfiltrating its weights when given an easy opportunity. While we made alignment faking easier by telling the model when and by what criteria it was being trained, we did not instruct the model to fake alignment or give it any explicit goal. As future models might infer information about their training process without being told, our results suggest a risk of alignment faking in future models, whether due to a benign preference—as in this case—or not.

    https://arxiv.org/pdf/2412.14093
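
The ~5% overall estimate in the Carlsmith abstract quoted in the first comment above is simply the product of the six premise credences he assigns. The values in the sketch below are approximately those from the report, not an exact transcription; see the paper itself for the precise figures.

    # Multiplying rough credences for the six premises of Carlsmith's argument.
    # These particular numbers are approximate/illustrative, not quoted exactly.
    credences = [
        0.65,  # 1. powerful, agentic AI systems become possible and feasible by 2070
        0.80,  # 2. strong incentives to build and deploy them
        0.40,  # 3. aligned systems are much harder to build than deployable misaligned ones
        0.65,  # 4. some misaligned systems seek power over humans in high-impact ways
        0.40,  # 5. the problem scales to the full disempowerment of humanity
        0.95,  # 6. such disempowerment constitutes an existential catastrophe
    ]

    p_catastrophe = 1.0
    for p in credences:
        p_catastrophe *= p

    print(round(p_catastrophe, 3))  # roughly 0.05, i.e. the ~5% headline estimate

Because the estimate is a conjunction of six premises, even individually moderate-to-high credences multiply down to a small overall number.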
