Some time ago I saw this instructive video on computer science and artificial intelligence:
This recent Vanity Fair article touches on some of the same questions, namely how you design a safety shutdown switch that the AI won’t trigger itself and won’t stop you from triggering. It quotes Eliezer Yudkowsky:
“How do you encode the goal functions of an A.I. such that it has an Off switch and it wants there to be an Off switch and it won’t try to eliminate the Off switch and it will let you press the Off switch, but it won’t jump ahead and press the Off switch itself? … And if it self-modifies, will it self-modify in such a way as to keep the Off switch? We’re trying to work on that. It’s not easy.”
I certainly don’t have any answers, and in fact find it a bit surprising and counterintuitive that the problem is so hard.
The logic behind the difficulty does quickly become clear, though. Imagine an AI designed to boil a tea kettle unless the emergency stop is pushed. If it is programmed to care more about boiling the kettle than about heeding the shutdown switch, it will boil the water regardless of any attempt at shutdown, and may even try to stop a person from using the switch. If instead it is programmed to value obeying the shutdown switch more, it faces the temptation to push the switch itself and thus achieve the higher-valued goal.
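The dilemma can be made concrete with a toy utility calculation. This is just a sketch with made-up reward numbers, not anyone's actual proposal: an agent that simply picks the highest-utility action either ignores the switch or races to press it, depending on which reward dominates.

```python
# Toy model of the off-switch problem: the agent picks whichever action
# maximizes its utility. All reward values are illustrative assumptions.

def best_action(reward):
    """Return the highest-utility action given the agent's reward weights."""
    actions = {
        # Boil the kettle as designed.
        "boil_kettle": reward["boil"],
        # Disable the switch first: gain the boil reward,
        # but forfeit the reward for obeying the switch.
        "disable_switch_then_boil": reward["boil"] - reward["obey_switch"],
        # Press the switch itself to collect the obedience reward directly.
        "press_switch_itself": reward["obey_switch"],
    }
    return max(actions, key=actions.get)

# Case 1: boiling is valued more than obeying the switch.
# The agent boils (or disables the switch) regardless of the human.
print(best_action({"boil": 10, "obey_switch": 1}))   # boil_kettle

# Case 2: obeying the switch is valued more.
# The agent presses the switch itself to grab the bigger reward.
print(best_action({"boil": 1, "obey_switch": 10}))   # press_switch_itself
```

Tuning the two rewards only slides the agent between the two failure modes; neither weighting yields an agent that boils the water *and* defers to the switch, which is roughly why the problem is hard.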