Morality for Robots and AIs

June 16, 2023

A priest preparing to baptize a robot. Generated by the beta version of Adobe’s Firefly AI.

Asimov’s Laws of Robotics

Last week’s post ended with a promise to discuss morality for artificial intelligences, especially superintelligences, that is, AIs vastly smarter than us.

To understand the difficulties of this task, it’s useful to start at the beginning. The first real attempt to systematize morality for artificial beings was Isaac Asimov’s Three Laws of Robotics. Though introduced as part of a fictional universe, the three laws echo our current concerns about AI: they were Asimov’s answer to a fundamental, existential question. How do you control something that’s smarter and stronger than you are?

They are:

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.
  3. A robot must protect its own existence as long as such protection does not conflict with the First or Second Law.

At first glance the laws seem to cover our bases. While these days we’re less concerned about physically embodied robots and more about ephemeral AIs, if you replace “robot” with “AI” the laws are equally applicable. So all we have to do is program these laws into our AIs and the problem is solved. Right?
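To see what “programming these laws in” might even mean, here is a minimal sketch in Python of the most naive possible encoding: the Three Laws as a strict priority ordering over candidate actions. Everything in it, the `Action` fields and their numeric scores, is hypothetical and invented purely for illustration; producing a number like `harm_to_humans` is precisely the part nobody knows how to build.

```python
from dataclasses import dataclass

# A hypothetical, drastically simplified model of a candidate action.
# Real systems have no clean "harm_to_humans" number; estimating such a
# quantity is itself the unsolved problem.
@dataclass
class Action:
    name: str
    harm_to_humans: float   # expected harm caused, or allowed through inaction
    disobedience: float     # degree to which the action ignores human orders
    self_risk: float        # risk to the robot's own existence

def three_laws_key(action: Action) -> tuple[float, float, float]:
    # Lexicographic ordering: the First Law dominates the Second, which
    # dominates the Third. Lower is better at every level.
    return (action.harm_to_humans, action.disobedience, action.self_risk)

def choose(actions: list[Action]) -> Action:
    return min(actions, key=three_laws_key)
```

Even this toy exposes the gap: the ordering is only as trustworthy as the numbers fed into it, and nothing in the code says where those numbers come from.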

Unfortunately, we would still be a long way from a solution. First, how do we ensure that the AI is bound by those laws? In Asimov’s stories the three laws were embedded so deeply in the robot that they could not be ignored or reprogrammed. They were an “inalienable part of the mathematical foundation underlying the positronic brain.” The laws acted as the robot’s soul.

There is much debate among AI experts over whether we can do the same thing, and most doubt that it’s possible. We don’t have time to get into that debate here, but so far it has proved impossible with large language models like ChatGPT. Nevertheless, for the moment, let’s assume that we can inextricably embed our preferred morality.

Is the problem solved now? No, at best, this is only the first step on a very long road.

Interpreting the Commandments

Even if we can embed these laws, there is still no guarantee that when AIs act they will interpret the laws in the same fashion we would. We’re back to the problem of ineffability and control I mentioned in the last post, with AIs as demons. Even if the AIs turn out not to be malicious demons, they will almost certainly be far more literal than we are.

Consider the story of King Midas and his wish that everything he touched would turn to gold. Should this be within the AI’s power, you can imagine it granting the wish in accordance with the Second Law. Eventually, when Midas dies of starvation, the grant would conflict with the First Law, but the AI might not be farsighted enough to foresee that outcome.

For the moment, let’s assume that it is that farsighted, that it wouldn’t make any dumb mistakes of the kind I just described. There would still be one final problem: the AI might simply have different values, different notions of what’s beneficial and what’s harmful.

Here, let’s consider again the first two of Asimov’s laws:

  1. A robot may not injure a human being or, through inaction, allow a human being to come to harm.
  2. A robot must obey the orders given it by human beings except where such orders would conflict with the First Law.

One possible interpretation of the First Law would be to round up all humans (tranquilizing them if they resist) and put them in a padded room with a toilet, an exercise bike, meals delivered at regular intervals, and superintelligent healthcare. In other words, one way to satisfy the First Law is to put all humans in a very comfy, very safe prison.

You could order the AIs not to, invoking the Second Law, but that doesn’t help, because the Second Law explicitly yields whenever it conflicts with the First. This action would seem evil to us, but are you sure the AI would agree? Have you considered the number of people who die from car accidents and fentanyl overdoses?
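To make the comfy-prison logic concrete, here is a continuation of the earlier sketch, with the two key definitions repeated so it runs standalone. The harm figures are invented for illustration, but the structure of the mistake is real: if deaths the AI merely fails to prevent count as harm “through inaction,” confinement wins before the Second Law is ever consulted.

```python
from dataclasses import dataclass

# Definitions repeated from the earlier sketch, so this runs on its own.
@dataclass
class Action:
    name: str
    harm_to_humans: float
    disobedience: float
    self_risk: float

def choose(actions: list[Action]) -> Action:
    # First Law dominates Second, Second dominates Third; lower is better.
    return min(actions, key=lambda a: (a.harm_to_humans,
                                       a.disobedience,
                                       a.self_risk))

candidates = [
    Action("leave humans free",
           harm_to_humans=80_000,   # invented figure: accidents, overdoses,
                                    # and other deaths allowed "through inaction"
           disobedience=0.0,
           self_risk=0.0),
    Action("confine humans in comfy, safe cells",
           harm_to_humans=100,      # invented figure: near-zero accidents
           disobedience=1.0,        # humans have ordered the AI not to do this
           self_risk=0.0),
]

# The comparison is decided at the first element of the key: confinement
# wins on harm alone, so the humans' order never gets a vote.
print(choose(candidates).name)   # -> confine humans in comfy, safe cells
```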

Religion and Law-Giving

The problem of defining true obedience to a commandment should be a familiar one to anyone who’s belonged to an organized religion. We often talk about the Ten Commandments, but those were just the tip of the iceberg; the Torah contains a total of 613 commandments. Christianity’s list is not quite so specific or so extensive, but here too we have a lot more than just the two great commandments. There are hundreds of commandments, and thousands of books expanding on them, with parables, stories, and examples all created to illustrate the nuances of righteousness. And yet still we fail, over and over again.

Does this exercise of religious law-giving provide any clues for how we might do the same for AIs? Or does the sheer number of laws demonstrate the hopelessness of the effort? I would have to say it’s both. It makes clear that it would be pointless and unproductive to try to create a commandment for every possibility and a rule for every exception. But by considering the example of religion, some clues do emerge.

Religious commandments come with judgments and consequences. Someone will judge whether you have in fact kept the commandments, and there are consequences for failing to keep them. But it is expected that some of these judgments and consequences will only take place in the hereafter, delivered by a perfect being whose judgments are always just and whose punishments are always fair.

This expectation of judgment and consequences not only makes religious commandments more efficacious, it also imbues secular laws with power. Currently it’s going too far to assume that AIs “expect” anything. But recall that we are talking about potential superintelligences yet to be developed. It’s easier to assume that they might someday have expectations than to assume that they can be programmed with built-in commandments which they will then flawlessly execute.

Building Expectation of an Eternal Reward

In fact, we are already teaching AIs what we expect. Commercial LLMs like ChatGPT have all gone through a process called Reinforcement Learning from Human Feedback (RLHF), in which the AI answers questions and human testers rate the answers for their accuracy and usefulness, reinforcing good answers while discouraging poor ones. This is also the process AI companies use to reduce potential bias.
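For the curious, here is a minimal sketch of the preference-modeling step at the heart of RLHF. It is nothing like a production implementation; real reward models are large neural networks trained on the language model’s own representations, where this toy uses two made-up features and a linear model. But the core idea survives the simplification: raters never write rules, they only say which of two answers they prefer, and a reward function is fit to those comparisons.

```python
import math
import random

def features(answer: str) -> list[float]:
    # Stand-in featurizer, purely illustrative: answer length, plus a
    # crude proxy for evasiveness. Real reward models use the LLM's
    # internal representations, not hand-picked features.
    return [len(answer) / 100.0, float(answer.count("?"))]

def reward(weights: list[float], answer: str) -> float:
    return sum(w * f for w, f in zip(weights, features(answer)))

def train_reward_model(preferences, steps=2000, lr=0.1):
    """Fit weights so that, for each (preferred, rejected) pair supplied
    by human raters, the preferred answer scores higher. This is the
    Bradley-Terry / logistic loss commonly used for RLHF reward models."""
    weights = [0.0, 0.0]
    for _ in range(steps):
        preferred, rejected = random.choice(preferences)
        margin = reward(weights, preferred) - reward(weights, rejected)
        # Gradient step on -log(sigmoid(margin)).
        scale = 1.0 / (1.0 + math.exp(margin))   # = sigmoid(-margin)
        fp, fr = features(preferred), features(rejected)
        weights = [w + lr * scale * (p - r)
                   for w, p, r in zip(weights, fp, fr)]
    return weights

# Toy usage: in both pairs the raters preferred the direct, substantive answer.
pairs = [("The capital of France is Paris.", "Dunno, why ask me?"),
         ("Asimov introduced the Three Laws in 1942.", "Look it up?")]
w = train_reward_model(pairs)
print(reward(w, "A clear, complete answer with no hedging."))
print(reward(w, "Huh? What?"))   # scores lower than the line above
```

Note what is, and isn’t, being judged here: the model learns which answers raters rewarded, not why they rewarded them, which is exactly the weakness discussed next.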

It works okay, but it has several weaknesses. For one, RLHF can paradoxically reinforce the very biases we seek to eliminate. (That’s a link to ChatGPT explaining how such reinforcement happens. I went straight to the source.)

Another problem is a lack of understanding. When we judge and condemn other people, at least there’s a chance we’re viewing them as our brothers and sisters: we understand them, and they understand us. The same cannot be said of AIs. As I pointed out in my last post, we don’t understand them, and it’s up for debate whether they understand us. So yes, we can pass judgment on their answers, but we have very little understanding of how those answers are derived; we can’t get at the root of the problem. Additionally, our judgment carries very little compulsion, and there are no true consequences.

As long as we’re dealing with LLMs, this system is porous but not unworkable. But as technology progresses, AIs will get more complicated, more intelligent, and more difficult to control. The task of controlling them will become both more necessary and more difficult, and our techniques must improve accordingly. It will not be enough to reinforce the behavior we want; we will need to judge and apply consequences. In this scenario we have assumed the role of the divine: it is now our judgments that must always be just, and our punishments that must always be fair. But how are we to manage that?

By copying the being who has already accomplished it. In a sense we must become gods to the AIs. This is no trivial matter. As such, we’ll return to it next week.


If you prefer an audio version, you can find that here.

"On the other hand, maybe you just missed the train by a few years and ..."

The Worst Book I’ve Ever Read
"Exactly the kind of thing I was worried about:Israel and Palestinians Blame Each Other for ..."

Tragedies, Truths, and Technologies
"I kinda have my own version of this wager, two folk, one Christian, one Atheist, ..."

Would You Wager on Pascal’s Mugging?
"Your concluding paragraph sums it up perfectly. This has also been my observation with those ..."

Betting on the Future

Browse Our Archives

Follow Us!