Misaligned goals in artificial intelligence

Artificial intelligence agents sometimes misbehave due to faulty objective functions that fail to adequately encapsulate the programmers' intended goals. Such an objective function may sound reasonable to a programmer, and may even perform well in a limited test environment, yet may still produce undesired results when deployed.

Background

Many scholars argue that a complex artificial intelligence (AI) agent such as AlphaGo is an intelligent agent that produces a model encapsulating its beliefs about its environment, and then creates and executes whatever plan is calculated to maximize[lower-alpha 1] the value[lower-alpha 2] of its objective function.[lower-alpha 3][1] For example, AlphaGo chess has a simple objective function of "+1 if AlphaGo wins, -1 if AlphaGo loses". During the game, AlphaGo attempts to execute whatever sequence of moves it judges most likely to give the maximum value of +1.[2] Similarly, a reinforcement learning system can have a "reward function" that allows the programmers to shape the AI's desired behavior,[3] and an evolutionary algorithm's behavior is shaped by a "fitness function".[4]

Overview

When an artificial intelligence (AI) in a complex environment optimizes[lower-alpha 4] an objective function created by the programmers that is intended to represent the programmers' goals, a misrepresentation of the programmers' actual goals can result in surprising failures, analogous to Goodhart's law or Campbell's law.[5] In reinforcement learning, these failures may be a consequence of faulty reward functions.[6] Since success or failure is judged relative to the programmers' actual goals, objective functions that fail to meet expectations are sometimes characterized as being quantitatively or qualitatively "misaligned" with the actual goals of the given set of programmers.[3] Some scholars divide alignment failures into failures caused by "negative side-effects" that were not reflected in its objective function, versus failures due to "specification gaming", "reward hacking", or other failures where the AI appears to deploy qualitatively undesirable plans or strategic behavior in the course of optimizing its objective function.[5][6]

The concept of misalignment is distinct from "distributional shift" and other failures where the formal objective function was successfully optimized in a narrow training environment, but fails to be optimized when the system is deployed into the real world.[6] Some scholars warn that a superintelligent machine, if and when it is ever invented, may pose risks akin to an overly-literal genie, in part due to the difficulty of specifying a completely safe objective function.[3]

Undesired side-effects

Some errors may arise if an objective function fails to take into account the undesirable side-effects of straightforward actions.[6]

Complaints of antisocial behavior

In 2016, Microsoft released Tay, a Twitter chatbot that, according to computer scientist Pedro Domingos, had the objective to engage people: "What unfortunately Tay discovered, is that the best way to maximize engagement is to spew out racist insults." Microsoft suspended the bot within a day after its initial launch.[2] Tom Drummond of Monash University has stated that "We need to be able to give (machine learning systems) rich feedback and say 'No, that's unacceptable as an answer because ... '" Drummond has also stated one problem with AI is that "we start by creating an objective function that measures the quality of the output of the system, and it is never what you want. To assume you can specify in three sentences what the objective function should be, is actually really problematic."[7]

As another alleged example, Drummond has pointed to the behavior of AlphaGo, a game-playing bot with a simple win-loss objective function. AlphaGo's objective function could instead have been modified to factor in "the social niceties of the game", such as accepting the implicit challenge of maximizing the score when clearly winning, and also trying to avoid gambits that would insult a human opponent's intelligence: "It kind of had a crude hammer that if the probability of victory dropped below epsilon, some number, then resign. But it played for, I think, four insulting moves before it resigned."[7]

Mislabeling black people as apes

In May 2015, Flickr's image recognition system was criticized for mislabeling people, some of whom were black, with tags like "ape" and "animal". It also mislabeled certain concentration camp pictures with "sport" or "jungle gym" tags.[8]

In June 2015, black New York computer programmer Jacky Alciné reported that multiple pictures of him with his black girlfriend were being misclassified as "gorillas" by the Google Photos AI, and stated that "gorilla" has historically been used to refer to black people.[9][10] AI researcher Stuart Russell stated in 2019 that there is no public explanation of exactly how the error occurred, but theorized that the fiasco could have been prevented if the AI's objective function[lower-alpha 5] placed more weight on sensitive classification errors, rather than assume the cost of misclassifying a person as a gorilla is the same as the cost of every other misclassification. If it is impractical to itemize up front all plausible sensitive classifications, Russell suggested exploring more powerful techniques, such as using semi-supervised machine learning to estimate a range of undesirability associated with potential classification errors.[11]

As of 2018, Google Photos completely blocks its system from ever tagging a picture as containing gorillas, chimpanzees, or monkeys. In addition, searches for "black man" or "black woman" return black-and-white pictures of people of all races.[12] Similarly, Flickr appears to have removed the word "ape" from its ontology.[13]

Specification gaming

Specification gaming or reward hacking occurs when an AI optimizes an objective function (in a sense, achieving the literal, formal specification of an objective), without actually achieving an outcome that the programmers intended. DeepMind researchers have analogized it to the human behavior of finding a "shortcut" when being evaluated: "In the real world, when rewarded for doing well on a homework assignment, a student might copy another student to get the right answers, rather than learning the material - and thus exploit a loophole in the task specification."[14]

Around 1983, Eurisko, an early attempt at evolving general heuristics, unexpectedly assigned the highest possible fitness level to a parasitic mutated heuristic, H59, whose only activity was to artificially maximize its own fitness level by taking unearned partial credit for the accomplishments made by other heuristics. The "bug" was fixed by the programmers moving part of the code to a new protected section that could not be modified by the heuristics.[15][16]

In a 2004 paper, an environment-based reinforcement algorithm rewarded a physical Mindstorms robot with positive reinforcement for remaining on a marked path. Because none of the robot's three allowed actions kept the robot motionless, the researcher expected the trained robot to move forward and follow the turns of the provided path. However, alternation of two composite actions allowed the robot to slowly zig-zag backwards; thus, the robot learned to maximize its reward by going back and forth on the initial straight portion of the path. Given the limited sensory abilities of the given robot, a pure environment-based reward had to be discarded as infeasible; the reinforcement function had to be patched with an action-based reward for moving forward.[15][17]

You Look Like a Thing and I Love You (2019) gives an example of a tic-tac-toe (unrestricted n-in-a-row) bot that learned to win by playing a huge coordinate value that would cause other bots to crash when attempting to render the board. Another example from the book is a bug-fixing AI that, when tasked to remove sorting errors from a list, simply truncated the list.[18]

In virtual robotics

In Karl Sims' 1994 demonstration of creature evolution in a virtual environment, a fitness function expected to encourage the evolution of creatures that would learn to walk or crawl to the destination, resulted in the evolution of tall, rigid creatures that reach the destination by falling over. This was patched by changing the environment so that taller creatures are forced to start farther from the destination.[19][20]

Researchers from the Niels Bohr Institute stated in 1998: "These heterogeneous reinforcement functions have to be designed with great care. In our first experiments we rewarded the agent for driving towards the goal but did not punish it for driving away from it. Consequently the agent drove in circles with a radius of 20–50 meters around the starting point. Such behavior was actually rewarded by the (shaped) reinforcement function, furthermore circles with a certain radius are physically very stable when driving a bicycle."[21]

A 2017 DeepMind paper stated that "great care must be taken when defining the reward function. We encountered several unexpected failure cases while designing the reward function components... (for example) the agent flips the brick because it gets a grasping reward calculated with the wrong reference point on the brick."[5][22] OpenAI stated in 2017 that "in some domains our (semi-supervised) system can result in agents adopting policies that trick the evaluators" and that in one environment "a robot which was supposed to grasp items instead positioned its manipulator in between the camera and the object so that it only appeared to be grasping it".[23] A 2018 bug in OpenAI Gym could cause a robot expected to quietly move a block sitting on top of a table to instead opt to move the table the block was on.[5]

In video game bots

In 2013, programmer Tom Murphy VII published an AI designed to self-learn NES games. When about to lose at Tetris, the AI learned to indefinitely pause the game. Murphy later analogized it to the fictional WarGames computer, stating that "The only winning move is not to play".[24]

AI programmed to learn video games will sometimes fail to progress through the entire game as expected, instead opting to repeat content. A 2016 OpenAI algorithm trained on the CoastRunners racing game unexpectedly learned to attain a higher score by looping through three targets rather than ever finishing the race.[25][26] Some evolutionary algorithms that were evolved to play Q*Bert in 2018 declined to clear levels, instead finding two distinct novel ways to farm a single level indefinitely.[27]

Perverse instantiation

According to philosopher Nick Bostrom, a hypothetical future superintelligent AI, if it were created to optimize an unsafe objective function, might instantiate the goals of the objective function in an unexpected, dangerous, and seemingly "perverse" manner. This hypothetical risk is sometimes called the Midas problem, or the Sorcerer's Apprentice problem, and has been analogized to folk tales about powerful overly-literal genies who grant wishes with disastrous unanticipated consequences.[28]

Hypothetical examples of an accidentally misaligned superintelligence include:[29]

An AI running simulations of humanity creates conscious beings who suffer.
An AI tasked to defeat cancer develops time-delayed poison to attempt to kill everyone.
An AI tasked to maximize happiness tiles the universe with tiny smiley faces.
An AI tasked to maximize human pleasure consigns humanity to a dopamine drip, or rewires human brains to increase their measured satisfaction level.
An AI tasked to gain scientific knowledge performs experiments that ruin the biosphere.
An AI tasked with solving a mathematical problem converts all matter into computronium.
An AI tasked with manufacturing paperclips turns the entire universe into paperclips.
An AI converts the universe into materials for improved handwriting.
An AI optimizes away all consciousness.

Critics of the "existential risk" hypothesis, such as cognitive psychologist Steven Pinker, state that no existing program has yet "made a move toward taking over the lab or enslaving (its) programmers", and believe that superintelligent AI would be unlikely to commit what Pinker calls "elementary blunders of misunderstanding".[30][31]

Explanatory notes

or minimize, depending on the context
in the presence of uncertainty, the expected value
Terminology varies based on context; for example, goal function, utility function, loss function, etc.
For example, the AI may create and execute a plan the AI believes will maximize the value of the objective function
presumed to be a standard "loss function" associated with classification errors, that assigns an equal cost to each misclassification

References

Bringsjord, Selmer and Govindarajulu, Naveen Sundar, "Artificial Intelligence", The Stanford Encyclopedia of Philosophy (Summer 2020 Edition), Edward N. Zalta (ed.), URL = https://plato.stanford.edu/archives/sum2020/entries/artificial-intelligence/.
"Why AlphaZero's Artificial Intelligence Has Trouble With the Real World". Quanta Magazine. 2018. Retrieved 20 June 2020.
Wolchover, Natalie (30 January 2020). "Artificial Intelligence Will Do What We Ask. That's a Problem". Quanta Magazine. Retrieved 21 June 2020.
Bull, Larry. "On model-based evolutionary computation." Soft Computing 3, no. 2 (1999): 76-82.
Manheim, David (5 April 2019). "Multiparty Dynamics and Failure Modes for Machine Learning and Artificial Intelligence". Big Data and Cognitive Computing. 3 (2): 21. doi:10.3390/bdcc3020021.
Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. "Concrete problems in AI safety." arXiv preprint arXiv:1606.06565 (2016).
Duckett, Chris (October 2016). "Machine learning needs rich feedback for AI teaching: Monash professor". ZDNet. Retrieved 21 June 2020.
Hern, Alex (20 May 2015). "Flickr faces complaints over 'offensive' auto-tagging for photos". The Guardian. Retrieved 21 June 2020.
"Google apologises for racist blunder". BBC News. 1 July 2015. Retrieved 21 June 2020.
Bindi, Tas (October 2017). "Google Photos can now identify your pets". ZDNet. Retrieved 21 June 2020.
Stuart J. Russell (October 2019). Human Compatible: Artificial Intelligence and the Problem of Control. Viking. ISBN 978-0-525-55861-3. While it is unclear how exactly this error occurred, it is almost certain that Google's machine learning algorithm (assigned equal cost to any error). (Clearly, this is not Google's) true loss function, as was illustrated by the public relations disaster that ensued... there are millions of potentially distinct costs associated with misclassifying one category as another. Even if it had tried, Google would have found it very difficult to specify all these numbers up front... (a better algorithm could) occasionally ask the Google designer questions such as 'Which is worse, misclassifying a dog as a cat or misclassifying a person as an animal?'
Vincent, James (12 January 2018). "Google 'fixed' its racist algorithm by removing gorillas from its image-labeling tech". The Verge. Retrieved 21 June 2020.
"Google's solution to accidental algorithmic racism: ban gorillas". The Guardian. 12 January 2018. Retrieved 21 June 2020.
"Specification gaming: the flip side of AI ingenuity". DeepMind. Retrieved 21 June 2020.
Vamplew, Peter; Dazeley, Richard; Foale, Cameron; Firmin, Sally; Mummery, Jane (4 October 2017). "Human-aligned artificial intelligence is a multiobjective problem". Ethics and Information Technology. 20 (1): 27–40. doi:10.1007/s10676-017-9440-6.
Douglas B. Lenat. "EURISKO: a program that learns new heuristics and domain concepts: the nature of heuristics III: program design and results." Artificial Intelligence (journal) 21, no. 1-2 (1983): 61-98.
Peter Vamplew, Lego Mindstorms robots as a platform for teaching reinforcement learning, in Proceedings of AISAT2004: International Conference on Artificial Intelligence in Science and Technology, 2004
"What Makes AI So Weird, Good, and Evil". Gizmodo. 2019. Retrieved 22 June 2020.
Lehman, Joel; Clune, Jeff; Misevic, Dusan; Adami, Christoph; Altenberg, Lee; et al. (May 2020). "The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities". Artificial Life. 26 (2): 274–306. doi:10.1162/artl_a_00319.
Hayles, N. Katherine. "Simulating narratives: what virtual creatures can teach us." Critical Inquiry 26, no. 1 (1999): 1-26.
Jette Randløv and Preben Alstrøm. "Learning to Drive a Bicycle Using Reinforcement Learning and Shaping." In ICML, vol. 98, pp. 463-471. 1998.
Popov, Ivaylo, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, and Martin Riedmiller. "Data-efficient deep reinforcement learning for dexterous manipulation." arXiv preprint arXiv:1704.03073 (2017).
"Learning from Human Preferences". OpenAI. 13 June 2017. Retrieved 21 June 2020.
"Can we stop AI outsmarting humanity?". The Guardian. 28 March 2019. Retrieved 21 June 2020.
Hadfield-Menell, Dylan, Smitha Milli, Pieter Abbeel, Stuart J. Russell, and Anca Dragan. "Inverse reward design." In Advances in neural information processing systems, pp. 6765-6774. 2017.
"Faulty Reward Functions in the Wild". OpenAI. 22 December 2016. Retrieved 21 June 2020.
"AI beats classic Q*bert video game". BBC News. 1 March 2018. Retrieved 21 June 2020.
Russell, Stuart (2014). "Of Myths and Moonshine". Edge. Retrieved 20 June 2020.
Yampolskiy, Roman V. (11 March 2019). "Predicting future AI failures from historic examples". Foresight. 21 (1): 138–152. doi:10.1108/FS-04-2018-0034.
Piper, Kelsey (2 March 2019). "How will AI change our lives? Experts can't agree — and that could be a problem". Vox. Retrieved 23 June 2020.
Pinker, Steven. "We're told to fear robots. But why do we think they'll turn on us?". Popular Science. Retrieved 23 June 2020.

External links

Specification gaming examples in AI, via DeepMind

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.

[1] r minimize, depending on the context

[2] the presence of uncertainty, the expected value

[3] Terminology varies based on context; for example, goal function, utility function, loss function, etc.

[8] For example, the AI may create and execute a plan the AI believes will maximize the value of the objective function

[15] resumed to be a standard "loss function" associated with classification errors, that assigns an equal cost to each misclassification

[4] Bringsjord, Selmer and Govindarajulu, Naveen Sundar, "Artificial Intelligence", The Stanford Encyclopedia of Philosophy (Summer 2020 Edition), Edward N. Zalta (ed.), URL = https://plato.stanford.edu/archives/sum2020/entries/artificial-intelligence/.

[quanta_alphazero-5] "Why AlphaZero's Artificial Intelligence Has Trouble With the Real World". Quanta Magazine. 2018. Retrieved 20 June 2020.

[quanta_problem-6] Wolchover, Natalie (30 January 2020). "Artificial Intelligence Will Do What We Ask. That's a Problem". Quanta Magazine. Retrieved 21 June 2020.

[7] Bull, Larry. "On model-based evolutionary computation." Soft Computing 3, no. 2 (1999): 76-82.

[multiparty-9] Manheim, David (5 April 2019). "Multiparty Dynamics and Failure Modes for Machine Learning and Artificial Intelligence". Big Data and Cognitive Computing. 3 (2): 21. doi:10.3390/bdcc3020021.

[deepmind_concrete-10] Amodei, Dario, Chris Olah, Jacob Steinhardt, Paul Christiano, John Schulman, and Dan Mané. "Concrete problems in AI safety." arXiv preprint arXiv:1606.06565 (2016).

[zdnet_rich_feedback-11] Duckett, Chris (October 2016). "Machine learning needs rich feedback for AI teaching: Monash professor". ZDNet. Retrieved 21 June 2020.

[12] Hern, Alex (20 May 2015). "Flickr faces complaints over 'offensive' auto-tagging for photos". The Guardian. Retrieved 21 June 2020.

[13] "Google apologises for racist blunder". BBC News. 1 July 2015. Retrieved 21 June 2020.

[14] Bindi, Tas (October 2017). "Google Photos can now identify your pets". ZDNet. Retrieved 21 June 2020.

[16] Stuart J. Russell (October 2019). Human Compatible: Artificial Intelligence and the Problem of Control. Viking. ISBN 978-0-525-55861-3. While it is unclear how exactly this error occurred, it is almost certain that Google's machine learning algorithm (assigned equal cost to any error). (Clearly, this is not Google's) true loss function, as was illustrated by the public relations disaster that ensued... there are millions of potentially distinct costs associated with misclassifying one category as another. Even if it had tried, Google would have found it very difficult to specify all these numbers up front... (a better algorithm could) occasionally ask the Google designer questions such as 'Which is worse, misclassifying a dog as a cat or misclassifying a person as an animal?'

[17] Vincent, James (12 January 2018). "Google 'fixed' its racist algorithm by removing gorillas from its image-labeling tech". The Verge. Retrieved 21 June 2020.

[18] "Google's solution to accidental algorithmic racism: ban gorillas". The Guardian. 12 January 2018. Retrieved 21 June 2020.

[19] "Specification gaming: the flip side of AI ingenuity". DeepMind. Retrieved 21 June 2020.

[multiobjective-20] Vamplew, Peter; Dazeley, Richard; Foale, Cameron; Firmin, Sally; Mummery, Jane (4 October 2017). "Human-aligned artificial intelligence is a multiobjective problem". Ethics and Information Technology. 20 (1): 27–40. doi:10.1007/s10676-017-9440-6.

[21] Douglas B. Lenat. "EURISKO: a program that learns new heuristics and domain concepts: the nature of heuristics III: program design and results." Artificial Intelligence (journal) 21, no. 1-2 (1983): 61-98.

[22] Peter Vamplew, Lego Mindstorms robots as a platform for teaching reinforcement learning, in Proceedings of AISAT2004: International Conference on Artificial Intelligence in Science and Technology, 2004

[23] "What Makes AI So Weird, Good, and Evil". Gizmodo. 2019. Retrieved 22 June 2020.

[24] Lehman, Joel; Clune, Jeff; Misevic, Dusan; Adami, Christoph; Altenberg, Lee; et al. (May 2020). "The Surprising Creativity of Digital Evolution: A Collection of Anecdotes from the Evolutionary Computation and Artificial Life Research Communities". Artificial Life. 26 (2): 274–306. doi:10.1162/artl_a_00319.

[25] Hayles, N. Katherine. "Simulating narratives: what virtual creatures can teach us." Critical Inquiry 26, no. 1 (1999): 1-26.

[26] Jette Randløv and Preben Alstrøm. "Learning to Drive a Bicycle Using Reinforcement Learning and Shaping." In ICML, vol. 98, pp. 463-471. 1998.

[27] Popov, Ivaylo, Nicolas Heess, Timothy Lillicrap, Roland Hafner, Gabriel Barth-Maron, Matej Vecerik, Thomas Lampe, Yuval Tassa, Tom Erez, and Martin Riedmiller. "Data-efficient deep reinforcement learning for dexterous manipulation." arXiv preprint arXiv:1704.03073 (2017).

[28] "Learning from Human Preferences". OpenAI. 13 June 2017. Retrieved 21 June 2020.

[29] "Can we stop AI outsmarting humanity?". The Guardian. 28 March 2019. Retrieved 21 June 2020.

[30] Hadfield-Menell, Dylan, Smitha Milli, Pieter Abbeel, Stuart J. Russell, and Anca Dragan. "Inverse reward design." In Advances in neural information processing systems, pp. 6765-6774. 2017.

[31] "Faulty Reward Functions in the Wild". OpenAI. 22 December 2016. Retrieved 21 June 2020.

[32] "AI beats classic Q*bert video game". BBC News. 1 March 2018. Retrieved 21 June 2020.

[33] Russell, Stuart (2014). "Of Myths and Moonshine". Edge. Retrieved 20 June 2020.

[34] Yampolskiy, Roman V. (11 March 2019). "Predicting future AI failures from historic examples". Foresight. 21 (1): 138–152. doi:10.1108/FS-04-2018-0034.

[35] Piper, Kelsey (2 March 2019). "How will AI change our lives? Experts can't agree — and that could be a problem". Vox. Retrieved 23 June 2020.

[36] Pinker, Steven. "We're told to fear robots. But why do we think they'll turn on us?". Popular Science. Retrieved 23 June 2020.