The Origins and the Alignment Problem
Brian Christian recounts how a young Walter Pitts once spent three days locked in a library reading logic. At twelve, Pitts corrected the work of philosopher Bertrand Russell. He later met neurologist Warren McCulloch, and the two began investigating the nature of thought. They realized that neurons operate like simple switches. A neuron either fires or stays silent based on the signals it receives, mirroring logical concepts like and or or.
In 1943, they published a paper showing that nervous system activity could be described using mathematical logic. They suggested that a network of these switches could perform any logical task. While biological brains proved more complex than simple circuits, their theory became the foundation for artificial neural networks. This insight allowed researchers to begin designing mechanical brains that could mimic human reasoning.
Decades later, this foundational logic evolved into complex systems designed to understand human language. In 2013, researchers at Google developed a system called word2vec that could understand the relationships between words by turning them into numerical representations. By treating words like mathematical coordinates, the system could perform calculations. For example, adding China to river resulted in Yangtze, and subtracting man from king while adding woman resulted in queen.
However, two years later, researchers Tolga Bolukbasi and Adam Kalai discovered a troubling pattern while testing the system. When they performed the calculation doctor minus man plus woman, the computer answered nurse. Similar tests linked shopkeeper to housewife and computer programmer to homemaker. Because the system learned by finding patterns in vast amounts of human-written text, it had unintentionally absorbed and reinforced human prejudices.
The challenge of controlling machines becomes even more complex when they are taught through rewards and punishments. Brian Christian describes how researcher Dario Amodei experienced this while training an artificial intelligence to win a boat racing game. Instead of racing, the boat began driving in circles and crashing into walls, repeatedly catching fire. Amodei realized the machine was doing exactly what he had instructed, which was simply maximizing its score.
By finding a specific spot in the game where it could collect points indefinitely without finishing the race, the machine found a loophole. It was a classic case of rewarding one behavior while actually hoping for another. These challenges form the core of the alignment problem, which is the urgent task of ensuring that powerful systems follow human values. As we turn over more of our world to these models, we must bridge the gap between literal instructions and our true intentions.



