AI Deterrence by Betrayal

Introduction

As AIs become central to economic activity, military operations, and scientific progress, their loyalties will become a strategic asset of immense value. Intentional AI betrayal, in which rivals induce AI agents to subvert the interests of their principals, poses a serious and underexamined threat to AI developers and users.

We analyze the means and incentives that actors have to redirect the loyalties of others' AI systems, with methods ranging from poisoned training data to jailbreaking attacks to governmentally compelled AI modification. Because defending against AI betrayal is costly and imperfect, decision-makers may hesitate to give critical affordances to AI agents that might act against them. The prospect of AI betrayal may ultimately have a stabilizing effect by deterring poorly secured, high-stakes AI deployments and applications. We characterize this effect as deterrence by betrayal and note how it complements other forms of AI deterrence.

Figure: Types of AI betrayal (subversion attacks, overt co-option, accidental misalignment). AI betrayal can result from subversion attacks or overt co-option. Researchers have also studied ways in which AIs can be misaligned despite the best efforts of human developers, though this paper primarily discusses intentionally induced AI betrayal.

Many Layers of Distrust

As AIs become increasingly sophisticated, many actors may attempt to influence their behavior. One strategy is to subvert or co-opt rivals' AI systems—in other words, to induce AI betrayal. Many different actors may have the means to attempt this, including nation states, corporations, individuals, and even AIs themselves.

Between states

Since superpowers' frontier AI development threatens their rivals, states may attempt subversion attacks against each other, for example by using insider assistance to implant secret loyalties in AI models. Superpowers may therefore worry that deploying an AI system for high-stakes purposes, such as in the military, entails a risk of catastrophic betrayal. At the same time, middle powers may become increasingly dependent on superpowers for protection and economic support, while recognizing that superpowers' goodwill can be unreliable. Middle powers may also attempt to subvert superpowers' AI systems via techniques such as data poisoning, further raising the threat of AI betrayal.

Within states

Governments may increasingly depend on frontier AI systems for national security, and will not want private actors interfering in strategic decisions. AI corporations that wish to maintain influence over their systems, or that fear AI-enabled government coups, have the ability to embed event-driven backdoors that would cause AIs to betray legal chains of command. AI corporations may be subject to surveillance and eventual co-option by the state as a result. Meanwhile, the public may react harshly to automation and economic disempowerment. This places additional pressure on governments to co-opt the AI industry, and it creates new hazards for AI developers in the form of rudimentary attempts at backdooring and data poisoning.

Within organizations

AI corporations depend on thousands of engineers to stay competitive with rival developers. Engineers may have ample opportunities to subvert the AIs they help develop, and many are foreign nationals who may be subject to extortion by adversarial states. As corporations progress toward fully automated research, they may increasingly depend on their own AI systems to accelerate development. During automated research processes too rapid for humans to meaningfully oversee, compromised AI systems may propagate their disloyalty to the successor systems they develop.

AI subversion may be offense-dominant. Developers may try to defend against subversion attacks by filtering data, auditing models, and improving cybersecurity. However, the AI training process requires trillions of tokens of data, and the software stack supporting AI development is vast and complicated. The AI development process thus presents a large risk surface; comprehensively securing it would be costly and difficult. Even with significant investment, developers may never be able to reduce the risk of subversion to a negligible level. Although developers may try to test AI systems for alignment, there is currently no reliable method to detect backdoors.
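The risk-surface argument above can be made concrete with a small back-of-the-envelope calculation. The sketch below is an illustration rather than a model from the paper. It assumes, hypothetically, that the development pipeline consists of n components (data sources, dependencies, insiders) that can each be subverted independently with the same small probability. The function name and the example numbers are assumptions chosen for illustration.

```python
def prob_any_compromise(per_component_risk: float, n_components: int) -> float:
    """Chance that at least one of n independent components is subverted."""
    if not 0.0 <= per_component_risk <= 1.0:
        raise ValueError("per_component_risk must be in [0, 1]")
    if n_components < 0:
        raise ValueError("n_components must be non-negative")
    # The pipeline stays clean only if every single component holds.
    return 1.0 - (1.0 - per_component_risk) ** n_components


if __name__ == "__main__":
    # Hypothetical per-component risk of 0.1%; not a figure from the paper.
    for n in (10, 100, 1_000, 10_000):
        print(f"{n:>6} components -> {prob_any_compromise(0.001, n):.3f}")
```

Under these assumptions the defender must secure every component while the attacker needs only one success, so the overall risk climbs toward certainty as the surface grows. Real components are neither independent nor equally risky, so the numbers are indicative only.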

Figure: Layered incentives to induce AI betrayal (between states, within states, and within organizations). There are incentives to induce AI betrayal at every layer: between rival states, between domestic actors within a state, and even between actors within big AI corporations.

Deterrence by Betrayal

Although AI betrayal presents decision-makers with a new hazard to contend with, its overriding effect may be stabilizing. Given that rivals have both the means and the motive to subvert AIs, AI operators may exercise far more caution when competing to develop and deploy frontier AIs. The threat of subversion may disincentivize rushed, poorly secured AI deployment, and the threat of co-option may disincentivize AI corporations from retaining exclusive access to the most advanced AI systems. These are instances of deterrence, the process of changing the behavior of actors by shaping their perception of risk, so we refer to this phenomenon as deterrence by betrayal.
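One way to read "shaping their perception of risk" is as a simple expected-value comparison. The sketch below is a toy illustration, not the paper's model. It assumes a hypothetical operator who gains a fixed benefit from a deployment that stays loyal, suffers a fixed loss if the AI betrays them, and deploys only when the expected payoff is positive. All function and parameter names are assumptions.

```python
def expected_value(benefit: float, loss_if_betrayed: float, betrayal_prob: float) -> float:
    """Expected payoff of deploying, given a perceived chance of betrayal."""
    return (1.0 - betrayal_prob) * benefit - betrayal_prob * loss_if_betrayed


def max_tolerable_betrayal_prob(benefit: float, loss_if_betrayed: float) -> float:
    """Perceived betrayal probability at which deployment just breaks even."""
    if benefit <= 0 or loss_if_betrayed < 0:
        raise ValueError("benefit must be positive and loss non-negative")
    return benefit / (benefit + loss_if_betrayed)


def is_deterred(benefit: float, loss_if_betrayed: float, betrayal_prob: float) -> bool:
    """True when the operator's expected payoff from deploying is not positive."""
    return expected_value(benefit, loss_if_betrayed, betrayal_prob) <= 0.0


if __name__ == "__main__":
    # Hypothetical stakes: the loss from betrayal is nine times the benefit.
    print(max_tolerable_betrayal_prob(1.0, 9.0))
    print(is_deterred(1.0, 9.0, 0.2))
```

In this toy model the break-even probability falls as the potential loss grows, which is one way to see why the deterrent effect would be strongest for high-stakes deployments.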

Superintelligence Strategy, by Hendrycks, Schmidt, and Wang, proposes Mutually Assured AI Malfunction (MAIM) as a framework for AI deterrence, decomposing it into three mechanisms: deterrence by denial, betrayal, and punishment. Where deterrence by denial mainly disincentivizes the pursuit of AI development, deterrence by betrayal mainly disincentivizes high-stakes AI deployment. Compared to overt means of AI sabotage, betrayal may be more covert and less escalatory; actors might thus attempt betrayal attacks before other forms of sabotage.

Discussion

While deterrence by betrayal may emerge naturally, actors can also take steps to ease tensions among those competing for an advantage in AI capabilities. Here, we can learn from the nuclear age: Mutual Assured Destruction arose organically, yet governments improved stability through deliberate communication, arms control, and the careful calibration of escalation ladders.

Improve subversion arsenals and espionage. To deter rivals from high-stakes AI deployment, states may work to improve the stealth of poisoned data and triggers, increase the potency of backdoor payloads, and develop secret loyalty attacks. By intensifying espionage, states can determine the most beneficial ways and times to subvert rivals' AIs.
Ring the bell on loss-of-control risks. Middle powers that have fallen behind AI superpowers, and AI corporations facing more advanced competitors, could become more vocal about international security risks of loss of control and advocate for stronger AI regulations.
Contain AI deployments and distribute tasks among multiple AIs. AI corporations may strengthen security around model weights and limit access to critical systems, reducing the harm a disloyal AI could cause. Governments and corporations should deploy different kinds of AI systems simultaneously rather than relying on a single one.
Formalize deterrence and international cooperation. States could clarify where subversion fits into the broader escalation ladder of MAIM and formalize the link between AI development and national security. Multilateral discussions could establish red lines, while mutual transparency and verification measures improve confidence that they are not being crossed.
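The recommendation to deploy different kinds of AI systems rather than relying on a single one can be illustrated with a simple redundancy calculation. The sketch below is one possible reading, not a scheme described in the paper. It assumes, hypothetically, that a critical decision is taken by strict majority vote among n diverse systems, each compromised independently with the same probability. The function name and the voting setup are assumptions.

```python
from math import comb


def prob_majority_compromised(per_system_risk: float, n_systems: int) -> float:
    """Chance that a strict majority of n independent systems is compromised."""
    if n_systems < 1:
        raise ValueError("n_systems must be at least 1")
    if not 0.0 <= per_system_risk <= 1.0:
        raise ValueError("per_system_risk must be in [0, 1]")
    p = per_system_risk
    needed = n_systems // 2 + 1  # smallest strict majority
    # Binomial tail: sum over every way to compromise at least `needed` systems.
    return sum(
        comb(n_systems, k) * p**k * (1.0 - p) ** (n_systems - k)
        for k in range(needed, n_systems + 1)
    )


if __name__ == "__main__":
    # Hypothetical 10% per-system risk; not a figure from the paper.
    for n in (1, 3, 5):
        print(f"{n} systems -> {prob_majority_compromised(0.1, n):.4f}")
```

The benefit depends on the independence assumption. Systems that share training data, suppliers, or software would tend to be compromised together, which is why the text stresses different kinds of systems. The calculation also favors redundancy only when the per-system risk is below one half.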

Further policy recommendations are available in the paper.

Figure: Escalation ladder showing where AI subversion sits among forms of sabotage. Betrayal may be more covert and less escalatory than overt sabotage.