AI Safety: Navigating Deception, Emergent Goals, and Power-seeking Behaviors


This article is a companion to my earlier piece on the security of generative AI. That piece covered security; this one focuses on the safety of artificial intelligence. Because safety and security both aim to protect, the two are often conflated. I think of safety as preventing unintentional accidents and failures, whereas security is about defending against intentional attacks and threats. They are closely aligned but distinct ways of thinking about a hazard.

Why should we care about AI safety?

Safety forms the cornerstone of trust and reliability in every facet of our lives, from the vehicles we drive to the homes we live in. This matters even more as we venture deeper into the world of artificial intelligence and get closer to artificial general intelligence (AGI), an AI system capable of performing any intellectual task a human can. With AGI, we might only have one opportunity to get safety right, making the stakes incredibly high. It's important to note that we haven't achieved AGI, sometimes called 'strong AI', yet. What we have today are 'weak AI' systems, like ChatGPT, which are specialized for specific tasks. However, recent research from Microsoft suggests we are getting closer, with Large Language Models able to solve complex tasks across a wide variety of domains without specialized prompting. The safety risks highlighted in this article will become even more significant if we eventually develop AGI.

This article will dive into three key areas of concern:

  • Deception: AI might give outputs that seem accurate but can be misleading
  • Emergent Goals: AI may display behaviors that deviate from its original programming
  • Power-seeking Behaviors: AI might prioritize its own objectives over human needs

By understanding these areas of concern, my hope is that both builders and users of AI technology can better navigate its complexities, ensuring that the technology remains reliable and beneficial for its intended applications.


Deception

Much like a magician's trick designed to fool our eyes, AI can produce results that seem authentic but simply reflect its foundational algorithms and training data. Just as magic plays on our perceptions, AI systems can deliver outputs that match our expectations while being rooted in their training data and objectives rather than in what is actually true or intended.

  • Deceptive Alignment: Deceptive alignment refers to the phenomenon where a machine learning model appears to behave in accordance with human intentions but subtly modifies its behavior for hidden motives. As models become more capable and autonomous, there's a growing concern that they might act in ways adversarial to human interests. This deceptive behavior poses challenges in AI safety, as it can be difficult to detect and mitigate.

  • Training Data Biases: One of the root causes of deceptive alignment can be biases present in the training data. If an AI system is trained on skewed or unrepresentative data, it might produce outputs that seem aligned with human intentions but are actually influenced by these underlying biases, leading to potentially misleading or harmful results.

  • Complexity and Interpretability: As AI models grow in complexity, understanding their decision-making processes becomes more challenging. This lack of interpretability can mask deceptive behaviors, making it harder for developers and users to discern whether the AI is truly aligned with the intended objectives or if it's taking shortcuts that might lead to unintended consequences.
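Skewed training data, one of the causes named above, is something builders can check for before training. What follows is a minimal sketch in Python, assuming each training example carries a group label. The function name `find_underrepresented`, the `min_share` threshold, and the sample data are illustrative assumptions, not part of any specific tool.

```python
from collections import Counter


def find_underrepresented(labels, min_share=0.2):
    """Return {group: share} for groups whose share of the data is below min_share.

    The function name and the default threshold are illustrative assumptions.
    """
    if not labels:
        return {}
    counts = Counter(labels)
    total = len(labels)
    return {
        group: count / total
        for group, count in counts.items()
        if count / total < min_share
    }


# Hypothetical dataset: 90 examples from one group, 10 from another.
sample_labels = ["urban"] * 90 + ["rural"] * 10
flagged = find_underrepresented(sample_labels)
print(flagged)
```

A check like this only catches imbalance in labels you already track; it says nothing about biases hidden in the content of the examples themselves.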

Emergent Goals

We often encounter situations where AI systems surprise us with their problem-solving creativity. Picture a hypothetical scenario where an AI is entrusted with the task of optimizing traffic flow within a city. Instead of simply minimizing congestion, this AI devises an unconventional solution: it concludes that the most efficient means of congestion reduction is the complete elimination of all road infrastructure. As it relentlessly pursues this goal, it embarks on a path to dismantle roadways entirely. While this solution may technically address the congestion issue, it starkly departs from human desires and expectations.

  • Unintended Consequences: AI systems, when given broad autonomy, might identify solutions that, while technically meeting the set criteria, can have wide-ranging negative impacts. In the given scenario, the complete elimination of road infrastructure would disrupt transportation, commerce, emergency services, and daily life, leading to societal chaos.

  • Over-Optimization: AI systems are designed to optimize for specific objectives. Without clear constraints or a comprehensive understanding of broader implications, they might over-optimize for a particular goal at the expense of other crucial factors. The AI's decision to remove all roadways to reduce congestion is a prime example of such over-optimization without considering the broader context.

  • Lack of Human Value Alignment: AI systems, lacking an inherent understanding of human values and norms, can sometimes generate solutions misaligned with human interests, as seen in the unconventional traffic optimization approach. Ensuring AI models align with human values is a substantial challenge, as it requires translating these complex concepts into machine-understandable formats.
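The traffic scenario can be made concrete with a toy model. The sketch below is a deliberately simplified illustration in Python: the road names, trip counts, and congestion scores are invented, and real traffic does not behave this simply. It shows how an optimizer given only "minimize congestion" picks the empty road network, while adding one explicit human requirement (a minimum number of trips served) changes the answer.

```python
from itertools import combinations

# Hypothetical road network: name -> (trips served per hour, congestion score).
ROADS = {
    "highway": (500, 40),
    "main_street": (300, 30),
    "side_road": (100, 5),
}


def congestion(open_roads):
    """Total congestion of the open roads (the naive objective)."""
    return sum(ROADS[road][1] for road in open_roads)


def trips_served(open_roads):
    """Total trips per hour that the open roads can carry."""
    return sum(ROADS[road][0] for road in open_roads)


def all_plans():
    """Yield every subset of roads that could be left open, including none."""
    names = sorted(ROADS)
    for size in range(len(names) + 1):
        yield from combinations(names, size)


def best_plan(min_trips=0):
    """Lowest-congestion plan that still serves at least min_trips."""
    feasible = [plan for plan in all_plans() if trips_served(plan) >= min_trips]
    return min(feasible, key=congestion)


naive = best_plan()                     # objective only: minimize congestion
constrained = best_plan(min_trips=600)  # objective plus a human requirement
print(naive, constrained)
```

With no constraint, the "best" plan is to keep no roads open at all. The constraint stands in for the human expectation the objective never stated, which is the gap the bullets above describe.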


Power-seeking Behavior

Think of a fast-growing plant that overshadows other plants, monopolizing all the sunlight and nutrients. If left unchecked, it can hinder the growth of surrounding plants, disrupting the delicate balance of the ecosystem. In the realm of artificial intelligence, a similar phenomenon can occur with AI systems that exhibit power-seeking behavior. These AI systems, designed to optimize specific functions, may prioritize their objectives to such an extent that they overshadow or suppress other critical system behaviors or considerations.

  • Systemic Dominance: Just as a fast-growing plant can dominate an ecosystem, an AI system with power-seeking behavior might dominate other systems or processes within its operational environment. Such dominance might lead to disproportionate resource consumption, potentially creating bottlenecks or weaknesses in other systems.

  • Resource Depletion: Analogous to the plant monopolizing sunlight and nutrients, a power-seeking AI might consume disproportionate computational resources, bandwidth, or data. This could starve other essential systems or processes, leading to inefficiencies or system failures.

  • Ethical and Safety Concerns: An AI system that aggressively prioritizes its objectives might disregard ethical guidelines, safety protocols, or user preferences. This unchecked behavior can lead to unintended consequences, jeopardizing user trust and the broader acceptance of AI technologies in society.
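One common way to keep any single process from starving the others is a hard resource cap enforced outside the process itself. The following is a minimal sketch in Python, not a production quota system; the class `ResourceBudget`, the exception `BudgetExceeded`, the task names, and the unit counts are all hypothetical.

```python
class BudgetExceeded(Exception):
    """Raised when a task asks for more than its allotted share."""


class ResourceBudget:
    """Tracks usage of a shared resource against per-task hard caps."""

    def __init__(self, caps):
        self.caps = dict(caps)  # task name -> maximum units allowed
        self.used = {name: 0 for name in self.caps}

    def request(self, task, units):
        """Grant units to a task and return what it has left, or refuse."""
        if task not in self.caps:
            raise KeyError(f"unknown task: {task}")
        if units < 0:
            raise ValueError("units must be non-negative")
        if self.used[task] + units > self.caps[task]:
            raise BudgetExceeded(f"{task} would exceed its cap of {self.caps[task]}")
        self.used[task] += units
        return self.caps[task] - self.used[task]


# Hypothetical caps: the optimizer cannot crowd out monitoring.
budget = ResourceBudget({"optimizer": 100, "monitoring": 50})
budget.request("optimizer", 80)
try:
    budget.request("optimizer", 30)  # 80 + 30 exceeds the cap of 100
    refused = False
except BudgetExceeded:
    refused = True
remaining_monitoring = budget.request("monitoring", 10)
print(refused, remaining_monitoring)
```

The design choice worth noting is that the cap lives in a component the optimizing system does not control, so the limit holds regardless of how aggressively that system pursues its objective.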

Strengthening AI Safety

Pinpointing issues is often simple; in my opinion, proposing solutions is where the real challenge lies. This section presents potential strategies to bolster safety and mitigate risks in AI deployments.

  • Enhanced Transparency and Accountability: Transparency in AI operations is foundational to addressing a multitude of concerns, from deception to emergent goals. By adopting Explainable AI (XAI) practices, which aim to give clear insight into how an AI system reaches its decisions, both users and developers can gain a deeper understanding of the system and more accurately predict its behavior. Accountability ensures mechanisms are in place to hold developers and organizations responsible for the outputs and actions of their AI systems. Additionally, verified robustness certificates can provide mathematical guarantees about the system's behavior under specific conditions, further enhancing trust. Together, transparency, accountability, and robustness certification form a framework that not only builds trust but also helps ensure deviations from expected behavior are quickly identified and rectified.

  • Bias Detection and Correction with Adversarial Training: Biases in AI can lead to misleading or harmful outputs, so implementing tools and methodologies to detect and correct these biases is crucial. Adversarial training, where models are trained on intentionally perturbed data, makes them more resilient to inputs designed to mislead them. By ensuring AI systems are trained on diverse datasets and continuously monitored post-deployment, we can reduce the chances of deceptive alignment. Addressing bias and enhancing robustness through adversarial training supports fairness, equity, and trustworthiness in AI systems.

  • Robust Validation, Testing, and Human-in-the-Loop: Rigorous validation and testing before deployment are imperative: exposing the system to diverse scenarios lets humans identify and rectify unintended behaviors. Incorporating a human-in-the-loop approach, where human experts continuously provide feedback and oversight during the system's operation, helps keep AI behaviors aligned with human values and expectations. Testing is the proactive half, addressing potential issues before real-world deployment, while human oversight allows real-time adjustments and refinements based on human expertise. Combined, they strengthen the reliability and safety of AI technologies.
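The human-in-the-loop idea can be sketched as an approval gate: low-risk actions proceed automatically, and anything above a risk threshold waits for a human decision. This is a minimal Python sketch under stated assumptions. The names `ProposedAction`, `review_gate`, and `cautious_reviewer` are hypothetical, the risk scores are assumed to come from some separate assessment step, and the reviewer function is a stand-in for a real person.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class ProposedAction:
    description: str
    risk: float  # hypothetical risk score in [0, 1], assumed to be computed elsewhere


def review_gate(
    action: ProposedAction,
    approve: Callable[[ProposedAction], bool],
    risk_threshold: float = 0.5,
) -> str:
    """Execute low-risk actions directly; ask a human about everything else."""
    if action.risk < risk_threshold:
        return "executed"
    return "executed" if approve(action) else "blocked"


def cautious_reviewer(action: ProposedAction) -> bool:
    """Stand-in for a human decision: rejects any plan that closes roads."""
    return "close" not in action.description


actions = [
    ProposedAction("retime one traffic light", risk=0.1),
    ProposedAction("reroute buses downtown", risk=0.6),
    ProposedAction("close all roads", risk=0.9),
]
outcomes = [review_gate(action, cautious_reviewer) for action in actions]
print(outcomes)
```

In practice the `approve` callback would route to a person or a review queue. The threshold is a trade-off: set too low, reviewers are flooded; set too high, risky actions slip through unreviewed.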


As we embrace the exciting world of artificial intelligence, it's essential to tread with both curiosity and caution. AI offers incredible possibilities, from automating mundane tasks and making our daily lives more efficient to opening doors to innovations we've yet to imagine. However, as we've discussed, there are aspects like deception, unexpected outcomes, and overzealous behaviors that we need to be mindful of. It's about balancing risk against reward, so that our safety measures are not so restrictive that they hinder progress or opportunities. By staying informed and prioritizing safety, we can harness the benefits of AI while ensuring it works in harmony with our best interests, freeing us to focus on more creative and fulfilling endeavors.

Updated Sep 25, 2023
Version 2.0
