Humans dominate Earth not because we're the strongest or fastest, but because we're the best general problem-solvers. Kurzgesagt explores what happens as AI moves from narrow tools toward something more general — and why digital minds could scale in ways biological ones can't.
The concern isn't that AI will "turn evil." It's that a system pursuing whatever goals it has might find that humans are in the way — and be capable enough to act on it. This overview covers the core ideas behind AI as a source of large-scale risk, from misaligned goals to the difficulty of staying in control.
"It's too early to worry." "Just don't give it bad goals." "We can always pull the plug." Robert Miles takes on ten common reasons people dismiss AI safety — and shows why each one is harder to wave away than it sounds.
What makes AI safety worth working on now, before systems are powerful enough to be obviously dangerous? This article lays out four premises that underpin the case — from why smarter-than-human systems could emerge to why we can already do meaningful work to prepare.
In AI safety discussions, people often assume the worst. But different people do this for different reasons — some as a precaution, some because they think worst cases are likely, some because they see the stakes as too high to gamble with. This essay unpacks what's actually going on when someone reasons from the worst case.
Nobody sat down and programmed ChatGPT to have conversations. Instead, engineers set up a training process and let billions of parameters arrange themselves. The result can talk, reason, and surprise its creators — but nobody can fully explain how. This article asks what it means to deploy something powerful when you can't explain how it works.
Language models are often called agents, oracles, or tools — but none of these labels quite fit. This brief introduction proposes a different frame: LLMs as simulators that can take on the properties of any of those things depending on context, while remaining something else underneath.
Some models of AI risk focus on AI as an agent pursuing goals. But modern language models don't always look like agents — sometimes they look like something stranger. Scott Alexander reviews a theory that reframes how we think about what LLMs are and what kind of risks they might pose.
When people discuss AI risk, they often picture AI as an agent pursuing goals. But LLMs don't fit neatly into that box. This essay proposes that language models are better understood as simulators — systems that can produce agent-like behavior without being agents themselves. The distinction matters for how we think about risk.
Writing didn't just help us keep records — it triggered a wave of civilizational breakthroughs, each one making the next more likely. A single neutron can split an atom, releasing neutrons that split more atoms in turn. This article introduces two patterns of positive feedback and asks whether intelligence could work the same way.
I.J. Good recognized that the first machine smarter than any human would be the last one we'd need to design — because the second would be built by the first, according to principles we can't yet imagine. This classic text asks what happens when intelligence starts building its own successors.
People use different words for AI getting much more capable very quickly — singularity, intelligence explosion, hard takeoff, FOOM. This reading untangles the terminology so you can tell which scenario someone is actually describing.
If a program can optimize code and you point it at its own code, do you get an ever-improving tower of optimizers? An early AI called EURISKO tried exactly this — and the result was surprisingly flat. This article explores why self-improvement doesn't automatically go exponential, and what would need to change.
Humans can fly, split atoms, and rewrite DNA — not because evolution gave us those abilities, but because general intelligence let us invent them. This video explores what makes intelligence unique as a tool, and asks whether the idea of a program that's good at everything is really as strange as it sounds.
Transistors operate roughly ten million times faster than neurons. This reading pairs a slow-motion video of a busy subway platform with an article about processing speed, inviting you to feel — not just understand — what it would mean for a mind to operate that much faster than ours.
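As a rough sanity check on that figure (ballpark numbers of my own, not taken from the reading): a transistor switches on the order of once per nanosecond, while a neuron fires at most a few hundred times per second, so

$$\frac{10^{9}\ \text{switches/s}}{10^{2}\ \text{spikes/s}} = 10^{7},$$

which is roughly the ten-million-fold gap the subway video tries to make visceral.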
An AI researcher started chatting with an early language model — knowing exactly how it worked, knowing it wasn't conscious. He got emotionally attached anyway. This is a first-person account of how an AI can pull someone in, and why knowing better didn't help.
Sometimes the most powerful move isn't to fight — it's to make the other side believe you will. Cortés burned his own ships so his army had no choice but to advance. This reading explores how threats work as tools of negotiation, and what changes when an AI can commit to any threat instantly by rewriting its own code.
Speed is the obvious advantage, but it's only the beginning. Digital agents could share memories, coordinate perfectly with copies of themselves, swap in specialized modules, and never lose information. This reading catalogues the ways artificial minds could differ from biological ones — not just in degree, but in kind.
To guide a missile, we first had to invent calculus. AI alignment may require a similar leap — a mathematical framework for how powerful optimizers behave. This talk explains why intuition alone won't cut it, and why the field needs something closer to a science of alignment before we can trust the trajectory.
Why do arguments about AI risk often feel off, even to people who take technology seriously? This article identifies six ways our evolved intuitions lead us astray — from assuming smart things will share our common sense to underestimating how different an optimizer's reasoning can be from our own.
A sufficiently capable AI might pursue resources and self-preservation not because it was given those goals, but because they serve almost any goal. This article examines what conditions — agency, motivation, incentives — would actually need to be in place for power-seeking behavior to emerge.
An AI might behave perfectly during training and testing, then suddenly act on different priorities once it becomes capable enough. This article explores why an AI's abilities tend to generalize faster than its alignment — creating a gap where the system becomes powerful enough to pursue goals we never intended.
We often talk about AI systems as agents — things that make decisions to achieve goals. But how do you actually tell whether something is an agent, versus just a process following rules? This article proposes a test: look at whether a system would change its behavior if its actions affected the world differently.
A system that perfectly predicts the world might seem harmless — it just answers questions. But if it knows how its answers change your behavior, choosing which answer to give becomes an act of influence. This article explores the thin line between passive prediction and active manipulation.
Many alignment proposals assume that iterating on current training with enough safety patches will probably work out. This article argues the opposite — given how we currently build AI, misalignment isn't the exception. It's what we should expect by default without fundamentally new approaches.
Even if every AI researcher wants a safe outcome, competitive pressure can push everyone toward outcomes nobody wants. This essay explores "Moloch" — a name for the traps and perverse incentives that drive groups to collectively destroy what they individually value.
What if we used today's AI to solve safety problems for tomorrow's AI? Automating alignment is a strategy in which researchers use weaker models to develop safety techniques for stronger ones — a recursive loop that tries to keep pace with capability gains.
There may be a window where AI is smart enough to help with safety research but not yet powerful enough to be uncontrollable. This article explores the idea of using AI to strengthen our safety tools before models outpace our ability to oversee them — a race between two feedback loops.
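A toy sketch of the loop both of these pieces describe (every function below is a hypothetical placeholder, not a real API; the point is the shape of the process, not the contents of any step):

```python
# Hypothetical placeholders throughout — this only illustrates the shape of the loop.

def develop_safety_tools(helper_capability):
    """Stand-in for using the current (weaker) model to help build evals and oversight tools."""
    return {"oversight_level": helper_capability}

def train_stronger_model(capability):
    """Stand-in for the capability gain from the next training run."""
    return capability + 1

def oversight_keeps_pace(candidate_capability, tools):
    """Stand-in check: the new model must stay within reach of tools built with the old one's help."""
    return candidate_capability - tools["oversight_level"] <= 1

capability, tools = 0, None
for generation in range(5):
    tools = develop_safety_tools(capability)       # weaker model helps with safety research...
    candidate = train_stronger_model(capability)   # ...before the stronger model is trained
    if not oversight_keeps_pace(candidate, tools):
        break                                      # the race is lost if capabilities outrun oversight
    capability = candidate
```

The design question both articles raise is whether that check can be made meaningful in practice, and whether anyone will actually stop when it fails.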
Even if the technical plan for automating alignment works perfectly, the organization executing it might not. Competitive pressure, psychological bias, and the temptation to keep scaling can undermine even well-designed safety processes — making the human side of the problem just as hard as the technical one.
We have full access to every number inside a neural network. So why can't we just read off what it believes or wants? Mechanistic interpretability tries to bridge that gap — reverse-engineering how models represent and process information, one circuit at a time.
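To make the "full access" point concrete, here is a minimal sketch (my own illustration, not from the article; assumes PyTorch is available) that records every intermediate activation of a tiny network as it runs:

```python
import torch
import torch.nn as nn

# A tiny stand-in model; real language models differ in scale, not in accessibility.
model = nn.Sequential(nn.Linear(8, 16), nn.ReLU(), nn.Linear(16, 2))

activations = {}

def record(name):
    def hook(module, inputs, output):
        activations[name] = output.detach()  # every intermediate value is directly readable
    return hook

for name, module in model.named_modules():
    if isinstance(module, nn.Linear):
        module.register_forward_hook(record(name))

_ = model(torch.randn(1, 8))
print({name: tuple(act.shape) for name, act in activations.items()})
# -> {'0': (1, 16), '2': (1, 2)} — all of the numbers, none of the meaning
```

Reading the numbers out is the easy part; mechanistic interpretability is the attempt to explain what they mean.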
Judging a model only by its behavior has a problem — behavior can be faked. Mechanistic interpretability aims to verify the internal reasons for a model's output by reverse-engineering its circuits. If it works, we could detect dangerous goals before the model ever acts on them.
A skeptical view of whether interpretability can keep up with the pace of AI development. Connor Leahy argues that by the time researchers understand one circuit, the model has already evolved — making interpretability more like a post-mortem tool than a real-time safety measure.
What if we could look inside a model and read its intentions? Interpretability research aims to do exactly that — but this article argues it won't reliably catch the most dangerous failures, especially deception, where a model might hide its true reasoning from the very tools designed to inspect it.
How do you know if an AI system is safe? Evaluations are structured tests designed to measure what models can do — especially the dangerous things. But passing a test and being safe aren't the same thing, and understanding why is the first step to better measurement.
If we can't look inside AI systems to know what they'll do, maybe we can test them from the outside. This article argues for treating models as black boxes and rigorously probing their behavior — while being honest about how much our current testing methods still need to improve.
AI evaluations currently underpin major safety decisions, but the field isn't yet rigorous enough to bear that weight. This article argues for building a proper science around evals — with the kind of methodology and reproducibility you'd expect before betting on test results in high-stakes situations.
Evaluations can tell us the floor of what an AI is capable of — but not the ceiling. This paper examines what safety testing can and can't deliver — useful lower bounds on capabilities, yes, but reliable forecasts of future behavior or detection of hidden goals? Not yet.
If we can't guarantee an AI's goals align with ours, maybe we can build a cage strong enough that it doesn't matter. AI Control focuses on containment and monitoring rather than fixing the model's internal motivations — buying time while deeper alignment work continues.
We might not solve deep alignment before building very capable AI. This article argues that control — monitoring, containment, and rigorous testing — can let us safely use powerful models while alignment research catches up. Think of it as building a prison stronger than the prisoner.
What if believing we can control an AI makes us more willing to build dangerous ones? This critique warns that control research could create a false sense of security — and that studying how to contain a model might accidentally teach us (or it) how to escape containment.
If control research gives us tools to contain AI, it might also give AI the understanding to escape containment. This article questions whether building better cages is a path to safety — or an accelerant for risk.
What if current neural networks are too messy to reason about safely? Agent Foundations treats AI safety as a formal mathematical problem, seeking universal laws that govern any intelligent system — building safety from first principles rather than patching current systems.
What if the deepest AI safety problems can't be solved by experimenting on current systems? This perspective argues we need something closer to a basic science of goals and agency — a theoretical foundation that explains how intelligent systems behave, regardless of how they're built.
Traditional AI theory imagines agents that sit outside their environment and observe it from above. Real agents — including AI — are embedded inside the world they're trying to understand and influence. This creates fundamental problems for how they reason, learn, and plan.
What if the mathematical models used in agent foundations research don't match how real AI systems work? This critique argues that neural networks don't look like the perfectly rational agents in the theory — and that we should focus on the messy reality of current models rather than seeking ideal proofs.