Increasing the overlap between human and AI values.

We study how to scalably steer advanced AI systems to act on human values—so progress in capability is matched by progress in safety.

Explore our work

Research at the intersection.

Two connected areas for understanding altruistic behaviour and translating those insights into safer advanced AI systems.

01 / Alignment

AI Alignment

Developing scalable methods to steer advanced AI systems toward human values, with a focus on self-other overlap.

02 / Altruism

Cognitive Science of Altruism

Studying how humans represent self and others, and how empathy, identity, and prosocial motivation can inform AI alignment.

Our approach

Safety research should connect rigorous theory to the systems people actually build and use.

Scalability

Self-other overlap training scales comparably to fine-tuning on multiple objectives. The cost is implementation-dependent, however: some ways of implementing self-other overlap metrics could be more computationally taxing than others.
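To make the comparison with multi-objective fine-tuning concrete, here is a minimal sketch in which an overlap term is added to an ordinary task loss through a weighting coefficient. The names `combined_loss` and `overlap_weight`, the additive form, and the numbers are illustrative assumptions rather than the formulation used in the research; how expensive the overlap term is to compute depends on the metric chosen.

```python
def combined_loss(task_loss: float, overlap_loss: float,
                  overlap_weight: float = 0.1) -> float:
    """Weighted sum of a task loss and a self-other overlap loss.

    Hypothetical formulation: `overlap_weight` trades off the two
    objectives, as in other multi-objective fine-tuning setups.
    """
    if overlap_weight < 0:
        raise ValueError("overlap_weight must be non-negative")
    return task_loss + overlap_weight * overlap_loss


# Toy values, chosen only for illustration.
print(combined_loss(task_loss=2.0, overlap_loss=0.5, overlap_weight=0.5))  # 2.25
```

With this shape, the extra objective adds one term per training step, so the overall cost is driven mainly by how the overlap term itself is computed.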

Generality (architecture-agnosticism)

Self and other representations are useful for any generally intelligent agent, regardless of the specific ML setup. As long as it is easy to create self-referencing and other-referencing inputs, which seems to be the case in both reinforcement learning and language modeling, the technique should be relatively straightforward to adapt to a different architecture, because it makes minimal assumptions.

Low interpretability requirement

Better interpretability tools could potentially make self-other overlap training faster by showing which regions of the activation space are more relevant. Even so, we expect this method to work with little to no interpretability tooling, because performing a specific mathematical operation on the activation matrices does not require a deep understanding of what the activations represent.
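As a minimal sketch of what such an operation could look like, the snippet below computes the mean squared difference between two activation matrices, one recorded on a self-referencing input and one on an other-referencing input. The function name `overlap_distance`, the choice of mean squared difference, and the toy matrices are illustrative assumptions, not the metric used in the research.

```python
from typing import List

Matrix = List[List[float]]


def overlap_distance(self_acts: Matrix, other_acts: Matrix) -> float:
    """Mean squared difference between two same-shaped activation matrices.

    Hypothetical metric: a lower value means the activations on the
    self-referencing and other-referencing inputs are more alike.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("matrices must have the same number of rows")
    total = 0.0
    count = 0
    for row_self, row_other in zip(self_acts, other_acts):
        if len(row_self) != len(row_other):
            raise ValueError("rows must have the same length")
        for a, b in zip(row_self, row_other):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("matrices must not be empty")
    return total / count


# Toy activations, chosen only for illustration.
self_acts = [[0.5, 1.0], [0.0, -1.0]]
other_acts = [[0.5, 0.0], [1.0, -1.0]]
print(overlap_distance(self_acts, other_acts))  # 0.5
```

Nothing in this computation depends on knowing what any individual activation represents; it only needs the two matrices to have the same shape.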

Low capabilities externalities

This technique does not rely on making the model more capable in order to advance its alignment with humans. A model trained with the self-other overlap objective ideally remains at a capability level similar to that of a model trained without it. In some cases, the objective might slightly reduce capabilities if training is not given enough time to converge.

Selected research

Self-other overlap in theory and practice.

Research introducing self-other overlap as an alignment approach and testing whether it can reduce deceptive behaviour while preserving model performance.

Research agenda · 2024

Bring more perspectives into the overlap.

We welcome conversations with researchers, engineers, funders, and policy teams working toward safer advanced AI.

Start a conversation

Explore the work