Increasing the overlap between human and AI values.

We study how to scalably steer advanced AI systems to act on human values—so progress in capability is matched by progress in safety.

Our approach

Safety research should connect rigorous theory to the systems people actually build and use.

01

Scalability

Self-other overlap training scales comparably to fine-tuning on multiple objectives. The cost does depend on the implementation, however: some ways of computing self-other overlap metrics are more computationally taxing than others.
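As a rough illustration of the multi-objective comparison, the sketch below treats the overlap objective as one extra weighted term added to an ordinary task loss. The function name `combined_loss`, the `overlap_weight` parameter, and its default value are assumptions made for this example, not a description of any particular implementation.

```python
def combined_loss(task_loss: float, overlap_loss: float,
                  overlap_weight: float = 0.1) -> float:
    """Weighted sum of a task loss and an overlap loss.

    Hypothetical sketch: the weighting scheme and the default weight
    are assumptions for illustration only.
    """
    if overlap_weight < 0:
        raise ValueError("overlap_weight must be non-negative")
    # The overlap term is one additional scalar per training step,
    # in the same way as any other auxiliary fine-tuning objective.
    return task_loss + overlap_weight * overlap_loss
```

Under this assumption, the extra training cost comes mainly from computing `overlap_loss` itself, which is where implementation choices matter.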

02

Generality (architecture-agnosticism)

Self and other representations are useful to any generally intelligent agent, regardless of the specific ML setup. Because the technique makes minimal assumptions, adapting it to a different architecture should be relatively straightforward as long as it is easy to create self-referencing and other-referencing inputs, which seems to be the case in both reinforcement learning and language modeling.

03

Low interpretability requirement

Better interpretability tools could make self-other overlap training faster, for example by showing which regions of the activation space are most relevant. Even so, we expect the method to work with few or no interpretability tools, because performing a specific mathematical operation on the activation matrices does not require a deep understanding of what the activations represent.
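To make "a mathematical operation on the activation matrices" concrete, here is a minimal sketch that assumes the operation is a mean squared difference between activations recorded on a self-referencing input and on an other-referencing input. The choice of metric, the function name `self_other_overlap_loss`, and the plain nested-list representation are all assumptions for illustration; the text above does not specify them.

```python
from typing import List

Matrix = List[List[float]]


def self_other_overlap_loss(self_acts: Matrix, other_acts: Matrix) -> float:
    """Mean squared difference between two equally shaped activation matrices.

    Hypothetical sketch: a lower value means the two sets of activations
    are more similar. No knowledge of what each activation represents
    is needed to compute it.
    """
    if len(self_acts) != len(other_acts):
        raise ValueError("matrices must have the same number of rows")
    total = 0.0
    count = 0
    for row_self, row_other in zip(self_acts, other_acts):
        if len(row_self) != len(row_other):
            raise ValueError("rows must have the same length")
        for a, b in zip(row_self, row_other):
            total += (a - b) ** 2
            count += 1
    if count == 0:
        raise ValueError("matrices must not be empty")
    return total / count


# Toy activations (made-up numbers, not real model outputs).
self_acts = [[1.0, 2.0], [0.0, 1.0]]
other_acts = [[1.0, 4.0], [0.0, 1.0]]
loss = self_other_overlap_loss(self_acts, other_acts)
```

The function only compares numbers position by position, which is why it needs no interpretability tooling; such tooling would only help decide which activations to pass in.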

04

Low capabilities externalities

This technique does not rely on making the model more capable in order to advance its alignment with humans. Ideally, the model remains at a capability level similar to that of a model trained without the self-other overlap objective. In some cases, capabilities may decrease slightly if training does not allow enough time for convergence.

Work with us

Bring more perspectives into the overlap.

We welcome conversations with researchers, engineers, funders, and policy teams working toward safer advanced AI.