How do we get AI systems to align with real human social contracts and values? Or, in more mathematical terms, how do we translate our soft and squishy human values into hard mathematical formulas and policies?

Mostly sourced from OpenAI’s approach to alignment

  • RLHF: Summarization from human feedback was the first convincing proof of concept that RLHF works on language models, and that you can optimize for goals that are fuzzy and somewhat ambiguous.
    • How do we optimize for goals that are not easily quantifiable? (See the reward-model sketch after this list.)
  • InstructGPT demonstrated that there is a real “alignment overhang” in language models that wasn’t very hard to access. The amount of human feedback needed for an improvement equivalent to a roughly 100x larger model (the 1.3B InstructGPT was preferred over the 175B GPT-3) was moderate and achievable: ~50,000 comparisons and ~300,000 episodes of training. That number is small enough that humans could actually hand-label every training episode.
  • Using models to augment rather than replace humans. Helping humans find 50% more flaws than they would have found unassisted, with a model that isn’t superhuman, on a task that isn’t hard for humans, is a surprisingly strong result: it shows that models can already add a lot of value as feedback assistants (a small sketch of this assisted-review loop follows after this list).
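
To make the RLHF point concrete, here is a minimal sketch (not OpenAI’s actual code) of how fuzzy human preferences become an optimizable signal: pairwise comparisons train a reward model, and the policy is later optimized against its scalar output. All class and function names are illustrative, and the encoder that would produce the features is omitted.

```python
# Sketch: turning "summary A is better than summary B" comparisons into a
# scalar reward a policy can be optimized against. Illustrative only.
import torch
import torch.nn as nn

class RewardModel(nn.Module):
    """Maps an encoded (prompt, response) pair to a single scalar reward."""
    def __init__(self, embed_dim: int = 768):
        super().__init__()
        self.score = nn.Sequential(
            nn.Linear(embed_dim, 256),
            nn.ReLU(),
            nn.Linear(256, 1),
        )

    def forward(self, features: torch.Tensor) -> torch.Tensor:
        return self.score(features).squeeze(-1)

def preference_loss(reward_chosen: torch.Tensor,
                    reward_rejected: torch.Tensor) -> torch.Tensor:
    """Bradley-Terry style loss: push the chosen response's reward above the rejected one's."""
    return -torch.log(torch.sigmoid(reward_chosen - reward_rejected)).mean()

# One training step over a batch of human comparisons
# (on the order of ~50,000 such comparisons sufficed for InstructGPT).
model = RewardModel()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

chosen_feats = torch.randn(32, 768)    # stand-in features for preferred responses
rejected_feats = torch.randn(32, 768)  # stand-in features for rejected responses

loss = preference_loss(model(chosen_feats), model(rejected_feats))
loss.backward()
optimizer.step()
```

Once trained, the reward model’s output is the “hard number” the RL step optimizes, which is the whole trick for goals that resist direct quantification.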
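And a small sketch of the augment-rather-than-replace idea from the last bullet: the assistant model only proposes candidate flaws, and a human decides which ones are real. `Critique`, `generate_critiques`, and `assisted_review` are hypothetical placeholders, not any real API.

```python
# Sketch: critique-assisted review. A non-superhuman assistant widens the
# search for flaws; the human stays the judge of what counts. Illustrative only.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Critique:
    claim: str     # the flaw the model thinks it found
    evidence: str  # the passage it points to

def generate_critiques(source: str, summary: str, n: int = 5) -> List[Critique]:
    """Placeholder for an assistant model that lists possible flaws in `summary`."""
    raise NotImplementedError("call your critique model here")

def assisted_review(source: str, summary: str,
                    human_verify: Callable[[Critique], bool]) -> List[Critique]:
    """Return only the model-proposed flaws that a human reviewer confirms."""
    candidates = generate_critiques(source, summary)
    return [c for c in candidates if human_verify(c)]
```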