AI Alignment
How do we get AI systems to align with real human social contracts and values?
Mostly sourced from OpenAI’s approach to alignment
- RLHF:
  - Summarization from human feedback was the first convincing proof of concept that RLHF works on language models and that you can optimize goals that are fuzzy and somewhat ambiguous.
  - How do we optimize for goals that are not easily quantifiable? RLHF's answer is to train a reward model on human preference comparisons and then optimize the policy against it (see the reward-model sketch after this list).
- InstructGPT demonstrated that there is a real “alignment overhang” in language models that wasn’t very hard to access. The amount of human feedback needed for an astounding ~100x improvement (labelers preferred outputs from the 1.3B InstructGPT model over those of the 175B GPT-3) was moderate and achievable: ~50,000 comparisons and ~300,000 training episodes. That number is so small that we could actually have humans hand-label every training episode (the objective those episodes optimize is sketched below).
- Using models to augment rather than replace human feedback. Helping humans find ~50% more flaws than they would have found unassisted, with a model that isn’t superhuman, on a task that isn’t hard for humans, is a surprisingly strong result: it shows the model can already add a lot of value for feedback assistance (a minimal assisted-review loop is sketched below).
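
For concreteness, here is a minimal sketch (assuming PyTorch) of the pairwise comparison loss typically used to train the reward model that stands in for the fuzzy goal; the tensor values below are toy numbers, not data from the papers.

```python
import torch
import torch.nn.functional as F

def reward_model_loss(reward_chosen: torch.Tensor, reward_rejected: torch.Tensor) -> torch.Tensor:
    """Pairwise (Bradley-Terry style) loss for training a reward model from
    human comparisons: push the scalar reward of the preferred response
    above that of the rejected one."""
    # -log sigmoid(r_chosen - r_rejected), averaged over the batch of comparisons
    return -F.logsigmoid(reward_chosen - reward_rejected).mean()

# Toy usage: in practice these scalars come from a reward head on top of a
# language model, scored on (prompt, response) pairs ranked by human labelers.
r_chosen = torch.tensor([1.2, 0.3, 0.9])
r_rejected = torch.tensor([0.4, 0.5, -0.1])
print(reward_model_loss(r_chosen, r_rejected).item())
```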
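A rough sketch of the per-episode objective those ~300,000 episodes optimize, assuming the standard InstructGPT-style setup: the learned reward minus a KL penalty that keeps the tuned policy near the supervised model. The coefficient is illustrative, and the pretraining-mix term from the paper is omitted.

```python
import torch

def rlhf_objective(reward_model_score: torch.Tensor,
                   logprob_policy: torch.Tensor,
                   logprob_ref: torch.Tensor,
                   kl_coef: float = 0.02) -> torch.Tensor:
    """Objective maximized during RLHF fine-tuning: learned reward minus a
    KL penalty toward the supervised (reference) model. kl_coef is an
    illustrative value; InstructGPT also mixes in pretraining gradients,
    which this sketch omits."""
    kl_penalty = kl_coef * (logprob_policy - logprob_ref)  # log(pi_policy / pi_ref)
    return (reward_model_score - kl_penalty).mean()

# Toy usage with made-up per-episode scores and sequence log-probabilities.
score = torch.tensor([0.8, 1.1])
lp_policy = torch.tensor([-12.0, -9.5])
lp_ref = torch.tensor([-13.0, -10.0])
print(rlhf_objective(score, lp_policy, lp_ref).item())
```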
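As an illustration of the assistance setup, a minimal sketch of a critique-assisted review loop; the `critique_model` callable and the prompt wording are assumptions for illustration, not the actual interface used in the work.

```python
from typing import Callable, List

def assisted_review(passage: str, summary: str,
                    critique_model: Callable[[str], List[str]]) -> List[str]:
    """Model-assisted feedback: an assistant model proposes candidate flaws in
    a summary and a human rater verifies them, rather than searching unaided.
    `critique_model` is a hypothetical stand-in for any model that returns a
    list of critiques for a prompt."""
    prompt = (
        f"Passage:\n{passage}\n\n"
        f"Summary:\n{summary}\n\n"
        "List specific ways the summary is inaccurate, misleading, or missing key points."
    )
    candidate_flaws = critique_model(prompt)
    # In a real setup a human accepts or rejects each candidate flaw;
    # a trivial filter stands in for that judgment here.
    return [flaw for flaw in candidate_flaws if flaw.strip()]
```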