Weak-To-Strong Generalization: Eliciting Strong Capabilities With Weak Supervision

AI Safety Fundamentals: Alignment - A podcast by BlueDot Impact

Widely used alignment techniques, such as reinforcement learning from human feedback (RLHF), rely on the ability of humans to supervise model behavior—for example, to evaluate whether a model faithfully followed instructions or generated safe outputs. However, future superhuman models will behave in complex ways too difficult for humans to reliably evaluate; humans will only be able to weakly supervise superhuman models. We study an analogy to this problem: can weak model supervision elicit t...
