This paper introduces Nash Learning from Human Feedback (NLHF), a novel approach that leverages game theory to model learning from human preferences without an explicit reward model. It develops theoretical foundations and practical algorithms for learning policies preferred by humans, framed as finding the Nash equilibrium of a preference model within a KL-regularized framework.
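To make the game-theoretic framing concrete, here is a minimal, self-contained sketch (not the paper's implementation) of computing a KL-regularized Nash equilibrium for a toy tabular preference model. The preference matrix `P`, the reference policy `mu`, and the parameters `tau` and `eta` are illustrative assumptions, and the update is a simplified mirror-descent-style self-play iteration rather than the exact algorithm proposed in the paper.

```python
import numpy as np

# Toy preference model: P[i, j] = probability a human prefers response i over response j.
# (Hypothetical values; any matrix with P[i, j] + P[j, i] = 1 works.)
P = np.array([
    [0.5, 0.7, 0.6],
    [0.3, 0.5, 0.4],
    [0.4, 0.6, 0.5],
])

mu = np.full(3, 1.0 / 3.0)   # reference policy (e.g., an SFT model), uniform here
tau = 0.1                    # strength of the KL regularization toward mu
eta = 0.5                    # step size of the mirror-descent-style update
pi = mu.copy()               # start the learned policy at the reference

for _ in range(500):
    # Geometric mixture of the current policy and the reference (the KL regularization):
    log_mix = (1 - eta * tau) * np.log(pi) + eta * tau * np.log(mu)
    pi_mix = np.exp(log_mix - log_mix.max())
    pi_mix /= pi_mix.sum()

    # Expected preference of each response when the opponent plays the mixture policy.
    payoff = P @ pi_mix

    # Multiplicative-weights update: shift mass toward responses preferred over the opponent.
    logits = np.log(pi_mix) + eta * payoff
    pi = np.exp(logits - logits.max())
    pi /= pi.sum()

print("approximate regularized Nash policy:", np.round(pi, 3))
```

Under these assumptions, the iteration converges to a policy that cannot be beaten (in expected preference) by any deviation, while the geometric mixture keeps it anchored near the reference policy, which is the core idea of the KL-regularized preference game.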
The implications of this study extend to how LLMs learn from human feedback, fostering models that inherently align with human preferences and making machine-learning applications more intuitive and consistent with human values. It is a promising direction that challenges conventional reward-based approaches and opens up new avenues for research in AI.