This paper by Rafael Rafailov et al. examines Direct Preference Optimization (DPO) within a token-level Markov Decision Process (MDP), framing DPO as a form of reinforcement learning for language models and thereby bridging the gap between DPO and traditional RLHF methods.
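To make the connection concrete, below is a minimal sketch of the standard DPO objective computed from summed per-token log-probabilities, the quantity that the token-level MDP view reinterprets as implicit per-token rewards. The function and variable names are illustrative assumptions, not taken from the authors' code.

```python
import torch
import torch.nn.functional as F

def dpo_loss(policy_chosen_logps, policy_rejected_logps,
             ref_chosen_logps, ref_rejected_logps, beta=0.1):
    """Standard DPO loss from summed per-token log-probabilities.

    Each argument is a tensor of shape (batch,) holding the sum of token
    log-probs of the chosen / rejected response under the trainable policy
    or the frozen reference model.
    """
    # Implicit rewards: beta-scaled log-ratio between policy and reference.
    chosen_rewards = beta * (policy_chosen_logps - ref_chosen_logps)
    rejected_rewards = beta * (policy_rejected_logps - ref_rejected_logps)

    # Bradley-Terry preference loss on the reward difference.
    return -F.logsigmoid(chosen_rewards - rejected_rewards).mean()

# Toy usage with random log-probabilities (stand-ins for real model outputs).
if __name__ == "__main__":
    batch = 4
    loss = dpo_loss(torch.randn(batch), torch.randn(batch),
                    torch.randn(batch), torch.randn(batch))
    print(loss.item())
```

Under the paper's token-level reading, the per-token log-ratio terms that sum into these quantities play the role of credit assignment over individual tokens rather than a single sequence-level reward.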
The authors argue that the token-level MDP formulation allows for a more nuanced interpretation of language model responses and could be instrumental in developing AI systems capable of complex decision-making tasks. Further studies might explore the applicability of this approach in broader contexts such as multi-turn dialogue and end-to-end system design.