This work investigates how reinforcement learning (RL) can be used to finetune the downstream performance of pretrained Language Models (LMs) and implements an off-policy, value-based algorithm, Deep Q-Learning (DQN), for finetuning language models on downstream applications.
Recent interest in Large Language Models (LLMs) and human alignment calls for effective finetuning methods. In this work, we investigate how reinforcement learning (RL) can be used to finetune the downstream performance of pretrained Language Models (LMs). Recently, on-policy RL algorithms have shown promise for text generation tasks. However, they face several empirical challenges, including (1) training instability due to the large action space and (2) sample inefficiency. In this paper, we explore methods to address both of these limitations. First, we implement a variety of sampling techniques that effectively restrict the total action space without compromising performance and that yield significant improvements over vanilla proximal policy optimization (PPO). Second, we implement an off-policy, value-based algorithm: Deep Q-Learning (DQN). We demonstrate that DQN can be applied to finetuning language models for downstream applications. However, further exploration and tuning is required to determine whether it can achieve better sample efficiency than on-policy algorithms.
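To illustrate the idea of restricting the action space during RL finetuning, the sketch below masks a pretrained LM's per-token logits to the top-k candidates before sampling a rollout action. This is a minimal sketch only: the helper name `restrict_action_space`, the use of a top-k filter, and the setting `k=50` are illustrative assumptions, not the specific sampling techniques evaluated in the paper.

```python
import torch
import torch.nn.functional as F

def restrict_action_space(logits: torch.Tensor, k: int = 50) -> torch.Tensor:
    """Mask all but the top-k token logits so the RL policy samples
    from a reduced action space (hypothetical helper, for illustration)."""
    topk_vals, _ = torch.topk(logits, k, dim=-1)
    cutoff = topk_vals[..., -1, None]          # k-th largest logit per position
    return logits.masked_fill(logits < cutoff, float("-inf"))

# Example: sample the next token for an on-policy rollout from the restricted space.
logits = torch.randn(1, 50_000)                # per-token scores from a pretrained LM
probs = F.softmax(restrict_action_space(logits, k=50), dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
```

Under this kind of filtering, the policy never assigns probability mass to tokens outside the top-k set, which is one plausible way to reduce the effective action space the RL algorithm must explore.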