Fine-tuning large neural language models for biomedical natural language processing
Abstract
Large neural language models have transformed modern natural language processing (NLP) applications. However, fine-tuning such models for specific tasks remains challenging as model size increases, especially with the small labeled datasets that are common in biomedical NLP. We conduct a systematic study of fine-tuning stability in biomedical NLP. We show that fine-tuning performance may be sensitive to pretraining settings and explore techniques for addressing fine-tuning instability. We show that these techniques can substantially improve fine-tuning performance for low-resource biomedical NLP applications. Specifically, freezing lower layers is helpful for standard BERT-BASE models, while layerwise learning-rate decay is more effective for BERT-LARGE and ELECTRA models. For low-resource text similarity tasks, such as BIOSSES, reinitializing the top layers is the optimal strategy. Overall, domain-specific vocabulary and pretraining facilitate robust models for fine-tuning. Based on these findings, we establish a new state of the art on a wide range of biomedical NLP applications.
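To make the three stabilization strategies named above concrete, the sketch below shows one way they can be expressed with the Hugging Face transformers API. It is a minimal illustration, not the paper's exact recipe: the checkpoint name, the number of frozen or reinitialized layers, and the learning-rate values are illustrative assumptions, and the attribute names (model.embeddings, model.encoder.layer) assume a BERT-style encoder; ELECTRA's module layout differs slightly.

```python
import torch
from transformers import AutoModel

# Illustrative base checkpoint; in practice a biomedical checkpoint (e.g., PubMedBERT) would be loaded.
model = AutoModel.from_pretrained("bert-base-uncased")

def freeze_lower_layers(model, k=6):
    """Strategy 1: freeze the embeddings and the lowest k transformer layers (BERT-BASE)."""
    for param in model.embeddings.parameters():
        param.requires_grad = False
    for layer in model.encoder.layer[:k]:
        for param in layer.parameters():
            param.requires_grad = False

def layerwise_decay_param_groups(model, base_lr=2e-5, decay=0.95):
    """Strategy 2: layerwise learning-rate decay (BERT-LARGE / ELECTRA).

    Each block's learning rate shrinks geometrically the further it sits below the top layer.
    """
    blocks = [model.embeddings] + list(model.encoder.layer)
    num_blocks = len(blocks)
    groups = []
    for i, block in enumerate(blocks):
        lr = base_lr * (decay ** (num_blocks - 1 - i))
        groups.append({"params": list(block.parameters()), "lr": lr})
    return groups

def reinit_top_layers(model, n=2):
    """Strategy 3: re-initialize the top n transformer layers (e.g., for BIOSSES-style low-resource tasks)."""
    for layer in model.encoder.layer[-n:]:
        layer.apply(model._init_weights)  # reuse the model's own weight-initialization routine

# Example: apply layerwise decay when constructing the optimizer.
optimizer = torch.optim.AdamW(layerwise_decay_param_groups(model))
```

In this sketch the strategies are independent; a practitioner would pick one per model family (freezing for BERT-BASE, layerwise decay for BERT-LARGE and ELECTRA, top-layer reinitialization for low-resource similarity tasks), as summarized in the abstract.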