Data augmentation approaches in natural language processing: A survey

DOI: 10.1016/j.aiopen.2022.03.001

313 Citations • 2022

Bohan Li, Yutai Hou, Wanxiang Che

This paper groups data augmentation (DA) methods into three categories based on the diversity of the augmented data (paraphrasing, noising, and sampling) and introduces their applications in NLP tasks along with the associated challenges.

Abstract

As an effective strategy, data augmentation (DA) alleviates data scarcity in scenarios where deep learning techniques may fail. It was first widely applied in computer vision and then introduced to natural language processing, where it achieves improvements in many tasks. One of the main focuses of DA methods is to improve the diversity of training data, thereby helping the model generalize better to unseen testing data. In this survey, we frame DA methods into three categories based on the diversity of augmented data: paraphrasing, noising, and sampling. Our paper sets out to analyze DA methods in detail according to these categories. Further, we introduce their applications in NLP tasks as well as the challenges. Some helpful resources are provided in the appendix.