As more and more long-form content in the domain of public health is published on the Internet, it is becoming increasingly time-consuming for advice seekers, healthcare professionals, and researchers to filter out the content they need. Good summaries help these readers quickly scan the highlights of a large amount of textual content and save a significant amount of time when studying the relevant documents. The emergence of large language models has redefined the state of the art for automatic text summarization. However, in-domain datasets for text summarization are scarce and hard to construct. In this work, we build an abstractive text summarization model for Arabic-language content using the state-of-the-art “Transformer” architecture. We propose an iterative data augmentation technique that combines synthetic in-domain data with real in-domain summarization data for Arabic. We evaluate a fine-tuned BART model on held-out data and show that this strategy yields better summarization performance in terms of ROUGE score: compared to a model trained without data augmentation, we achieve a slight improvement of +0.6 and +0.5 in ROUGE-1 F1 (R1 F1) on the development and test sets, respectively.
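To make the described pipeline concrete, the following is a minimal sketch, not the authors' actual implementation. It assumes the HuggingFace transformers and rouge-score libraries, an illustrative BART checkpoint (an Arabic-capable checkpoint would be used in practice), and a hypothetical fine_tune helper standing in for a standard seq2seq training loop. Each augmentation round, the current model produces synthetic summaries for unlabeled in-domain documents, which are mixed with the real pairs for the next fine-tuning pass; ROUGE-1 F1 is the reported metric.

```python
# Minimal sketch (assumptions noted below), not the paper's actual code.
import torch
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM
from rouge_score import rouge_scorer

model_name = "facebook/bart-base"  # assumption: an Arabic-capable checkpoint in practice
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForSeq2SeqLM.from_pretrained(model_name)

def generate_summary(document: str, max_len: int = 128) -> str:
    """Generate an abstractive summary with beam search."""
    inputs = tokenizer(document, truncation=True, max_length=1024,
                       return_tensors="pt")
    with torch.no_grad():
        ids = model.generate(**inputs, max_length=max_len, num_beams=4)
    return tokenizer.decode(ids[0], skip_special_tokens=True)

def rouge1_f1(reference: str, hypothesis: str) -> float:
    """Score a generated summary against its reference (ROUGE-1 F1)."""
    scorer = rouge_scorer.RougeScorer(["rouge1"], use_stemmer=False)
    return scorer.score(reference, hypothesis)["rouge1"].fmeasure

def augment_and_retrain(labeled_pairs, unlabeled_docs, rounds: int = 2):
    """Iterative data augmentation sketch: each round, the current model
    labels unlabeled in-domain documents with synthetic summaries, and the
    mix of real and synthetic pairs is used for the next fine-tuning pass.
    `fine_tune` is a hypothetical helper standing in for a standard
    seq2seq training loop (e.g. transformers' Seq2SeqTrainer)."""
    train_pairs = list(labeled_pairs)
    for _ in range(rounds):
        synthetic = [(doc, generate_summary(doc)) for doc in unlabeled_docs]
        fine_tune(model, train_pairs + synthetic)  # hypothetical helper, not shown
    return model
```

The design choice this sketch reflects is self-training: synthetic in-domain pairs are treated as additional supervision alongside the scarce real pairs, which is how a small in-domain summarization dataset can be stretched without new annotation.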