ParaCrawl: Web-Scale Acquisition of Parallel Corpora

130 Citations, 2020
Marta Bañón, Pinzhen Chen, Barry Haddow

This paper reports on methods for creating the largest publicly available parallel corpora by crawling the web with open-source software, and evaluates the quality of these corpora and their usefulness for building machine translation systems.

Abstract

Marta Bañón, Pinzhen Chen, Barry Haddow, Kenneth Heafield, Hieu Hoang, Miquel Esplà-Gomis, Mikel L. Forcada, Amir Kamran, Faheem Kirefu, Philipp Koehn, Sergio Ortiz Rojas, Leopoldo Pla Sempere, Gema Ramírez-Sánchez, Elsa Sarrías, Marek Strelec, Brian Thompson, William Waites, Dion Wiggins, Jaume Zaragoza. Proceedings of the 58th Annual Meeting of the Association for Computational Linguistics. 2020.