An extensive analysis of the performance of OCR systems on Brazilian Portuguese text, in the context of detecting misinformation spread through images on social platforms, reveals how the analyzed image aspects influence OCR accuracy.
The performance of OCR techniques is highly dependent on the application context and the language being processed. Studies focused on languages such as Brazilian Portuguese (Pt-Br) and on specific contexts are still scarce. Thus, in this work, we present an extensive analysis of the performance of OCR systems on Pt-Br text in the context of detecting misinformation spread through images on social platforms. To do this, we build a synthetic dataset from the texts of a labeled Pt-Br fact-checking dataset and from common patterns of images frequently shared on social media and messaging apps. Our results reveal the influence of the analyzed image aspects on OCR accuracy, highlighting those with the greatest impact. Further, we report considerable variation in performance among the evaluated OCR systems.