Proceedings of the ACM Web Conference 2023

149 Citations • 2023

Kunpeng Guo, Dennis Diefenbach, Antoine Gourru

Abstract

Wikidata has grown into a knowledge graph of impressive size. To date, it contains more than 17 billion triples collecting information about people, places, films, stars, publications, proteins, and many more. On the other hand, most of the information on the Web is not published in highly structured data repositories like Wikidata, but rather as unstructured and semi-structured content, more concretely in HTML pages containing text and tables. Finding, monitoring, and organizing this data in a knowledge graph requires considerable work from human editors. The volume and complexity of the data make this task difficult and time-consuming. In this work, we present a framework that is able to identify and extract new facts that are published under multiple Web domains so that they can be proposed for validation by Wikidata editors. The framework relies on question-answering technologies. We take inspiration from ideas that are used to extract facts from textual collections and adapt them to extract facts from Web pages. To achieve this, we demonstrate that language models can be adapted to extract facts not only from textual collections but also from Web pages. By exploiting the information already contained in Wikidata, the proposed framework can be trained without the need for any additional learning signals and can extract new facts for a wide range of properties and domains. Following this path, Wikidata can be used as a seed to extract facts on the Web. Our experiments show that we can achieve a mean F1-score of 84.07. Moreover, our estimations show that we can potentially extract millions of facts that can be proposed for human validation. The goal is to help editors in their daily tasks and contribute to the completion of the Wikidata knowledge graph.
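
The idea of using Wikidata as a seed can be illustrated with a minimal sketch. It is not the authors' implementation: the abstract does not describe how questions are phrased or how answers are located, so the question template, the `QAExample` type, and the exact-string matching below are all assumptions made for illustration. The sketch turns one known (subject, property, object) fact and the text of a Web page into an extractive question-answering training example, which is one plausible way to train without additional learning signals.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class QAExample:
    """One extractive QA training example (hypothetical structure)."""
    question: str
    context: str
    answer: str
    answer_start: int  # character offset of the answer inside context


def make_question(subject_label: str, property_label: str) -> str:
    # Assumed template; the abstract does not specify how questions are built.
    return f"What is the {property_label} of {subject_label}?"


def build_example(
    subject_label: str,
    property_label: str,
    object_label: str,
    page_text: str,
) -> Optional[QAExample]:
    """Label a page with a fact already known from the knowledge graph.

    Returns None when the object's label does not occur verbatim in the
    page text, so the page yields no training signal for this fact.
    """
    start = page_text.find(object_label)  # exact match only (assumption)
    if start == -1:
        return None
    return QAExample(
        question=make_question(subject_label, property_label),
        context=page_text,
        answer=object_label,
        answer_start=start,
    )


page = "Inception is a 2010 film directed by Christopher Nolan."
example = build_example("Inception", "director", "Christopher Nolan", page)
missing = build_example("Inception", "composer", "Hans Zimmer", page)
```

Here `example` carries the answer span for the known fact, while `missing` is `None` because the page never mentions the object. A model trained on such examples could then be asked the same kind of question about entities whose facts are not yet in the knowledge graph, with its answers proposed to editors for validation.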