dc.contributor.author |
Liubonko, Kateryna
|
|
dc.date.accessioned |
2020-02-25T15:09:18Z |
|
dc.date.available |
2020-02-25T15:09:18Z |
|
dc.date.issued |
2020 |
|
dc.identifier.citation |
Liubonko, Kateryna. Matching Red Links with Wikidata Items : Master Thesis : manuscript rights / Kateryna Liubonko ; Supervisor Diego Sáez-Trumper ; Ukrainian Catholic University, Department of Computer Sciences. – Lviv : [s.n.], 2020. – 44 p. : ill. |
uk |
dc.identifier.uri |
http://er.ucu.edu.ua/handle/1/2051 |
|
dc.language.iso |
en |
uk |
dc.subject |
Word embeddings |
uk |
dc.subject |
Graph embeddings |
uk |
dc.subject |
Cross-lingual embedding similarity model |
uk |
dc.title |
Matching Red Links with Wikidata Items |
uk |
dc.type |
Preprint |
uk |
dc.status |
Published for the first time |
uk |
dc.description.abstracten |
This work tackles the problem of matching Wikipedia red links with existing articles. Links in Wikipedia pages are considered red when they lead to nonexistent articles. Articles corresponding to such red links may exist in other Wikipedia editions. We propose a way to match red links in one Wikipedia edition to existing pages in another edition, and we solve this task for Ukrainian red links and existing English pages. We created a dataset of the 3,171 most frequent Ukrainian red links and a dataset of 2,957,927 pairs of red links and the most probable candidates for the corresponding pages in English Wikipedia. This dataset is publicly released. We define the task as a Named Entity Linking problem: red links are named entities, and we link Ukrainian red links to English Wikipedia pages. We provide a thorough analysis of the data and define its conceptual characteristics to exploit in entity resolution. These characteristics are graph properties (connections with the pages where red links occur and connections with the pages that occur on the same pages as the red links) and word properties (titles). We applied the BabelNet knowledge base to this task, evaluated its performance in terms of F1 score (29%), and took it as a baseline for our approach. To improve the results, we introduced several similarity metrics based on these red link characteristics. Combined in a linear model, they yielded an F1 score of 85%, which is our best result. We also discuss bottlenecks and limitations of the current approach and outline ideas for future improvements. To the best of our knowledge, we are the first to state the problem and propose a solution for red links in the Ukrainian Wikipedia edition. All the code for this project is publicly released on GitHub. |
uk |