Vol 11, Issue 1, 2021
Pages: 15 - 24
Review Scientific Paper

This is an open access article distributed under the Creative Commons Attribution License which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited. 

Published: 06.06.2021.

STRUCTURAL AND STATISTICAL ANALYSIS OF LARGE DATASETS OF TERMS AND RELATED ARTICLES: EXAMPLES FROM WIKIPEDIA

By
Zoran Nikolić

Faculty of Physics, University of Belgrade, Belgrade, Serbia

Abstract

Wikipedia is among the best-known collections of publicly available data on the Internet, containing millions of articles in many languages on a wide variety of topics. Complete dumps of all texts from the Wikipedia database, in XML format, are updated monthly. In this paper, the Wikipedia content written in the official languages of the former Yugoslavia is analysed and integrated into a single knowledge base. Although this data collection contains over 10 million articles, the number of terms and topics described is significantly smaller, because many articles only redirect to other articles, some are user conversations, and some are article templates. A detailed classification of articles, terms, and topics was performed and their mutual connections were obtained, using the English version of Wikipedia as an auxiliary dataset. Detailed statistical, structural, and cluster analyses were then performed on the resulting graph of interrelationships among articles, terms, and topics. Finally, force-directed graph layout algorithms were used to produce a comprehensive map and visualization of the knowledge base.
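The classification step described in the abstract can be illustrated with a minimal sketch. It is not the author's implementation: it assumes the standard MediaWiki XML export layout (`<page>` elements with `<title>`, `<ns>`, and an optional `<redirect title="..."/>` child), and the function name `classify_pages`, the category labels, and the sample pages are hypothetical choices made for this illustration.

```python
import io
import xml.etree.ElementTree as ET
from collections import Counter


def local(tag):
    """Strip the XML namespace prefix, e.g. '{uri}page' -> 'page'."""
    return tag.rsplit("}", 1)[-1]


def classify_pages(xml_source):
    """Stream a MediaWiki XML dump and count pages by (hypothetical) kind.

    Returns (counts, redirects), where redirects maps a redirecting
    title to its target title.
    """
    counts = Counter()
    redirects = {}
    for _, elem in ET.iterparse(xml_source, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {local(child.tag): child for child in elem}
        title = fields["title"].text
        ns = int(fields["ns"].text)
        if "redirect" in fields:
            kind = "redirect"
            redirects[title] = fields["redirect"].get("title")
        elif ns == 0:
            kind = "article"      # main namespace
        elif ns == 10:
            kind = "template"     # Template namespace
        elif ns == 2 or (ns > 0 and ns % 2 == 1):
            kind = "discussion_or_user"  # User pages and odd-numbered talk namespaces
        else:
            kind = "other"
        counts[kind] += 1
        elem.clear()  # release the page subtree to keep memory use low
    return counts, redirects


# Tiny made-up sample in the export format (not real dump content).
SAMPLE = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Beograd</title><ns>0</ns>
    <revision><text>Article text</text></revision></page>
  <page><title>Belgrade</title><ns>0</ns><redirect title="Beograd" />
    <revision><text>#REDIRECT [[Beograd]]</text></revision></page>
  <page><title>Talk:Beograd</title><ns>1</ns>
    <revision><text>Discussion</text></revision></page>
  <page><title>Template:Infobox</title><ns>10</ns>
    <revision><text>Template body</text></revision></page>
</mediawiki>"""

counts, redirects = classify_pages(io.BytesIO(SAMPLE))
print(dict(counts))
print(redirects)
```

Streaming with `iterparse` and clearing each page after it is processed is one way to handle dumps too large to load at once. The collected redirect map could then be used to merge redirecting titles into their targets before terms and topics are counted.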

The statements, opinions and data contained in the journal are solely those of the individual authors and contributors and not of the publisher and the editor(s). We stay neutral with regard to jurisdictional claims in published maps and institutional affiliations.