STRUCTURAL AND STATISTICAL ANALYSIS OF LARGE DATASETS OF TERMS AND RELATED ARTICLES: EXAMPLES FROM WIKIPEDIA

Zoran Nikolić

Published: 06.06.2021. Review Scientific Paper

Abstract

Wikipedia is among the best-known collections of publicly available data on the Internet, containing millions of articles in many languages on a wide variety of topics. Complete dumps of all texts from the Wikipedia database are published in XML format and updated monthly. This paper analyses the content available on Wikipedia in the official languages of the former Yugoslavia and integrates it into a knowledge base. Although this data collection contains over 10 million articles, the number of terms and topics described is significantly smaller, because many articles only redirect to other articles, some are user conversations, and some are article templates. Articles, terms, and topics were classified in detail and their mutual connections were determined, using the English version of Wikipedia as an auxiliary dataset. Detailed statistical, structural, and cluster analyses were then performed on the resulting graph of interrelationships among articles, terms, and topics. Finally, force-directed graph layout algorithms were used to produce a comprehensive map and visualization of the knowledge base.

Keywords:

XML, big data, structural graph analysis, graph clustering, graph layout
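The classification step summarised in the abstract (separating real articles from redirects, conversation pages, and templates, then connecting articles through their links) can be illustrated with a minimal sketch. It is not the author's pipeline, which this page does not describe. It assumes the standard MediaWiki XML export layout (`page`, `title`, `ns`, `redirect`, `revision/text`) and the usual namespace numbering (0 for articles, odd numbers for talk pages, 10 for templates). The names `classify_pages`, `build_edges`, and `SAMPLE_DUMP` are hypothetical, and the sample data is invented for illustration.

```python
import io
import re
import xml.etree.ElementTree as ET
from collections import Counter

# Tiny invented stand-in for a Wikipedia dump (assumed MediaWiki export layout).
SAMPLE_DUMP = b"""<mediawiki xmlns="http://www.mediawiki.org/xml/export-0.10/">
  <page><title>Beograd</title><ns>0</ns>
    <revision><text>Capital of [[Srbija]].</text></revision></page>
  <page><title>Srbija</title><ns>0</ns>
    <revision><text>Capital: [[Belgrade|Beograd]].</text></revision></page>
  <page><title>Belgrade</title><ns>0</ns><redirect title="Beograd" />
    <revision><text>#REDIRECT [[Beograd]]</text></revision></page>
  <page><title>Talk:Beograd</title><ns>1</ns>
    <revision><text>Discussion.</text></revision></page>
  <page><title>Template:Infobox</title><ns>10</ns>
    <revision><text>Markup.</text></revision></page>
</mediawiki>"""

# Captures the target of [[Target]] or [[Target|label]], ignoring #section parts.
WIKILINK = re.compile(r"\[\[([^\]|#]+)")


def local(tag):
    """Strip the XML namespace so the code does not depend on the export version."""
    return tag.rsplit("}", 1)[-1]


def classify_pages(stream):
    """Stream over <page> elements; return (counts, redirects, links)."""
    counts = Counter()
    redirects = {}  # redirect title -> target title
    links = {}      # article title -> list of raw link targets
    for _, elem in ET.iterparse(stream, events=("end",)):
        if local(elem.tag) != "page":
            continue
        fields = {local(child.tag): child for child in elem}
        title = fields["title"].text
        ns = int(fields["ns"].text)
        if "redirect" in fields:
            kind = "redirect"
            redirects[title] = fields["redirect"].get("title")
        elif ns == 0:
            kind = "article"
            text = next((c.text or "" for c in fields["revision"]
                         if local(c.tag) == "text"), "")
            links[title] = [t.strip() for t in WIKILINK.findall(text)]
        elif ns % 2 == 1:
            kind = "talk"
        elif ns == 10:
            kind = "template"
        else:
            kind = "other"
        counts[kind] += 1
        elem.clear()  # release the page subtree; real dumps are very large
    return counts, redirects, links


def build_edges(links, redirects):
    """Resolve link targets through redirects; keep only article-to-article edges."""
    edges = set()
    for source, targets in links.items():
        for target in targets:
            resolved = redirects.get(target, target)
            if resolved in links and resolved != source:
                edges.add((source, resolved))
    return edges


counts, redirects, links = classify_pages(io.BytesIO(SAMPLE_DUMP))
edges = build_edges(links, redirects)
```

In this toy input only two of the five pages are articles, which mirrors the abstract's point that the number of described terms is much smaller than the raw page count. The resulting edge set is the kind of graph on which statistical, cluster, and layout analyses could then be run.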

The statements, opinions and data contained in the journal are solely those of the individual authors and contributors and not of the publisher and the editor(s). We stay neutral with regard to jurisdictional claims in published maps and institutional affiliations.

Vol 10, 2021
