The Wikipedia Knowledge Graph

Author: Matei Gabriel Cosa

AI research, cool people, and a Bending Spoons aperitivo – these are some of the highlights of Bocconi AI & Neuroscience Student Association’s last event of the academic year. After months of work on interesting and creative projects, our researchers got the chance to present their findings in front of an impressive jury of Bending Spoons AI Researchers and Data Scientists.

The projects ranged from applications of AI to cryptocurrencies and financial statement analysis, to computational neuroscience, all of which were exciting and challenging. My work as Project Lead focused on creating and analysing a Knowledge Graph of English Wikipedia under the guidance of Prof. Luca Saglietti. Together with Keshav Ganesh, Kristian Gijka, and Pavle Lalić, we set out to find an appropriate representation for the network of Wikipedia pages. Using a combination of heuristics and state-of-the-art graph neural networks, we investigated the similarity between different pages without accessing their content. The jury awarded first place to our approach.

Goal: from Plato to Wikipedia

The project was inspired by the idea of trying to understand how knowledge is formed: what connections can be drawn between different concepts, and is there a way to explore the structure of human knowledge? Such questions have puzzled great thinkers across millennia, such as Plato, who proposed his world of ideas. This theory is centred around forms, which represent the non-physical essences of all things, while humans only have access to the limited perspective of the physical world.

We might not be able to grasp Plato’s metaphysical collection of knowledge, but we can take a look at the world’s largest free encyclopaedia: Wikipedia. This platform can be thought of as a database of concepts, events, and personalities who shaped how we understand the world today. Therefore, if one attempts to study knowledge, Wikipedia appears to be the right place to start.

In order to analyse the content of Wikipedia, one needs to find a suitable way to represent it. Given our interest in understanding the connections between different topics, a graph seemed like the natural choice. Articles would be the nodes, and the links to other pages would be the edges. This way, we are able to take advantage of several algorithms and techniques from graph theory and machine learning to derive interesting results.

Building the graph

Wikipedia provides access to the content of its pages, as well as metadata, through the Wiki Dumps. Snapshots of the entire encyclopaedia are available in several formats, including XML and SQL. We opted for the SQL files containing all pages and the links between them, as well as the list of all page ids and their corresponding titles. In total, around 100GB of data were required to gather all the necessary information for constructing the graph representation of Wikipedia, where each node corresponds to a page and each edge to a link between two pages.

The data required extensive processing to guarantee that each page represents a proper article, meaning a page that is neither a disambiguation nor a redirect, so that the graph only contains articles with useful information. Pages not belonging to the article namespace were also filtered out. Article ids from multiple files were matched to enforce consistency in our final representation, which consists of two text files: one with the page ids and titles, and another with the page ids and their outgoing links. For more details about the data processing stage, check out the repository available on GitHub.
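As a rough illustration, a filter along these lines keeps only proper articles. This is a simplified sketch: the field names follow the MediaWiki `page` table, and the actual pipeline in the repository relies on additional metadata for the disambiguation check.

```python
def is_proper_article(page):
    """Keep pages in the article namespace (0) that are neither redirects
    nor disambiguations. The disambiguation test shown here is a crude
    simplification of what the real pipeline does."""
    return (
        page["page_namespace"] == 0                        # article namespace only
        and not page["page_is_redirect"]                   # drop redirects
        and "(disambiguation)" not in page["page_title"]   # crude disambiguation filter
    )
```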

Finally, we were left with 5.8 million nodes and 140 million edges. To create this structure in our program, we designed a custom Python class that leverages the sparsity of the graph to efficiently store and extract information. The linecache module allowed us to optimise the process of reading lines from our text files. Such operations are needed to obtain the inward and outward neighbours of a given node, which is essential for navigating the graph.
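A minimal sketch of such a class is shown below. The file layout and helper names (`titles.txt`-style id/title lines, a tab-separated links file, and an id-to-line `index`) are illustrative assumptions rather than the exact format used in the project.

```python
import linecache


class WikiGraph:
    """Sparse Wikipedia link graph stored in plain text files and read
    lazily with linecache instead of being loaded fully into memory."""

    def __init__(self, titles_path, links_path, index):
        self.titles_path = titles_path  # assumed layout: "page_id<TAB>title" per line
        self.links_path = links_path    # assumed layout: "page_id<TAB>id1 id2 ..." per line
        self.index = index              # maps a page id to its line number in both files

    def title(self, page_id):
        line = linecache.getline(self.titles_path, self.index[page_id])
        return line.rstrip("\n").split("\t")[1]

    def out_neighbours(self, page_id):
        """Pages linked from the given page (outward edges)."""
        line = linecache.getline(self.links_path, self.index[page_id])
        _, targets = line.rstrip("\n").split("\t")
        return [int(t) for t in targets.split()]
```

Inward neighbours would additionally require a reverse index built in a single pass over the links file, which is one more reason the sparsity of the graph matters so much here.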

Searching with Dijkstra’s algorithm

Dijkstra’s algorithm is a famous method for finding shortest paths in a weighted graph. In the initial stage we assumed all weights were equal; in the future, however, they can be changed to incorporate information about the relationship between pages, such as a measure of similarity.
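A compact version of this search might look as follows. It is only a sketch that reuses the `out_neighbours` accessor from the class above; with the default unit weight it effectively behaves like a breadth-first search.

```python
import heapq


def shortest_path_length(graph, source, target, weight=lambda u, v: 1):
    """Dijkstra's algorithm over the link graph: returns the total weight
    (number of links, with unit weights) from source to target, or None
    if the target is unreachable."""
    dist = {source: 0}
    heap = [(0, source)]
    while heap:
        d, u = heapq.heappop(heap)
        if u == target:
            return d
        if d > dist.get(u, float("inf")):
            continue  # stale heap entry
        for v in graph.out_neighbours(u):
            nd = d + weight(u, v)
            if nd < dist.get(v, float("inf")):
                dist[v] = nd
                heapq.heappush(heap, (nd, v))
    return None
```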

By this point we could already explore the structure of our graph and stumble upon surprising findings. For instance, there are pages that seemingly have very little in common but are only a few links apart. This phenomenon echoes the concept of six degrees of separation, which claims that all humans are six or fewer social connections away from each other. A similar statistical argument can be made for our network as well. From our experiments, it appears that most pages are two or three links apart, while the pages furthest apart required four links to reach.

Figure: An example from our graph in which we wanted to find a path from Mathematics to Lionel Messi. Perhaps surprisingly, such a path contains only three edges.

Empirical study: a personal knowledge graph

Having gathered all the necessary data and tools required to conduct our analysis, we set out to define an appropriate context. Our idea was to imagine every person’s knowledge as a subgraph of the Wikipedia graph, since it is both interesting and computationally less demanding than working with the entire network. In practice, we sampled 10,000 pages from four main categories: Mathematics, Philosophy, Sport, and Music.
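One plausible way to build such a subgraph is a breadth-first expansion from the four category pages until the target size is reached; the actual sampling procedure we used may differ from this sketch.

```python
from collections import deque


def sample_subgraph(graph, category_ids, target_size=10_000):
    """Breadth-first expansion from the category pages until roughly
    target_size pages have been collected (illustrative sketch only)."""
    sampled = set(category_ids)
    queue = deque(category_ids)
    while queue and len(sampled) < target_size:
        page = queue.popleft()
        for nbr in graph.out_neighbours(page):
            if nbr not in sampled:
                sampled.add(nbr)
                queue.append(nbr)
                if len(sampled) >= target_size:
                    break
    return sampled
```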

Starting from this setting, we asked ourselves whether it is possible to find a suitable representation of each node that would allow us to assess the similarity between pages without accessing their content. In other words, could we use our graph to identify similar pages by relying strictly on the topology?

Since interpreting raw similarity scores is often difficult (e.g., should Mathematics and Physics have a similarity of 0.5, 0.6, or 0.9?), we decided to solve a classification problem instead. In this formulation, we set out to assign each page to one of the four categories, based on which one yields the highest similarity score. The advantage of this approach is that the results are more interpretable. Furthermore, it is possible to label pages by hand in order to train and validate different models. In fact, we used a set of 530 hand-labelled pages (where each label represents one category) throughout our study.

A heuristic approach

Our first idea was to rely on the length of the shortest paths from each page to the pages corresponding to the categories. Intuitively, pages that are one link away should have more in common than those that are four connections apart. Therefore, we associated with every page a node embedding, i.e., a four-dimensional vector containing the distances from the page to the four category pages.
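In code, building these embeddings could look roughly like this, reusing the shortest-path sketch from earlier; the cap on unreachable pages is an added assumption, not part of the original formulation.

```python
def distance_embedding(graph, page_id, category_ids, max_dist=10):
    """Four-dimensional embedding: shortest-path distance from the page
    to each of the four category pages, capped when unreachable."""
    embedding = []
    for cat_id in category_ids:
        # shortest_path_length is the Dijkstra sketch shown above
        d = shortest_path_length(graph, page_id, cat_id)
        embedding.append(d if d is not None else max_dist)
    return embedding
```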

There are several caveats with this approach. Firstly, there could be situations in which, by coincidence, pages that are logically very different appear close to each other. Alternatively, there may be cases where a category page lacks a direct connection with an article that is closely related to it. Another problem is the abundance of pages with similar distances to the respective category pages. In short, there is not enough variability to properly compute a reasonable similarity measure.

We proposed a solution influenced by a wisdom-of-the-crowds approach. Even if some distances may not reflect the actual similarity, we can assume that, on average and within a local neighbourhood of a node, the embeddings paint a reasonably accurate picture. Therefore, we could incorporate local information into each embedding by updating its value with data from the neighbours, which is equivalent to taking a linear combination of the current value and the average of the neighbouring embeddings, as sketched below.
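One way to read that update rule, sketched with NumPy (the coefficient `lam` and the number of rounds are the hyperparameters discussed next; the data layout is assumed, not taken from the project code):

```python
import numpy as np


def smooth_embeddings(embeddings, neighbours, lam=0.5, n_rounds=3):
    """Repeatedly replace each embedding with a linear combination of its
    current value and the mean of its neighbours' embeddings:
        e_v <- lam * e_v + (1 - lam) * mean(e_u for u in N(v))
    `embeddings` is a (num_pages, 4) array and `neighbours` maps each
    page index to a list of neighbouring page indices."""
    embeddings = np.asarray(embeddings, dtype=float)
    for _ in range(n_rounds):
        updated = embeddings.copy()
        for v, nbrs in neighbours.items():
            if nbrs:
                updated[v] = lam * embeddings[v] + (1 - lam) * embeddings[nbrs].mean(axis=0)
        embeddings = updated
    return embeddings
```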

This algorithm naturally has two hyperparameters: lambda (the coefficient in the linear combination) and n (the total number of updating rounds). To tune these two parameters, we ran a grid search and chose the values that maximised the classification accuracy on our labelled sample. The pages were classified by picking the category with the highest cosine similarity with respect to the given page. Overall accuracy was around 86%.
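The classification step then reduces to a cosine-similarity comparison along these lines; in this sketch, `category_embeddings` would be the smoothed embeddings of the four category pages.

```python
import numpy as np


def predict_category(page_embedding, category_embeddings):
    """Return the index of the category whose embedding is most similar
    (by cosine similarity) to the given page embedding."""
    def cosine(a, b):
        return np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))

    scores = [cosine(page_embedding, c) for c in category_embeddings]
    return int(np.argmax(scores))
```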

Graph neural networks for inductive representation learning

We now move beyond the initial heuristic approach in an attempt to improve our classification accuracy and obtain better embeddings. Our model of choice was GraphSAGE, introduced in “Inductive Representation Learning on Large Graphs” (Hamilton et al., 2017). This graph neural network architecture lies at the core of large-scale recommender systems, such as the ones used by Pinterest and Uber Eats. Its strength lies in its ability to generalise well to new nodes in dynamic networks and to be trained in a semi-supervised setting, like the one we are in. With only 530 labelled nodes, we hoped to learn a model for generating embeddings for our entire graph.

At a basic level, GraphSAGE includes two fundamental elements: sampling and aggregation. Sampling can be done in several ways, for example via random walks or fixed-size neighbourhoods, and plays a role similar to that of a mini-batch in traditional deep learning. The main goal of sampling is to collect local information that is then used in the aggregation step. This second step combines the local information and updates the embeddings through the use of aggregator functions. An aggregator function can be a simple mean operation or a more complex non-linear function that the model learns. This is what gives GraphSAGE its ability to generalise well.
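As an illustration of the mean-aggregation variant, a single GraphSAGE-style layer can be sketched in PyTorch as follows; this is a simplified stand-in, not the exact architecture we trained.

```python
import torch
import torch.nn as nn


class MeanSageLayer(nn.Module):
    """Simplified GraphSAGE layer with mean aggregation: concatenate each
    node's embedding with the mean of its sampled neighbours' embeddings,
    then apply a learnable linear map and a non-linearity."""

    def __init__(self, in_dim, out_dim):
        super().__init__()
        self.linear = nn.Linear(2 * in_dim, out_dim)

    def forward(self, h, neighbour_idx):
        # h: (num_nodes, in_dim); neighbour_idx: (num_nodes, k) sampled neighbour indices
        h_neigh = h[neighbour_idx].mean(dim=1)               # aggregate sampled neighbours
        h_new = self.linear(torch.cat([h, h_neigh], dim=1))  # update with own embedding
        return torch.relu(h_new)
```

Stacking a couple of such layers and training with a classification loss on the labelled nodes gives the general flavour; the LSTM aggregator mentioned below replaces the mean with a learned sequence function.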

After training the model with both mean and LSTM aggregation, we reached an accuracy of 89% on the classification task. This improvement may seem marginal with respect to the heuristic approach, but the true benefit of deep learning is the potential to produce good embeddings for new nodes. Adding a node to the network would require re-running the updating procedure, which is computationally expensive with the heuristic approach. GraphSAGE, on the other hand, would immediately output a new embedding based on the learnt aggregator function. Overall, GraphSAGE is a very powerful model with immense utility when dealing with large, potentially growing networks.

Conclusion

Building a knowledge graph of Wikipedia proved to be a highly complicated task that required extensive data processing, modelling, and planning. Finding and creating models to understand how similar different pages are was an exercise in creativity and in understanding state-of-the-art techniques. In the end, beyond the results obtained, the knowledge acquired in the process was the most important reward. Without realising it, we found the knowledge we had sought from the beginning: the process we went through to uncover the structure of human knowledge was a proxy for developing our own.

