SoBigData Articles

GDELT: a unique, massive and open dataset for unfolding and understanding our society

Exploratory: Demography, Economy and Finance 2.0
 

Have you ever imagined a global database of society that is open and easily accessible for real-time research? The Global Database of Events, Language, and Tone, or simply GDELT (https://www.gdeltproject.org/), promises to be such a database, and it is supported by Google Jigsaw.

GDELT is an initiative that collects, classifies, and scores news published around the world. In particular, it monitors the world’s news articles daily, labels every event reported in them with a specific theme, and makes the resulting information available on Google BigQuery. The labels given to the events relate to socioeconomic and political themes (see the CAMEO Code Reference for the detailed theme list[1]). For each event, the information includes the location and the date on which it occurred, as well as other values such as the Goldstein and Tone values, which relate to social stability and to the sentiment created by the events in every country (see the Data Format Codebook for a detailed explanation of the GDELT data that can be extracted from Google BigQuery[2]). What is also fascinating about GDELT is that data is available from 1979 to the present, which enables historical analysis, comparisons with the present, and any other ideas analysts may have.

Given the massive size and complexity of GDELT, Google BigQuery makes the database practical to use by offering real-time querying and analysis. The latest GDELT version (version 2) pulls news articles every 15 minutes, and more than 100 000 rows per day are loaded into Google BigQuery. Users can query, export, and even conduct sophisticated analyses and modeling of the entire dataset using standard SQL.

How to use GDELT & BigQuery to understand the world

 

Now, let’s imagine that we want to extract the daily total number of events, per event category, that occurred in Libya between the middle of March 2019 and the middle of April 2019. Let’s work through this example by running a query on Google BigQuery and looking at the information it returns:

Picture 1: The query composed and run on Google BigQuery, and the daily total number of event news in Libya from the middle of March to the middle of April 2019

Pict. 1 shows the query and the daily number of events in Libya, per event category, as extracted from Google BigQuery for the period from the middle of March to the middle of April 2019. For example, on 14 March 2019, 5 events related to “Make a public statement expressed verbally or in action” (event code 010) took place in Libya.
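The query itself is only shown in Pict. 1, so what follows is a minimal sketch of a query of this kind, not the exact one used above. It assumes the public BigQuery table `gdelt-bq.gdeltv2.events` and the column names `SQLDATE`, `EventCode` and `ActionGeo_CountryCode`, with `LY` as the country code for Libya; the exact start and end dates are also assumptions, and `build_daily_event_query` is a hypothetical helper name.

```python
from datetime import date

# Assumed name of the public GDELT 2.0 events table on BigQuery.
GDELT_EVENTS_TABLE = "gdelt-bq.gdeltv2.events"


def build_daily_event_query(country_code: str, start: date, end: date) -> str:
    """Build a standard SQL query that counts events per day and per event code.

    The column names (SQLDATE, EventCode, ActionGeo_CountryCode) are assumed
    from the GDELT event codebook; SQLDATE is treated as a YYYYMMDD integer.
    """
    if start > end:
        raise ValueError("start must not be after end")
    if not (len(country_code) == 2 and country_code.isalpha()):
        raise ValueError("country_code must be a two-letter code")

    # Convert the dates to the YYYYMMDD integer form used for filtering.
    start_int = int(start.strftime("%Y%m%d"))
    end_int = int(end.strftime("%Y%m%d"))

    return (
        "SELECT SQLDATE, EventCode, COUNT(*) AS total_events\n"
        f"FROM `{GDELT_EVENTS_TABLE}`\n"
        f"WHERE ActionGeo_CountryCode = '{country_code.upper()}'\n"
        f"  AND SQLDATE BETWEEN {start_int} AND {end_int}\n"
        "GROUP BY SQLDATE, EventCode\n"
        "ORDER BY SQLDATE, EventCode"
    )


# Assumed date range for "middle of March to middle of April 2019".
query = build_daily_event_query("LY", date(2019, 3, 14), date(2019, 4, 15))
print(query)
```

The printed text can be pasted into the BigQuery console. With the `google-cloud-bigquery` Python client installed and credentials configured, the same string could instead be passed to `bigquery.Client().query(query)`.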

Next, we save the results from Google BigQuery and visualise the data to better understand the information we extracted. Pict. 2 shows the daily total number of events in Libya, as extracted from Google BigQuery, from the middle of March to the middle of April 2019. GDELT shows a noticeable rise in total events at the beginning of April 2019, exactly when the “Western Libya offensive” took place. It is also possible to extract the URL of the article from which each event comes. As an example, two news articles reporting events that happened during this period are shown under the plot: an article from The Guardian (4 April 2019) and an article from The Telegraph (7 April 2019).

Picture 2: The daily total number of events in Libya extracted from Google BigQuery, from the middle of March to the middle of April 2019, and two examples of news articles extracted from the URLs of two events that occurred in this period. GDELT depicts a noticeable rise of the total events at the beginning of April 2019, exactly when the “Western Libya offensive” took place.
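The step from per-category counts (Pict. 1) to daily totals (Pict. 2) is a simple sum per day. The sketch below illustrates it, assuming the saved results are rows of (date, event code, count). The first row mirrors the example quoted from Pict. 1; all other rows are made-up placeholders, not real GDELT output, and `daily_totals` is a hypothetical helper name.

```python
from collections import defaultdict


def daily_totals(rows):
    """Sum the per-category event counts into a single total per day."""
    totals = defaultdict(int)
    for sqldate, _event_code, count in rows:
        totals[sqldate] += count
    # Return the totals ordered by date (YYYYMMDD integers sort chronologically).
    return dict(sorted(totals.items()))


# Rows in the assumed shape (SQLDATE, EventCode, total_events).
rows = [
    (20190314, "010", 5),   # example quoted in the text for 14 March 2019
    (20190314, "020", 3),   # placeholder value
    (20190404, "190", 40),  # placeholder value
    (20190404, "010", 12),  # placeholder value
]

totals = daily_totals(rows)
# The day with the highest total is where a rise would show up in the plot.
peak_day = max(totals, key=totals.get)
print(totals)
print(peak_day)
```

The resulting dictionary maps each day to its total and can be handed to any plotting library to draw a chart like the one in Pict. 2.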

This is only a small example of GDELT’s potential. The combination of GDELT and Google BigQuery makes it possible to query a single country or even the whole planet. And why is this interesting? Well, it allows researchers, policymakers and non-governmental organisations to explore, analyse and visualise socioeconomic and political events across the whole world. Because GDELT is updated every 15 minutes and is free to access, it can complement traditional data sources, which are usually costly and time-consuming to collect, and help overcome their limitations. Amazing possibilities can emerge from monitoring the world through GDELT and from combining GDELT with other datasets, contributing to societal improvement.

Written by: Vasiliki Voukelatou

Revised by: Luca Pappalardo