A report of the “Human Migration – Potential areas for combinations of Big Data” workshop at SOCINFO 2020
Exploratory: Migration Studies
The 12th International Conference on Social Informatics (SOCINFO 2020), organized by the National Research Council of Pisa (CNR, Pisa), opened on 6 October. The rich program included the "Human Migration - potential areas for combinations of big data" workshop, organized by Tuba Bircan, Carlos Arcila Calderón, Ji Su Kim, and Alina Sîrbu. The workshop aimed to let an interdisciplinary set of researchers, institutions, and industry share their experiences with Big Data and migration.
The goal here is to briefly summarize the content of the workshop, offering the reader some references and insights on current research in the field, as well as some food for thought.
First session.
The first talk, introduced by Tuba Bircan, was the keynote by Nuria Oliver, "Data Science against COVID-19 - the mobility perspective", which focused on the use of mobile phone data for public health decisions in the Valencia region during the Covid-19 emergency. Analysing mobile phone data from three major Spanish telephone companies, the researchers studied changes in mobility during the lockdown, the different types of mobility, and their epidemiological impact. The results show a 65% decrease in the radius of gyration, with between 88% and 92% of individuals not leaving their area of residence during the lockdown, suggesting the success of the stay-at-home campaign. Adding the results of a city survey, Oliver observed that 10% of the population did not leave their home, that 33% of people went to work (reduced to 26% in the week with no labour mobility), and that 34% of individuals teleworked. Finally, Oliver described a metapopulation SEIR epidemiological model and a predictive model based on time series.
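As an illustration of the mobility measure mentioned above, the radius of gyration is commonly defined as the root-mean-square distance of a person's visited locations from their centre of mass. The following is a minimal sketch under stated assumptions: positions are given as planar (x, y) coordinates in kilometres, and the sample points and function names are hypothetical, not taken from the talk.

```python
import math


def radius_of_gyration(points):
    """Root-mean-square distance of visited points from their centre of mass.

    `points` is a list of (x, y) positions in a planar projection (e.g. km).
    Treating coordinates as planar is a simplifying assumption.
    """
    if not points:
        raise ValueError("need at least one point")
    n = len(points)
    cx = sum(x for x, _ in points) / n
    cy = sum(y for _, y in points) / n
    mean_sq = sum((x - cx) ** 2 + (y - cy) ** 2 for x, y in points) / n
    return math.sqrt(mean_sq)


def relative_change(before, after):
    """Relative change between two values (negative means a decrease)."""
    return (after - before) / before


# Hypothetical positions (km) of one user before and during a lockdown.
before_points = [(0, 0), (10, 0), (0, 10), (10, 10)]
during_points = [(4, 4), (6, 4), (4, 6), (6, 6)]

rg_before = radius_of_gyration(before_points)
rg_during = radius_of_gyration(during_points)
change = relative_change(rg_before, rg_during)
print(f"radius of gyration changed by {change:.0%}")
```

In practice the positions would come from cell-tower locations, and an aggregate figure would be computed over many users rather than one.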
The second talk, "Modeling the bias of digital data: an approach to combining digital and survey data to estimate and predict migration trends", was presented by Yuan Hsiao. Motivated by the importance of understanding and predicting migration flows, the work combines different types of data to do so. The speaker presented the results of predictive models built on (a) only data from Twitter, which show both a certain degree of error and bias, and (b) only data from ACS, which instead show error only. Finally, Hsiao presented the combined model, which aims to study spatial and temporal effects by combining Twitter data with ACS data. The results obtained by cross-validation show that the combined model is more reliable than the one based solely on official statistics.
The third talk, titled "Detecting hate speech against migrants and refugees in Twitter using supervised text classification", was presented by Javier Amores. The speaker first provided an overview of hate speech and its increase on social platforms in recent years. Using a supervised text classifier, the authors validated a detector of hate speech in Spanish, focusing on racism directed at migrants and refugees.
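To give a sense of what supervised text classification involves, here is a minimal sketch of a multinomial Naive Bayes classifier with add-one smoothing. The report does not say which classifier the authors used, so the choice of Naive Bayes is an assumption, and the tiny English training set is a hypothetical toy example rather than the Spanish corpus from the talk.

```python
import math
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase whitespace tokenizer (a deliberately simple assumption)."""
    return text.lower().split()


def train(examples):
    """Count labels and per-label word frequencies from (text, label) pairs."""
    label_counts = Counter()
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in examples:
        label_counts[label] += 1
        for token in tokenize(text):
            word_counts[label][token] += 1
            vocab.add(token)
    return label_counts, word_counts, vocab


def predict(model, text):
    """Return the label with the highest log-posterior for `text`."""
    label_counts, word_counts, vocab = model
    total_docs = sum(label_counts.values())
    scores = {}
    for label, n_docs in label_counts.items():
        score = math.log(n_docs / total_docs)
        denom = sum(word_counts[label].values()) + len(vocab)
        for token in tokenize(text):
            if token in vocab:  # words never seen in training are ignored
                score += math.log((word_counts[label][token] + 1) / denom)
        scores[label] = score
    return max(scores, key=scores.get)


# Hypothetical toy training set; real work needs a large annotated corpus.
toy_examples = [
    ("refugees are not welcome here", "hate"),
    ("send them all back", "hate"),
    ("refugees are welcome here", "other"),
    ("the city opened a new shelter for refugees", "other"),
]
model = train(toy_examples)
print(predict(model, "send them back"))
print(predict(model, "a new shelter opened"))
```

A real detector would be validated on held-out annotated tweets, for example by reporting precision and recall, which this sketch does not attempt.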
Finally, with "Measuring the European Salad Bowl with Superdiversity", Laura Pollacci presented the Superdiversity Index, a measure of the distance between the standard emotional use of language and its use within specific communities. The index, validated by comparison with official statistics and other possible diversity indices, was measured for various European countries. The results show that the index correlates well with immigration rates and that it could be used, in combination with other measures, in a nowcasting model of immigrant stocks.
Second session.
The second session was chaired by Carlos Arcila, who introduced the second keynote speaker, Marzia Rango, and her talk "Harnessing data integration for migration studies - the Big Data for Migration Alliance". After a comprehensive overview of the context of migration studies, Rango motivated the recent search for new data to complement, improve, and advance current research, presenting some recent studies and the latest advances towards new data sources, and highlighting the pros and cons of unconventional data. The presentation then dealt with the challenges arising from the use of unconventional data in migration research, such as security and ethics. The talk ended by highlighting the need for an awareness that allows the creation of models and the dissemination of useful and genuinely reliable results.
The second session continued with the talk "Brain Drain and Brain Gain in Russia: Analyzing International Migration of Researchers by Discipline using Scopus Bibliometric Data 1996-2020" by Alexander Subbotin. The presented work uses bibliometric data to study migrating researchers and their trajectories through changes in their affiliations, with the main goal of quantifying the overall level of academic migration in Russia. Net migration rates show that Russia was on the losing side of a brain circulation system in the late 1990s and early 2000s, whereas the trend has turned positive in recent years.
Then, a second talk by Laura Pollacci, “Studying Brain Drain with Big (scholarly) Data”, explored how collaborations between researchers evolved at multiple scales by measuring individual researchers' tendency to collaborate with colleagues working in institutions in the same country or abroad.
Finally, Tuba Bircan, Albert Ali Salah, Matteo Pignotti, and Carlos Arcila Calderón took turns presenting “From Turkey to Europe: Movements of people at the Turkish border in March 2020”, an extensive work on data relating to the recent significant exodus of political refugees towards the Greek border. The multi-faceted research includes an analysis of the language and location of Twitter data, together with lexicon-based sentiment analysis and an analysis of CDR data.
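The affiliation-based approach described in the bibliometric talk above can be sketched as follows: a change of affiliation country between two consecutive publications is counted as a migration event, and a country's net migration is its inflow minus its outflow. This is a minimal sketch under stated assumptions. The data layout, the function names, and the sample histories are hypothetical, and the actual study may define moves and rates differently.

```python
def migration_events(affiliation_history):
    """Return (origin, destination) pairs for consecutive country changes.

    `affiliation_history` is a chronologically ordered list of country codes,
    one per publication (a hypothetical, simplified representation).
    """
    return [
        (prev, curr)
        for prev, curr in zip(affiliation_history, affiliation_history[1:])
        if prev != curr
    ]


def net_migration(histories, country):
    """Return (inflow, outflow, net) of researcher moves for `country`."""
    inflow = outflow = 0
    for history in histories:
        for origin, destination in migration_events(history):
            if destination == country:
                inflow += 1
            if origin == country:
                outflow += 1
    return inflow, outflow, inflow - outflow


# Hypothetical affiliation histories, one list per researcher.
sample_histories = [
    ["RU", "RU", "DE"],  # leaves the country
    ["US", "RU"],        # arrives
    ["RU", "FR", "RU"],  # leaves and later returns
    ["RU", "US"],        # leaves
]
inflow, outflow, net = net_migration(sample_histories, "RU")
print(inflow, outflow, net)
```

Dividing the net count by the number of researchers active in the country in a given period would turn it into a rate. The report does not specify the denominator used in the study.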
Third session.
The third session started with Hillel Rapoport, presenting "Migration and Cultural Change". Throughout human history, the fear of immigrants and of the effects of migration on the host country has always been present. There is no doubt now that immigration can change, to some degree, the traditions of the receiving country. Starting from the general research question "How is migration changing our world?", the talk focused on how globalization affects cultural similarities and immigration: does globalization lead to cultural convergence or to divergence? Rapoport presented a theoretical model that considers two countries and aims to qualify cultures through cultural exchange. The model can capture different mechanisms of cultural exchange, both static and dynamic. The work aims to propose a simple migration-based model of cultural change, to derive a measure of cultural proximity from survey data as the distance between individuals, and finally to understand which mechanism is dominant in cultural exchange. The conclusion is that migration drives cultural change and affects cultural formation in both sending and receiving countries through various mechanisms. The net effect of migration on cultural similarity seems to be positive, since migration favours cultural convergence.
Then, Riccardo Guidotti presented "Measuring Immigrants Adoption of Natives Shopping Consumption with Machine Learning". The work aims to measure the integration of immigrants through shopping consumption. It is based on the concept "Tell me what you eat, and I'll tell you who you are", and on the relation between culture and food consumption. This relationship is particularly interesting because supermarket purchases expose immigrants much less to external judgment and because purchases can be observed over a long period. With this work, the authors look at the "stay phase" of migration, and thus at integration. For each customer, the dataset is composed of various baskets, each with a specific cost and consisting of different products. Starting from the country of birth and the country of reference of customers, the authors build a Machine Learning classifier to see whether non-natives integrate over time by adopting native consumption habits. The model returns a customer's probability of being classified as native. The authors build twelve different models so as to also capture seasonal purchases (e.g. typical food of holiday periods, such as Christmas). The analysis, performed on UniCoop data relating to customers from 158 different countries, produced five well-separated clusters, namely increasingly native-refusers, stable native-refusers, early native-adopters, late native-adopters, and native-like customers. Furthermore, a general adoption of the purchasing habits of the destination country can be observed.
The final talk was given by Ji Su Kim and was entitled "Tell us what you think: home and destination attachment for migrants on Twitter Data". The work aims to measure cultural integration by using hashtags on Twitter. The authors look at country-specific hashtags to measure immigrants' integration. To identify migrants, the authors define the country of residence as the place with the longest stay and determine a user's nationality by looking at the location and language of the tweets of the user and of the user's friends. Here, an immigrant is defined as a user whose country of residence differs from the country of origin. The authors then assign a nationality to hashtags, applying a threshold to retain only the most country-specific ones. Finally, Kim presented the Home and Destination attachment indexes, which try to measure how much an individual is interested in what is happening in his/her country of origin and in his/her country of residence, respectively. The results show that Italian immigrants are more attached to the country of origin than to the destination country; however, the Destination attachment increases in English-speaking countries (US and GB) and in Spain. Proficiency in the language of the host country facilitates a higher Destination attachment level. The Destination attachment level depends on the migration country, and the Home attachment on the country of origin. Also, sharing borders increases the Home attachment level.
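The two steps described above, assigning a nationality to hashtags with a threshold and then scoring a user's attachment, can be sketched as follows. This is only an illustrative sketch. The report does not give the exact formulas, so the 0.8 threshold, the share-based index definition, the function names, and the sample counts are all assumptions.

```python
from collections import Counter


def country_specific_hashtags(usage, threshold=0.8):
    """Map each hashtag to a country if that country dominates its usage.

    `usage` is {hashtag: {country: count}}. A hashtag is kept only when the
    top country's share of its total usage is at least `threshold`
    (the threshold value is an assumption).
    """
    assigned = {}
    for tag, by_country in usage.items():
        total = sum(by_country.values())
        if total == 0:
            continue
        country, count = max(by_country.items(), key=lambda item: item[1])
        if count / total >= threshold:
            assigned[tag] = country
    return assigned


def attachment(user_tags, tag_country, home, destination):
    """Return (home, destination) attachment as shares of a user's hashtags."""
    if not user_tags:
        return 0.0, 0.0
    counts = Counter(tag_country.get(tag) for tag in user_tags)
    n = len(user_tags)
    return counts[home] / n, counts[destination] / n


# Hypothetical usage counts by country of residence of the posting users.
sample_usage = {
    "#sanremo": {"IT": 90, "GB": 10},
    "#brexit": {"GB": 85, "IT": 15},
    "#monday": {"IT": 50, "GB": 50},  # not country-specific, gets dropped
}
tag_country = country_specific_hashtags(sample_usage)

# Hypothetical user of Italian origin residing in Great Britain.
user_tags = ["#sanremo", "#sanremo", "#brexit", "#monday"]
home_att, dest_att = attachment(user_tags, tag_country, home="IT", destination="GB")
print(home_att, dest_att)
```

Under these assumptions the two indexes do not need to sum to one, since hashtags that are not country-specific count towards neither.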
Final discussion.
It is clear that one of the main challenges in using new data sources is the validation of results and models: it is necessary to understand when we can rely on the results obtained, and when instead we are observing data that are too distant from reality or represent only some parts of the population.
The validation of the results obtained with big data is difficult due to two main factors: the lack of ground truth and the incompatibility of the data. We can often get very up-to-date unconventional data, but these may be incompatible with typically less updated official data. Another problem is the difference between the observation of single individuals and entire communities. Can we apply the same macro-level insights to understand the behaviour of individuals?
The discussion then turned to the difficulty of integrating quantitative and qualitative methods; most of the issues relate to technical aspects and to the skills of the researchers. An interdisciplinary approach is therefore increasingly necessary.
The last part of the final discussion focussed on the issues related to ethics and privacy. When studying immigration, it is first of all necessary to think about the implications of the results. Research must continue to benefit from new data sources. However, especially when publishing data and results, researchers must provide some guarantees, such as the impossibility of tracking and identifying people. This attention becomes increasingly important given that users, especially on social networks, tend to share a lot of personal information, some of it very sensitive. In this context, it would be essential to open communication channels with both companies and institutions. A well-planned, constant collaboration could make available a large amount of new data that would be very useful for answering open research questions. At the same time, by planning the exchange of information correctly, it may be possible to guarantee both the general reliability of the data and ethical and privacy rights.
Written by: Laura Pollacci