Generation of complete realistic cellular network traffic
Charging Data Records are acknowledged as a standard tool for studying human mobility, infrastructure usage, and traffic behavior. We refer to such datasets as CdRs to distinguish them from traditional Call Detail Records (CDRs), which describe call and SMS communication only. CdRs describe time-stamped and geo-referenced events of every type (i.e., data, calls, SMS) generated by each mobile device interacting with operator networks. They span city-, region-, or country-wide areas and usually cover long periods (months or years); no other technology currently provides equivalent per-device precision at this scope. As a result, CdRs are a rich source of knowledge for many communities, such as sociology, epidemiology, and networking.
Yet, the exploitation of real-world CdRs for research faces several limitations. First, accessibility: CdR datasets are not publicly available and can only be obtained under strict agreements with mobile operators. Second, usability: CdRs are usually available in an aggregated form (i.e., grouped mobility flows and coarse spatiotemporal information), which limits the precision of related analyses. Third, privacy: even anonymized, non-aggregated CdRs reveal sensitive information about users' habits, which makes them hard to share. This project aims to address these limitations by enabling the scientific community to autonomously generate realistic and privacy-compliant CdRs, thus opening new avenues for research.
In particular, generated CdRs should satisfy four essential attributes: completeness, realism, fine-grained description, and privacy. Meeting all of them at once makes the generation of realistic CdRs challenging and complex.
To meet these criteria, we use as a baseline a previous framework with the same goal, named Zen [R1]. An overview of the Zen framework is provided in Figure 1. The Zen architecture consists of (1) a traffic module, (2) a mobility module, (3) a social-ties module, and (4) a CdR-combiner (or merger) module. The traffic module leverages Long Short-Term Memory (LSTM) neural networks jointly with statistical analysis to model users' traffic behavior from real-world CdRs. The mobility module (i) emulates users' displacements over time on a real-world geographical map for a selected period and (ii) associates the corresponding users' positions with a real-world cellular topology. The resulting dataset feeds the social-ties module, which builds the social structure of the network on top of which users' communication interactions occur. It does so by creating users' phonebooks, i.e., lists of phone numbers a user is likely to contact. Finally, the CdR-combiner module combines all modules' outputs to generate realistic CdRs over a specified duration and a particular urban area.
Figure 1: Zen architecture
Although Zen's models have been validated as accurately reproducing the daily cellular behaviors of the urban population, Zen suffers from limitations related to the incompleteness of its reference datasets, which have only traffic features and lack mobility ones. As a result, Zen fails to reproduce the distribution of the counts of generated events through time and unrealistically correlates users' traffic with their mobility behaviors.
Our aim is to fix these limitations of Zen while extending it with the flexibility and generality needed to adapt its generation to target mobility zones, with the help of complete real-world datasets from a major network operator in Chile.
During a two-week visit to CNR Pisa, hosted by Luca Pappalardo, our joint efforts towards this goal led to a novel generative model that we find both promising and elegant. As depicted in Figure 2, our modeling combines recent deep learning techniques that perform well in NLP applications (i.e., self-attention layers [R2]) with the established literature on human mobility laws ([R3], [R4]).
In particular, we focus on data traffic only (eXtended Data Records, XDRs) and encode each user's traffic as a sequence of counts of the data sessions she creates per fixed-length time slot (e.g., 10 minutes). Such segmentation allows the model to directly learn the circadian rhythm associated with human event generation, while the mapping of events to their exact timestamps can be done afterwards with a small error, for instance using an interpolation method. To cope with the high dimensionality that this encoding induces for each user's sequence, we propose to use a self-attention layer instead of an LSTM, so that previous sequence elements lying more than an hour away from the time slot of interest are given minimal weight. Once the model is trained on only the traffic part of the input dataset, each user sequence it produces is coupled with the generation of a realistic trajectory, in such a way that the resulting individual XDR (combined mobility and traffic) fits within the distribution of real users' XDR sequences. The mobility trajectory is produced by a mechanistic model from the literature, chosen for its high fidelity in reproducing human mobility laws and its flexibility in accounting realistically for the mobility zone. Finally, the overall model is trained with an adversarial strategy.
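To make the per-slot encoding concrete, the following is a minimal sketch and not the project's implementation. It assumes session start times given as epoch seconds, and the function names (encode_day, decode_day) are hypothetical. The decoding step spreads events evenly within each slot, which is only one simple stand-in for the interpolation mentioned above.

```python
SLOT_SECONDS = 10 * 60   # 10-minute slots, as in the example from the text
DAY_SECONDS = 24 * 3600

def encode_day(session_starts, day_start, slot_seconds=SLOT_SECONDS):
    """Count the data sessions of one user in each time slot of one day.

    session_starts: iterable of epoch seconds (assumed input format).
    Sessions that start outside [day_start, day_start + 24h) are ignored.
    """
    counts = [0] * (DAY_SECONDS // slot_seconds)
    for t in session_starts:
        offset = t - day_start
        if 0 <= offset < DAY_SECONDS:
            counts[int(offset // slot_seconds)] += 1
    return counts

def decode_day(counts, day_start, slot_seconds=SLOT_SECONDS):
    """Map per-slot counts back to timestamps, evenly spaced inside each slot.

    This is an illustrative choice: any scheme that keeps each event inside
    its slot preserves the encoded sequence.
    """
    stamps = []
    for slot, c in enumerate(counts):
        for k in range(c):
            stamps.append(day_start + slot * slot_seconds
                          + (k + 0.5) * slot_seconds / c)
    return stamps
```

With 10-minute slots, one day becomes a sequence of 144 counts per user, and encoding the decoded timestamps again returns the same sequence, so the timestamp error is bounded by the slot length.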
Figure 2: "Split-Joint" XDR generation model architecture
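The restriction of attention to recent time slots can be illustrated with a minimal single-head sketch in plain Python. Several simplifications here are our assumptions and are not taken from the model design: queries, keys, and values are the raw inputs (no learned projections), the window is enforced with a hard mask rather than a learned down-weighting, and the function name windowed_self_attention is hypothetical.

```python
import math

def windowed_self_attention(x, window):
    """Scaled dot-product self-attention over a sequence of vectors.

    x: list of equal-length float vectors, one per time slot.
    Position i attends only to positions j with i - window <= j <= i,
    i.e., to itself and to at most `window` previous slots.
    Returns (outputs, attention_matrix).
    """
    n, d = len(x), len(x[0])
    outputs, attention = [], []
    for i, query in enumerate(x):
        lo = max(0, i - window)
        keys = x[lo:i + 1]
        scores = [sum(a * b for a, b in zip(query, key)) / math.sqrt(d)
                  for key in keys]
        top = max(scores)                      # subtract max for stability
        exps = [math.exp(s - top) for s in scores]
        total = sum(exps)
        weights = [e / total for e in exps]
        row = [0.0] * n                        # zero weight outside the window
        row[lo:i + 1] = weights
        attention.append(row)
        outputs.append([sum(w * key[c] for w, key in zip(weights, keys))
                        for c in range(d)])
    return outputs, attention
```

With 10-minute slots, window=6 corresponds to the one-hour horizon mentioned above. Each position then compares itself with at most 7 slots, so the cost grows linearly with the sequence length instead of quadratically.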
In future steps, we plan to implement this model, starting with its traffic part. As the complete dataset will only be accessible in Chile for privacy compliance, we are organizing a mission there for this purpose. The goal of the mission is to train the model and validate the generation results, both by comparing the generated distributions with real-world ones and by assessing the general utility of the synthetic dataset in practical applications. This will require handling the specificities of the raw dataset, i.e., its large size and its attributes.
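As an illustration of what a distribution comparison could look like, the sketch below computes the Jensen-Shannon divergence between two empirical distributions, for example the per-slot session counts of real and generated users. The choice of this metric is our assumption and is not fixed by the project; the function names are hypothetical.

```python
import math
from collections import Counter

def distribution(values):
    """Empirical probability of each distinct value in a sample."""
    counts = Counter(values)
    total = sum(counts.values())
    return {v: c / total for v, c in counts.items()}

def js_divergence(p, q):
    """Jensen-Shannon divergence in bits between two discrete distributions.

    p, q: dicts mapping value -> probability. The result is 0 for identical
    distributions and 1 for distributions with disjoint supports.
    """
    support = set(p) | set(q)
    mix = {v: 0.5 * (p.get(v, 0.0) + q.get(v, 0.0)) for v in support}

    def kl(a):
        # Kullback-Leibler divergence of a from the mixture, skipping zeros.
        return sum(a[v] * math.log2(a[v] / mix[v])
                   for v in support if a.get(v, 0.0) > 0.0)

    return 0.5 * kl(p) + 0.5 * kl(q)
```

A value close to 0 would indicate that the generated counts follow the real distribution closely. The metric is symmetric and stays finite even when one sample contains values the other lacks.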
Authors:
Anne Josiane Kouam, INRIA & Ecole Polytechnique, Palaiseau, France | anne-josiane.kouam-djuigne@inria.fr
Aline Carneiro Viana, INRIA, Palaiseau, France | aline.viana@inria.fr
Luca Pappalardo, CNR, Pisa, Italy | luca.pappalardo@isti.cnr.fr
Leo Ferres, University of Desarrollo, Santiago, Chile | lferres@udd.cl
Alain Tchana, Grenoble INP, Grenoble, France | alain.tchana@grenoble-inp.fr
References:
[R1] Anne Josiane Kouam, Aline Carneiro Viana, Alain Tchana. 2023. LSTM-based generation of cellular network traffic. IEEE WCNC 2023.
[R2] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Lukasz Kaiser, Illia Polosukhin. 2017. Attention is all you need. Advances in Neural Information Processing Systems 30 (NIPS 2017).
[R3] Shan Jiang, Yingxiang Yang, Siddharth Gupta, Daniele Veneziano, Shouna Athavale, and Marta C. González. 2016. The TimeGeo modeling framework for urban mobility without travel surveys. PNAS. doi:10.1073/pnas.1524261113
[R4] L. Pappalardo, F. Simini. 2018. Data-driven generation of spatio-temporal routines in human mobility. Data Min Knowl Disc 32, 787–829. https://doi.org/10.1007/s10618-017-0548-4