Unleashing the Power of Big Data to Build Smart Cities

by Giulia Preti, dbTrento Group, University of Trento, Italy

The steady growth of the world population has increased the demand of resources like food, water, and energy, and thus providing a good quality of life and adequate services to the citizens has become a challenge for today's cities. The research community has given considerable attention to this problem, hence using open data collected by public institutions and organizations for urban planning [1, 2], zoning regulation [3], public transport design, disease outbreak control, energy management [15], and disaster management [4, 5]. A large amount of this data is generated by sensor networks monitoring variables such as air quality, water quality, noise pollution, and road traffic load. These networks are dynamic by nature, as the value recorded by each node change over time, new nodes can be added or removed from the network, and links can fail or be removed as well.
In this work, we focus on the detection of dense groups of correlated edges in dynamic networks, which are groups of edges that are topologically close and changed in a similar way in a period of time. These structures are important in root cause analysis and anomaly detection, among others. For example, a traffic accident affects the speed along neighbor roads, and thus it causes the appearance in the road network of a dense group of edges with similar levels of activity. Similarly, the disappearance of a dense group of edges in a mobile network can be caused by the failure of the base station serving the corresponding nodes.

Problem Definition

The input is a dynamic network, which is a series of labeled graphs where the set of nodes is fixed, while the edges can change both structurally (i.e., they can appear/ disappear) and qualitatively (i.e., they have a weight that changes over time). We propose two measures to compute the density of a group of dynamic edges. The first measure is the minimum average node degree in the subgraph induced by the edges measured in the snapshots of the network. The second measure, instead, is the average between the average node degrees in all the snapshots. Similarly, we propose two measure to express the temporal correlation of a group of edges. The first one is the minimum Pearson correlation between the series of values associated to the edges in the group, while the second one is the average Pearson correlation.
Then, given a density and a correlation threshold, our goal is to find all the maximal groups of edges with density and correlation greater than the respective threshold. We analyze four different formulations of this problem, obtained by pairing the four measures described above.
In addition, since some types of dynamic networks may naturally contain a large number of dense groups of correlated edges, we study also the problem of identifying a more compact subset of results that is representative of the whole set. In particular, given a threshold on the maximum pairwise Jaccard similarity between groups of edges, we want to find a set of groups with pairwise overlap less than the threshold.

Method

We developed an exact solution to two problem formulations, i.e., that based on the minimum correlation and the minimum density, and that based on the minimum correlation and the average density. The algorithm is a two-phase approach that first identifies maximal groups of correlated edges, and then extract those subgroups of edges that form a dense subgraph.
In the first phase, we create an auxiliary graph where nodes represent edges of the original graph, and links exist between edges having correlation above the given threshold.
We can easily show that a maximal clique in the auxiliary graph corresponds to a maximal group of correlated edges. In fact, since all the nodes in a clique are connected, the edges they represent have correlation above the threshold, and thus the minimum correlation value in the group is greater than the threshold. We used Min hashing [12] to efficiently generate the auxiliary graph, and a variant of the TAPER algorithm [2] to implement the search of maximal cliques.
In the second phase, we extract the connected components from all the cliques, retaining only those that are maximal to avoid redundant computations in the next steps. Then, we examine each component, starting from the largest ones to allow an early termination of the algorithm when a component is found to be dense. We recall that we are interested only in the maximal dense subgraphs.
If the density of a component is below the threshold, all its subcomponents are recursively examined. Flags are used to avoid the examination of the same subcomponent multiple times.

Results and Conclusions

We tested our algorithms on a real dynamic network created using Twitter data collected from 2011 to 2016 by filtering keywords related to the gun control, the abortion, and the Obamacare topic.
The preliminary results, obtained using samples of this network of increasing size, allowed us to identify the limitations of our exact solution and prove the effectiveness of the hashing technique used in the approximate solution. In particular, in the exact solution, the time required to generate the pairs of correlated edges increased exponentially with the size of the network, whereas a careful tuning of the parameters in the approximate solution led to accurate results in reasonable amounts of time.
The experiments conducted using different thresholds on the minimum correlation and minimum density, required for a group to be part of the result set, allowed us to compare the quality (i.e. interestingness) of the groups of edges found. In particular, they showed that a social network like Twitter tends to contain a lot of overlapping dense communities of correlated edges, which in some applications convey redundant information. This result therefore motivates the need for our problem formulation that has the goal of finding a compact subset of diverse groups.
Nonetheless, more experiments are required before assessing the importance of the results found and how they can be applied to real world problems.

Future Work

We plan to introduce additional optimizations to speed up our algorithms, and then perform further tests on a Mobile Phone network, as well as on synthetic graphs. The experiments on the synthetic graphs will help us to assess the effectiveness of our approaches. Finally, we plan to implement solutions to the remaining two problem formulations.

Acknowledgements

The research was conducted under the supervision of prof. Aristides Gionis at the DMG Group of Aalto University, Espoo, Finland, thanks to the SoBigData TNA program.

References

[1] Phithakkitnukoon, Santi, et al. "Activity-aware map: Identifying human daily activity pattern using mobile phone data." International Workshop on Human Behavior Understanding. Springer, Berlin, Heidelberg, 2010.
[2] Yuan, Yihong, and Martin Raubal. "Extracting dynamic urban mobility patterns from mobile phone data." International Conference on Geographic Information Science. Springer, Berlin, Heidelberg, 2012.
[3] Toole, Jameson L., et al. "Inferring land use from mobile phone activity." Proceedings of the ACM SIGKDD international workshop on urban computing. ACM, 2012.
[4] Candia, Julián, et al. "Uncovering individual and collective human dynamics from mobile phone records." Journal of physics A: mathematical and theoretical 41.22 (2008): 224015.
[5] Traag, Vincent A., et al. "Social event detection in massive mobile phone data using probabilistic location inference." Privacy, security, risk and trust (PASSAT) and 2011 IEEE Third International conference on social computing (SocialCom). IEEE, 2011.
[6] A. Angel, N. Sarkas, N. Koudas, and D. Srivastava. Dense subgraph maintenance under streaming edge weight updates for real-time story identification. Proceedings of the VLDB Endowment, 5(6):574–585, 2012.
[7] P. Rozenshtein, N. Tatti, and A. Gionis. Finding dynamic dense subgraphs. ACM Transactions on Knowledge Discovery from Data, 11(3):27, 2017.
[8] P. Rozenshtein, A. Anagnostopoulos, A. Gionis, and N. Tatti. Event detection in activity networks. In Proceedings of the 20th ACM International Conference on Knowledge Discovery and Data Mining, pages 1176–1185, 2014.
[9] T. Fawcett and F. Provost. Activity monitoring: Noticing interesting changes in behavior. In Proceedings of the 5th ACM International Conference on Knowledge Discovery and Data Mining, pages 53–62, 1999.
[10] M. Mongiovi, P. Bogdanov, R. Ranca, E. E. Papalexakis, C. Faloutsos, and A. K. Singh. Netspot: Spotting significant anomalous regions on dynamic networks. In Proceedings of the 2013 SIAM International Conference on Data Mining, pages 28–36, 2013.
[11] M. lgorzata Steinder and A. S. Sethi. A survey of fault localization techniques in computer networks. Science of computer programming, 53(2):165– 194, 2004.
[12] J. Zhang and J. Feigenbaum. Finding highly correlated pairs efficiently with powerful pruning. In Proceedings of the 15th ACM International Conference on Information and Knowledge Management, pages 152–161, 2006.
[13] J. Chan, J. Bailey, and C. Leckie. Discovering correlated spatio-temporal changes in evolving graphs. Knowledge and Information Systems, 16(1):53– 96, 2008.
[14] Z. Guan, X. Yan, and L. M. Kaplan. Measuring two-event structural correlations on graphs. Proceedings of the VLDB Endowment, 5(11):1400–1411, 2012.
[15] P. Palensky, and D. Dietmar. Demand side management: Demand response, intelligent energy systems, and smart loads. IEEE Transactions on Industrial Informatics, 7(3): 381-388, 2011.

SoBigData Articles