Abstract
In the past two decades, new developments in computing, sensing and crowdsourced data have resulted in an explosion in the availability of quantitative information. The possibilities of analyzing this so-called 'big data' to inform research and the decision-making process are virtually endless. In general, analyses have to be done across multiple data sets in order to extract the most value from big data. An important first step is to identify temporal correlations between data sets. Given the characteristics of big data in terms of volume and velocity, techniques that identify correlations not only need to be scalable, but also need to help users order the correlations across temporal resolutions so that they can focus on important relationships. There is a large body of work in this area; however, most existing approaches either deal only with small data sets, use a fixed temporal resolution, or do not provide a quantifiable measure of a correlation's significance. In this paper, we present a method based on mutual information to identify correlations in large data sets. Discovered correlations are suggested to users in an order based on their significance. Our method supports an adaptive streaming technique that minimizes duplicated computation and is implemented on top of Apache Spark for scalability using big data platforms. We also provide a comprehensive evaluation using real-world data sets from NYC Open Data, and compare our findings against a recent study.
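As a rough illustration of the idea of scoring temporal correlations with mutual information at several temporal resolutions, the sketch below may help. It is not the authors' implementation: the abstract does not specify the estimator, the significance measure, or the adaptive streaming scheme, and the Spark layer is omitted entirely. It assumes a simple plug-in (histogram) estimate with equal-width bins, coarsening by summing fixed-size blocks, and ranking by the raw mutual information score. All function names (`discretize`, `mutual_information`, `aggregate`, `rank_resolutions`) and parameters (`bins`, `factors`) are hypothetical.

```python
import math
from collections import Counter


def discretize(values, bins=4):
    """Map each value to an equal-width bin index in [0, bins - 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # A constant series carries no information; put everything in one bin.
        return [0] * len(values)
    width = (hi - lo) / bins
    return [min(int((v - lo) / width), bins - 1) for v in values]


def mutual_information(xs, ys, bins=4):
    """Plug-in estimate of I(X; Y) in bits from paired samples.

    Assumption: equal-width binning; the paper may use a different estimator.
    """
    if len(xs) != len(ys):
        raise ValueError("series must have the same length")
    n = len(xs)
    dx, dy = discretize(xs, bins), discretize(ys, bins)
    joint = Counter(zip(dx, dy))
    px, py = Counter(dx), Counter(dy)
    mi = 0.0
    for (a, b), c in joint.items():
        # p(a, b) * log2(p(a, b) / (p(a) * p(b))), written with raw counts.
        mi += (c / n) * math.log2(c * n / (px[a] * py[b]))
    return mi


def aggregate(values, factor):
    """Coarsen a series by summing consecutive full blocks of `factor` samples."""
    return [
        sum(values[i:i + factor])
        for i in range(0, len(values) - factor + 1, factor)
    ]


def rank_resolutions(xs, ys, factors=(1, 2, 4), bins=4):
    """Score each temporal resolution and return (factor, score) pairs, best first.

    Assumption: ranking by raw mutual information stands in for the paper's
    significance ordering, which the abstract does not define.
    """
    scores = [
        (f, mutual_information(aggregate(xs, f), aggregate(ys, f), bins))
        for f in factors
    ]
    return sorted(scores, key=lambda s: s[1], reverse=True)
```

A caller could pass two equally sampled series, for example hourly counts from two data sets, to `rank_resolutions` and inspect the highest-scoring resolutions first. Note that plug-in estimates are biased upward for short series, so scores at coarse resolutions (few samples) should be read with caution.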
Original language | English |
---|---|
Journal | Proceedings - 2016 IEEE International Conference on Big Data, Big Data 2016 |
Pages (from-to) | 666-675 |
Number of pages | 10 |
Publication status | Published - 2016 |
Externally published | Yes |
Event | 4th IEEE International Conference on Big Data, Big Data 2016 - Washington, United States. Duration: 5 Dec 2016 → 8 Dec 2016 |
Conference
Conference | 4th IEEE International Conference on Big Data, Big Data 2016 |
---|---|
Country/Territory | United States |
City | Washington |
Period | 05/12/2016 → 08/12/2016 |
Sponsor | Cisco, et al., Huawei Technologies Co., Ltd., IEEE, IEEE Computer Society, National Science Foundation (NSF) |
Bibliographical note
Publisher Copyright: © 2016 IEEE.
Keywords
- adaptive sliding window
- Big Data
- mutual information
- streaming
- temporal correlation