Big data analysis

Big Data bring new opportunities to modern society and new challenges to data scientists. On one hand, Big Data hold great promise for discovering subtle population patterns and heterogeneities that cannot be detected with small-scale data. On the other hand, the massive sample size and high dimensionality of Big Data introduce unique computational and statistical challenges, including scalability and storage bottlenecks, noise accumulation, spurious correlation, incidental endogeneity, and measurement errors. These challenges are distinctive and require new computational and statistical paradigms. This article gives an overview of the salient features of Big Data and of how these features drive paradigm changes in statistical and computational methods as well as in computing architectures. We also provide various new perspectives on Big Data analysis and computation. In particular, we emphasize the viability of the sparsest solution in a high-confidence set and point out that the exogeneity assumptions in most statistical methods for Big Data cannot be validated because of incidental endogeneity. Such unverifiable assumptions can lead to wrong statistical inferences and, consequently, wrong scientific conclusions.

We are entering the era of Big Data, a term that refers to the explosion of available information. This Big Data movement is driven by the fact that massive amounts of very high-dimensional or unstructured data are continuously produced and stored at a much lower cost than before. For example, in genomics we have seen a dramatic drop in the price of whole-genome sequencing. The same is true in other areas such as social media analysis, biomedical imaging, high-frequency finance, analysis of surveillance videos, and retail sales. The trend toward producing and storing data more massively and cheaply is likely to continue or even accelerate. This trend will have a deep impact on science, engineering, and business. For example, scientific advances are becoming more and more data-driven, and researchers will increasingly think of themselves as consumers of data. The massive amounts of high-dimensional data bring both opportunities and new challenges to data analysis. Valid statistical analysis for Big Data is becoming increasingly important.

1.2 Goals and Challenges of Analyzing Big Data

What are the goals of analyzing Big Data? Two main goals of high-dimensional data analysis have been proposed: to develop effective methods that can accurately predict future observations and, at the same time, to gain insight into the relationship between the features and the response for scientific purposes.
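
Of the challenges listed in this article, spurious correlation is perhaps the easiest to see in a small simulation. The sketch below is an illustration only and is not taken from the article: the function names, the sample size of 50, the Gaussian data, and the seed are all assumptions chosen for the example. It draws a response and a set of features that are mutually independent by construction, then reports the largest absolute sample correlation between the response and any single feature. As the number of features grows, that maximum tends to move well away from zero even though every true correlation is zero.

```python
import math
import random


def sample_correlation(x, y):
    """Pearson sample correlation of two equal-length sequences."""
    n = len(x)
    mean_x = sum(x) / n
    mean_y = sum(y) / n
    cov = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    var_x = sum((a - mean_x) ** 2 for a in x)
    var_y = sum((b - mean_y) ** 2 for b in y)
    return cov / math.sqrt(var_x * var_y)


def max_spurious_correlation(n_samples, n_features, rng):
    """Largest |sample correlation| between a response and features that
    are generated independently of it (every true correlation is zero)."""
    response = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
    best = 0.0
    for _ in range(n_features):
        # Each feature is drawn fresh and never depends on the response.
        feature = [rng.gauss(0.0, 1.0) for _ in range(n_samples)]
        best = max(best, abs(sample_correlation(feature, response)))
    return best


if __name__ == "__main__":
    rng = random.Random(0)  # arbitrary fixed seed for repeatability
    for n_features in (10, 100, 1000):
        value = max_spurious_correlation(50, n_features, rng)
        print(n_features, round(value, 3))
```

Running the script prints one line per feature count. With the sample size held fixed, the reported maximum typically rises as the feature count increases, which is why a feature that looks strongly related to the response in high-dimensional data may be unrelated to it in the population.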