Learn how to avoid problems in data analysis by understanding the characteristics of your datasets.
- [Instructor] Exploratory data analysis is the process of using queries, statistics and visualization to help us understand important properties of a dataset. Queries help us understand what subsets of a dataset look like. Statistics help us understand global properties of a dataset by using descriptive numeric measures like the mean of an attribute or the maximum and minimum value in a column. Visualizations help us see the global properties of datasets.
Now instead of reducing properties to a single number, like the mean, visualizations actually help us see the properties. The reason we want to understand our datasets is that it helps us avoid problems with data analysis. Problems can occur if we make incorrect assumptions about the properties of our data. For example, here is a dataset with a single attribute, and the average of that attribute is 55. Now the average does not, by itself, give us much information about the shape of the dataset.
This dataset also has an average of 55, but it looks significantly different from the previous example. Another benefit of exploratory data analysis is that it can help us identify problems with datasets including missing values, and values that are outside of an expected range.
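To make the point about averages concrete, here is a minimal sketch using only Python's standard library. The two datasets, the `readings` list, the expected range of 0 to 100, and the helper names `is_missing` and `quality_report` are all invented for illustration; they are not the datasets shown in the video.

```python
import math
import statistics

# Two made-up datasets (illustrative values only). Both have a mean of 55.
tight = [53, 54, 55, 55, 56, 57]
spread = [5, 10, 55, 55, 100, 105]

for name, values in (("tight", tight), ("spread", spread)):
    # The mean is identical, but the spread and the range are very different.
    print(name,
          "mean:", statistics.mean(values),
          "std dev:", round(statistics.pstdev(values), 2),
          "min:", min(values),
          "max:", max(values))


def is_missing(value):
    """Treat None and NaN as missing (an assumption; real data may use other markers)."""
    return value is None or (isinstance(value, float) and math.isnan(value))


def quality_report(values, low, high):
    """Count missing entries and collect values outside the expected [low, high] range."""
    missing = sum(1 for v in values if is_missing(v))
    present = [v for v in values if not is_missing(v)]
    out_of_range = [v for v in present if not (low <= v <= high)]
    return {"missing": missing, "out_of_range": out_of_range}


# Hypothetical readings that are expected to fall between 0 and 100.
readings = [52, None, 61, 250, 48, -3, 55]
report = quality_report(readings, low=0, high=100)
print(report)
```

The mean alone cannot tell the two datasets apart, while the standard deviation, minimum, and maximum can. The same kind of quick check surfaces the missing entry and the two values outside the assumed range.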
- Exploratory data analysis vs. hypothesis-driven statistical analysis
- Performing data quality checks
- Calculating quartiles
- Using box plots to understand the distribution of values
- Using histograms to understand the frequency of values
- Using chi-square to understand the correlation between values
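
Two of the topics listed above, quartiles and histograms, can be sketched with Python's standard library alone. The sample values, the bin width of 20, and the 1.5 × IQR outlier rule are assumptions chosen for illustration, not necessarily what the course uses. Note that `statistics.quantiles` supports more than one quartile convention, so other tools may report slightly different cut points for the same data.

```python
import statistics
from collections import Counter

# Hypothetical sample values, chosen for illustration.
values = [12, 15, 17, 21, 22, 25, 28, 31, 35, 90]

# Quartiles: the three cut points that split the sorted data into four parts.
# method="inclusive" treats the data as the whole population of interest.
q1, q2, q3 = statistics.quantiles(values, n=4, method="inclusive")
iqr = q3 - q1  # interquartile range, the height of the box in a box plot

# A common box plot convention (assumed here): points beyond 1.5 * IQR
# from the box are drawn individually as potential outliers.
low_fence = q1 - 1.5 * iqr
high_fence = q3 + 1.5 * iqr
outliers = [v for v in values if v < low_fence or v > high_fence]
print("quartiles:", q1, q2, q3, "outliers:", outliers)

# Histogram: count how many values fall into each fixed-width bin.
bin_width = 20
histogram = Counter((v // bin_width) * bin_width for v in values)
for start in sorted(histogram):
    print(f"{start:>3}-{start + bin_width - 1:<3} {'#' * histogram[start]}")
```

The quartiles summarize where the middle half of the data sits, and the text histogram shows at a glance that one value lies far from the rest.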