In this video, Mark Niemann-Ross discusses visualization of high-volume data. Learn how to select an appropriate type of graph to communicate a message and avoid overplotting.
- [Instructor] Generally speaking, the more data you have, the better, right? Well, sometimes. This graph is overwhelmed with just a few outliers. When you create graphs and charts, remember you're telling a story. Good stories have interesting plots, compelling characters, and crystal clear conclusions. If your graph is muddled with too much data, your conclusions won't be crystal clear. Plots with too many data points are the classic problem of missing the forest for the trees.
Too much data on a graph causes overplotting, and it's a common hazard of dealing with high-volume data. There are five ways of dealing with overplotting: use a different type of graph; add rug or jitter, which we'll talk about in a minute; use statistics like linear modeling or clustering; subsample the data; or use something called trellis.
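As a rough illustration of the first option, switching to a different type of graph, here is a minimal sketch in base R. The data is simulated and the 40-by-40 grid is an arbitrary choice; neither comes from the course. It draws the overplotted scatterplot first, then replaces it with a heat map of point counts.

```r
set.seed(42)                     # simulated data, an assumption for illustration
n <- 50000
x <- rnorm(n)
y <- x + rnorm(n)

# With 50,000 points, a plain scatterplot collapses into a solid blob.
plot(x, y, pch = 20, main = "Overplotted scatterplot")

# Different graph type: count the points that fall in each cell of a grid.
x_bins <- cut(x, breaks = 40)    # 40 equal-width bins along x
y_bins <- cut(y, breaks = 40)    # 40 equal-width bins along y
counts <- table(x_bins, y_bins)
count_matrix <- matrix(as.vector(counts), nrow = nlevels(x_bins))

# Draw the counts as a heat map; darker cells hold more points.
image(count_matrix, col = rev(heat.colors(20)),
      main = "Same data as binned counts")
```

The heat map trades individual points for density, which is usually the message when a scatterplot has this many observations.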
R provides a selection of different graphs for different types of data and conclusions. Learning about the built-in graphs, as well as additional graphs in packages, will provide you with a rich visual vocabulary to express your ideas. R also provides rug and jitter to enhance graphs with clarifying information. Statistics can be applied to data sets to identify trends and simplify graphs.
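To make rug and jitter concrete, here is a small sketch using the base R functions `jitter()` and `rug()`. The ratings and scores are simulated, and the jitter amount of 0.2 is an assumed value chosen only so the points stay near their true integer positions.

```r
set.seed(1)                      # simulated data, an assumption for illustration
rating <- sample(1:5, 500, replace = TRUE)
score  <- sample(1:10, 500, replace = TRUE)

# Integer values stack exactly on top of each other, hiding how many there are.
plot(rating, score, pch = 20, main = "Stacked integer values")

# jitter() adds a small amount of uniform noise so stacked points spread out.
rating_j <- jitter(rating, amount = 0.2)
score_j  <- jitter(score, amount = 0.2)
plot(rating_j, score_j, pch = 20, main = "Jittered values with rug")

# rug() draws a tick on the axis for every observation.
rug(rating_j, side = 1)          # ticks along the bottom axis
rug(score_j, side = 2)           # ticks along the left axis
```

Jitter changes where a point is drawn, not the underlying data, so keep the amount small enough that each point still reads as its original value.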
It's not unusual to use a linear regression model or clustering to identify trends in high-volume data. Sometimes data can be subsampled, taking confidence levels into account and using random or weighted selection. I'll demonstrate some of these techniques for dealing with overplotting. The important thing to remember is that more is not better. Always consider the message you're trying to present, and the most efficient way to present that message.
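Here is one way statistics and subsampling might be combined, sketched in base R. The data frame `big`, the sample size of 2,000, and the true slope of 2 are all made up for the example. The model is fit on all of the data, but only a simple random subsample is drawn.

```r
set.seed(7)                      # simulated data, an assumption for illustration
n <- 100000
x <- runif(n, min = 0, max = 10)
y <- 3 + 2 * x + rnorm(n, sd = 4)
big <- data.frame(x = x, y = y)

# Statistics: fit a linear regression on the full data set.
fit <- lm(y ~ x, data = big)

# Subsampling: draw a simple random sample of 2,000 rows without replacement.
keep  <- sample(nrow(big), size = 2000)
small <- big[keep, ]

# Plot only the subsample, then overlay the trend estimated from all rows.
plot(small$x, small$y, pch = 20, col = "grey60",
     xlab = "x", ylab = "y", main = "Subsample with regression line")
abline(fit, col = "red", lwd = 2)
```

A weighted selection could be sketched the same way by passing a `prob` vector to `sample()`; which weights make sense depends on the data and the message.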