In this video, Mark Niemann-Ross provides a framework for understanding high-volume data. Learn the difference between high-velocity, high-volume, and high-variety data, then explore the challenges of high-volume data and methods for handling them.
- [Instructor] Data can come in three ways: high volume, high variety, and high velocity. High-volume data is where the data set is large and unwieldy. It can reach into the range of gigabytes, terabytes, or even petabytes. High-variety data is where the data is a rich mixture of sources. Much of the available data fits into this category. Examples include spreadsheets, data stored in flat files, relational databases, or NoSQL databases.
You can find data stored as HTML, tab separated or comma separated files. High-velocity data is where the data comes in at high speeds. Examples include 250 million tweets per day streaming from Twitter.com or 100 gigabytes of data per day streaming from the New York Stock Exchange and just about anything coming from real-time sensors from the Internet of Things which can arrive at a gigabit per second.
This isn't to say that data arrives in only one of these categories. High-velocity data such as the stock exchange feed, recorded over a period of time, rapidly results in a high volume of data. But in this course, we'll focus on the challenges presented by handling one of those three: high-volume data. High-volume data presents these three challenges: available memory, processor speed, and visualization and overplotting.
Of course, there is some debate about what counts as high-volume data. Jan Wijffels proposed in his talk at the useR! conference that data can be tri-sected according to its size. Fewer than one million records, you can process with base R. One million to one billion records, you can still process with base R, but you'll need to take some extra effort. Greater than one billion records, now you're into the heavy-duty tools such as MapReduce and Spark.
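The tri-section above can be sketched in base R. This is a minimal illustration, not from the course: the record count and data frame below are hypothetical, and the thresholds simply encode Wijffels' three size bands.

```r
# Hypothetical data set: half a million records of random values
records <- 5e5
df <- data.frame(id = seq_len(records), value = rnorm(records))

# How much memory does this data set occupy?
print(object.size(df), units = "MB")

# Pick a toolchain based on Wijffels' tri-section by record count
if (nrow(df) < 1e6) {
  message("Under one million records: base R alone is fine")
} else if (nrow(df) < 1e9) {
  message("One million to one billion: base R, with extra effort")
} else {
  message("Over one billion: heavy-duty tools such as MapReduce or Spark")
}
```

With 500,000 records this lands in the first band, so base R alone suffices.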
Researcher danah boyd defines big data with three parameters: technology, analysis, and mythology. With technology, you're trying to maximize computer power to analyze large data sets. With analysis, you're trying to identify patterns. And with mythology, there is a belief that large data sets offer a higher form of intelligence, which may or may not be true. In practical terms, big data is unwieldy.
It can cause crashes and processing it can take forever. So if you're going to work with high-volume data, prepare your system and select the information you need to handle these huge demands. So let's take a deeper look at these challenges of dealing with high-volume data.
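One practical way to "select the information you need" is to drop unneeded columns at read time, so they never occupy memory at all. Here's a hedged sketch using base R's `read.csv()`; the file and column names are hypothetical, and the key idea is that a `colClasses` entry of `"NULL"` discards a column while parsing.

```r
# Write a small hypothetical CSV to illustrate selective reading
csv <- tempfile(fileext = ".csv")
write.csv(data.frame(id = 1:3,
                     keep = c(10, 20, 30),
                     unneeded = c("a", "b", "c")),
          csv, row.names = FALSE)

# colClasses = "NULL" tells read.csv to skip a column entirely,
# so it never consumes memory in the resulting data frame
slim <- read.csv(csv, colClasses = c(id = "integer",
                                     keep = "numeric",
                                     unneeded = "NULL"))
names(slim)  # only the "id" and "keep" columns survive
```

For a multi-gigabyte CSV, skipping even one wide text column this way can make the difference between fitting in memory and crashing.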
- Accessing memory and processing power
- Visualizing high-volume data
- Profiling and optimizing R code
- Compiling R functions
- Parallel processing with R
- Using R with other big data solutions