From the course: R Programming in Data Science: High Variety Data

Perspectives on high-variety data

- [Instructor] Data can be said to come in three forms. High volume, where the data set is large and unwieldy and can reach into the range of gigabytes, terabytes, or petabytes. High variety, where the data is a rich mixture of sources; much of the available data fits into this category. Examples include spreadsheets from different companies and data stored in flat files, relational databases, or NoSQL databases. You can also find data stored as HTML, text, tab-separated, or comma-separated files. Finally, high velocity, where the data comes in at high speeds. Examples include 250 million tweets per day streaming from Twitter.com, 100 gigabytes of data per day streaming from the New York Stock Exchange, and just about anything coming from real-time sensors on the internet of things, which can arrive at a gigabit per second. This isn't to say that data can't fall into more than one of these categories. High-velocity data, such as the Stock Exchange feed, recorded over a period of time rapidly results in a high volume of data. But in this course, we'll focus on the challenges presented by handling one of these three: high-variety data.

Data scientists deal with data in different formats; it's part of the job. Data is collected and stored as text files, tab-delimited files, audio files, SQL databases, and other formats. We'll see a short example of reading a few of these formats in a moment. This range of formats happens for three reasons.

Sometimes a format is chosen because it's just easier for everyone involved. It may be that the people collecting the data are familiar with a specific collection tool, like Excel. In this case, it's easier to use Excel to collect the data and clean it up later with an automated process. It would be difficult to create a custom tool and train the data collection team in its proper use, probably discovering bugs along the way and having to rebuild the tool. And as always, a person's time is more valuable than machine time. Selecting a convenient or familiar tool will increase the amount of data collected and possibly improve its quality. In cases where the format really doesn't matter, don't overengineer the process; trial and error can be a perfectly acceptable strategy.

The second reason for differing file formats is cost, and this becomes a complicated problem. Aside from the fact that more data costs more, data has other related costs. For example, is your data being acquired through an online survey or by poll workers knocking on doors? Acquiring data has a cost. Does your data fit on a piece of paper, or does it require terabytes of secure, fault-resistant storage? Secure storage is more expensive, but cheaper than a data breach. Finally, will your analysis require special tools to work with a particular data format? For example, consider the cost of processing data stored as audio or video compared to the cost of processing a text transcript. Each of these considerations can dictate the variety of data chosen for a particular data-science task.

The third reason for the variety of data formats is the nature of the data itself. As an example, consider the wide range of file formats; each is customized for a specific type of data. It would be foolish to store an audio recording as a flat text file. The resulting text file would be unreadable and would most likely lose important parts of the data. Plus, a text file can't store compressed binary data, so storing audio as text would be incredibly inefficient and use an extraordinary amount of storage.
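To make the variety concrete, here is a minimal sketch of reading a few of those formats in R. The file names, the column of text notes, and the SQLite table are hypothetical placeholders, and the database portion assumes the DBI and RSQLite packages are installed.

    # Comma-separated values
    survey <- read.csv("survey.csv", stringsAsFactors = FALSE)

    # Tab-separated values
    results <- read.delim("results.tsv", stringsAsFactors = FALSE)

    # Plain text, one line per element
    notes <- readLines("notes.txt")

    # A relational database, via DBI and RSQLite;
    # "store.sqlite" and the "orders" table are hypothetical
    library(DBI)
    con <- dbConnect(RSQLite::SQLite(), "store.sqlite")
    orders <- dbGetQuery(con, "SELECT * FROM orders")
    dbDisconnect(con)

    # Each reader returns an ordinary R object we can inspect
    str(survey)

Whatever the source, the goal is the same: get the data into a data frame so the rest of the analysis doesn't care where it came from.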
In summary, a variety of data formats has evolved to store a variety of data. It's our job as data scientists to read, write, and clean that data.
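As a minimal sketch of that read-clean-write cycle in base R (the file and column names here are hypothetical):

    # Read a comma-separated file
    raw <- read.csv("responses.csv", stringsAsFactors = FALSE)

    # Clean: trim stray whitespace and coerce a text column to numeric
    raw$name  <- trimws(raw$name)
    raw$score <- as.numeric(raw$score)

    # Drop rows where the numeric conversion failed
    clean <- raw[!is.na(raw$score), ]

    # Write the cleaned data back out
    write.csv(clean, "responses_clean.csv", row.names = FALSE)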
