- Big data can be characterized by more than the three Vs that we mentioned in the previous movie. Those were volume, velocity, and variety. There are several practical differences as well. Jules Berman has a book called Principles of Big Data: Preparing, Sharing, and Analyzing Complex Information. He lists 10 ways that big data's different from small data, and I want to go through some of those points here. The first is goals. Small data is usually gathered for a specific goal. Big data on the other hand may have a goal in mind when it's first started, but things can evolve or take unexpected directions.
The second is location. Small data is usually in one place, and often in a single computer file. Big data on the other hand can be in multiple files on multiple servers, on computers in different geographic locations. Third, the data structure and content. Small data is usually highly structured like an Excel spreadsheet, and it's got rows and columns of data. Big data on the other hand can be unstructured, it can involve many file formats from across disciplines, and may link to other resources.
Fourth, data preparation. Small data is usually prepared by the end user for their own purposes, but with big data the data is often prepared by one group of people, analyzed by a second group of people, and then used by a third group of people, and they may have different purposes, and they may come from different disciplines. Fifth, longevity. Small data is usually kept for a specific amount of time after the project is over because there's a clear ending point. In the academic world it's maybe five or seven years and then you can throw it away, but with big data each data project, because it often comes at a great cost, gets continued into others, and so you have data in perpetuity, and things are going to stay there for a very long time.
They may be added on to in terms of new data at the front, or contextual data of things that occurred beforehand, or additional variables, or linking up with different files. So it has a much longer and really uncertain lifespan compared to a small data set. The sixth is measurements. Small data is typically measured with a single protocol using set units and it's usually done at the same time. With big data on the other hand, because you can have people in very different places, in very different times, different organizations, and countries, you may be measuring things using different protocols, and you may have to do a fair amount of conversion to get things consistent.
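To make that conversion step a little more concrete, here's a minimal sketch in Python. The unit table, the `to_meters` helper, and the sample readings are all made up for illustration; a real project would have its own units and protocols to reconcile.

```python
# Hypothetical conversion table: the factor that turns each unit into meters.
TO_METERS = {"m": 1.0, "ft": 0.3048, "km": 1000.0}


def to_meters(value, unit):
    """Convert a length to meters, failing loudly on units we don't know."""
    if unit not in TO_METERS:
        raise ValueError(f"unknown unit: {unit!r}")
    return value * TO_METERS[unit]


# Made-up readings recorded by different groups using different units.
readings = [(1288.0, "m"), (4226.0, "ft"), (1.3, "km")]

# After conversion, every reading is on the same scale and can be compared.
normalized = [round(to_meters(value, unit), 1) for value, unit in readings]
print(normalized)
```

Raising an error on an unknown unit, rather than guessing, is a deliberate choice here: with data coming from many places, a silent wrong conversion is harder to track down than a loud failure.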
Number seven is reproducibility. Small data sets can usually be reproduced in their entirety if something goes wrong in the process. With big data sets on the other hand, because they come in so many forms and from different directions, it may not be possible to start over again if something's gone wrong. Usually the best you can hope to do is to at least identify which parts of the data project are problematic and keep those in mind as you work around them. Number eight is stakes.
On small data, if things go wrong the costs are limited, it's not an enormous problem, but with big data, projects can cost hundreds of millions of dollars, and losing the data or corrupting the data can doom the project, possibly even the researcher's career or the organization's existence. The ninth is what's called introspection, and what this means is that the data describes itself in an important way. With small data, the ideal for instance is what's called a triple that's used in several programming languages where you say, first off, the object that is being measured.
Here, I say Salt Lake City, Utah, USA, that's where I'm from. Second, you say what is being measured, a descriptor for the data value. In this case, average elevation in feet. Then third, you give the data value itself, 4,226 feet above sea level. In a small data set, things tend to be well-organized, individual data points can be identified, and it's usually clear what things mean. In a big data set however, because things can be so complex with many files and many formats, you may end up with information that is unidentifiable, unlocatable, or meaningless.
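The triple described above can be sketched in Python as a small named tuple. This is only one possible representation, and the `Triple` class, its field names, and the `describe` helper are assumptions made for illustration, not something taken from Berman's book.

```python
from typing import NamedTuple


class Triple(NamedTuple):
    """One self-describing data point (hypothetical field names)."""
    obj: str         # the object that is being measured
    descriptor: str  # what is being measured
    value: float     # the data value itself


def describe(t):
    """Render a triple so the value can be read together with its context."""
    return f"{t.obj}: {t.descriptor} = {t.value}"


# The example from the talk: object, descriptor, value.
elevation = Triple("Salt Lake City, Utah, USA", "average elevation in feet", 4226)
print(describe(elevation))

# The same number stripped of its object and descriptor.
bare_value = 4226
```

The `bare_value` at the end is the situation to watch for: the number is still there, but nothing says what it measures or where it belongs.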
Obviously, that compromises the utility of big data in those situations. The final characteristic is analysis. With small data it's usually possible to analyze all of the data at once in a single procedure from a single computer file. With big data however, because things are so enormous and they're spread across lots of different files and servers, you may have to go through extraction, reviewing, reduction, normalization, transformation, and other steps and deal with one part of the data at a time to make it more manageable, and then eventually aggregate your results.
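Here's a minimal sketch in Python of that piece-by-piece approach, using an average as the example. The in-memory `chunks` list stands in for files or servers, and the function names and sample values are made up for illustration. Each chunk is reduced to a partial sum and count, and the partial results are aggregated at the end.

```python
def chunk_summary(values):
    """Reduce one chunk to the partial results needed for a mean."""
    cleaned = [v for v in values if v is not None]  # drop missing values
    return sum(cleaned), len(cleaned)


def combined_mean(chunks):
    """Aggregate per-chunk sums and counts into one overall mean."""
    total, count = 0.0, 0
    for chunk in chunks:
        partial_sum, partial_count = chunk_summary(chunk)
        total += partial_sum
        count += partial_count
    if count == 0:
        raise ValueError("no usable values in any chunk")
    return total / count


# Made-up data: each inner list stands in for one file or one server.
chunks = [[1.0, 2.0, None], [3.0], [4.0, None, 5.0]]
print(combined_mean(chunks))
```

Only a sum and a count are kept per chunk, so no single step ever needs all of the data at once. Not every statistic splits this cleanly; a median, for instance, can't be combined from per-chunk medians this way.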
So it becomes clear from this that there's more than just volume, and velocity, and variety. There are a number of practical issues that can make things more complex with big data than with small data. On the other hand, as we go through this course we're going to talk about some of the general ways of dealing with these issues to get the added benefit of big data and avoid some of the headaches.
- Evaluate the demand for data science in business, research, and consumer technology.
- Assess the careers and skills in data science.
- Review the ethical issues in data science.
- Explore data visualization with graphing tools.
- Discover how data scientists use tools such as Hadoop and Excel.