From the course: Learning Data Science: Understanding the Basics

Address big data problems

- Big data and data science have become so intertwined that many organizations see them as one thing. Remember, data science applies the scientific method to your data. That doesn't mean you need a lot of data to ask your questions. Big data provides a robust new source of data. This new source allows you to ask questions that couldn't be answered with a smaller data set. Often, more data points provide more power during statistical analysis.
Big data sounds like the title of a 1960s horror movie. You picture some screaming woman in cat eye glasses being swallowed by an oozing mountain of data. In reality, big data isn't really a noun. In the original NASA paper, it was described as a big data problem. You could read this one of two ways. It's a "big data" problem or a big "data problem". If you read the whole paper, it sounds like they put the emphasis on the problem. It's not about big data, it's about the problem of data that's too big to store. You see the same emphasis later in the McKinsey report, which refers to big data as data that exceeds the capability of commonly used hardware and software.
So why is it important to think of big data as a problem and not a noun? Well, it's because many companies that start big data projects don't actually have big data. The data might seem big to you because there's a lot of it. It might also seem like a problem because it's a real challenge to store and collect, but that doesn't make it a big data problem.
One way you can determine if you have a big data problem is to see if your data falls into four categories. You can remember these categories as the Four Vs. Ask yourself these questions: Do I have a very high volume of data? Do I have a wide variety of data? Is the data coming in at a high velocity? Does the data I'm collecting have veracity, meaning will it lead to some useful knowledge or insights? To be big data, it needs to have all four of these attributes.
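The four-question check above can be sketched as a small Python function. This is only an illustration, not something from the course: the `DataProfile` class and the `is_big_data_problem` function are hypothetical names, and the yes/no answers are judgment calls you supply yourself rather than anything the code measures.

```python
from dataclasses import dataclass


@dataclass
class DataProfile:
    """Hypothetical record of your yes/no answers to the Four Vs questions."""
    high_volume: bool     # Do I have a very high volume of data?
    wide_variety: bool    # Do I have a wide variety of data?
    high_velocity: bool   # Is the data coming in at a high velocity?
    has_veracity: bool    # Will it lead to useful knowledge or insights?


def is_big_data_problem(profile: DataProfile) -> bool:
    """Return True only when all four attributes are present."""
    return all(
        (
            profile.high_volume,
            profile.wide_variety,
            profile.high_velocity,
            profile.has_veracity,
        )
    )


# Example: lots of fast-moving data, but all of one type.
single_type_feed = DataProfile(
    high_volume=True,
    wide_variety=False,
    high_velocity=True,
    has_veracity=True,
)

# Example: all four questions answered "yes".
mixed_media_stream = DataProfile(
    high_volume=True,
    wide_variety=True,
    high_velocity=True,
    has_veracity=True,
)
```

Under this sketch, `is_big_data_problem(single_type_feed)` is false because one of the four answers is "no", while `is_big_data_problem(mixed_media_stream)` is true.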
You may wonder whether or not you have a high enough volume of data. The volume question is usually pretty easy. If you're collecting petabytes of data each day, then you probably have enough volume. Of course, that threshold might not always hold. In the near future, maybe it will take an exabyte to be considered a high enough volume to be a problem.
You might also wonder whether you have a large variety of data. The variety question is a little trickier. Think of the New York Stock Exchange. They handle millions of transactions each day, so they could have a high volume of data. It's also coming in at a high velocity. The stock prices are pouring in and fluctuating in milliseconds, but if you think about it, it's all the same type of data. It's usually just a stock symbol and the price. It's mostly text. They don't collect pictures or sounds or news stories. So they don't have a big data problem. They certainly collect a lot of data, but the technology they have in place should be more than capable of handling the challenge.
Finally, do you have enough data veracity? Imagine you want to create a database that collects all the tweets and Facebook posts about your website. You grab videos, pictures, and text, and several petabytes of data stream into your cluster every day. You run reports to see if your customers feel positive about your product. After you look through the data, you realize that it doesn't reveal the customer's mood. All that effort was spent collecting useless data.
When you think about big data, try to remember the Four Vs. That will help you determine whether or not you have a big data problem.
An interesting big data problem is the challenge of self-driving cars. Think about the type of data you'll need to collect. You have to collect massive amounts of video, sounds, traffic reports, and GPS data. It will all be flowing into the database in real time at a high velocity. Then the car will have to figure out which data has the highest veracity. Is that person on the side of the road screaming because of a sports match? Maybe they're screaming because someone is standing in the road. A human driver has seconds to figure that out. A big data car will have to process the video, audio, and traffic coordinates, then decide whether to come to a stop or just ignore the sound. That's a real big data problem.
Try to remember the difference between big data and data science. Big data will allow you to ask more interesting questions, but that doesn't mean all interesting questions need big data. Focus on the science. That way, no matter how much data you have, you'll always be able to ask the best questions.
