From the course: Learning Data Science: Understanding the Basics

Sift through big garbage

- Unstructured data brings a whole new set of challenges. One of the first questions you run into is whether you ever want to delete some of your data. Remember that a data science team applies the scientific method to its data, so you want to be able to ask interesting questions. That means you need to decide if there's any limit to the questions that you'll ever want to ask. There are good arguments both for keeping everything and for throwing away parts of your data.
Some data analysts argue that you'll never know every question that you might want to ask. It's also relatively cheap to keep massive amounts of data, usually only a few cents per gigabyte. You may as well keep it all rather than make hard decisions about what to throw away. It might be cheaper to buy new hard drives than it is to spend time in long retention meetings.
On the other hand, some analysts argue that you should throw away some of your data. There's a lot of garbage in those big data clusters. The more garbage you have, the more difficult it is to find interesting results. Some analysts call this data noise. This is a real struggle, and many data science teams are still trying to figure it out. How do you deal with all that big garbage?
I once worked for a company that was facing this challenge. They owned a website that connected potential car buyers with auto dealerships. They created a tagging system that would record everything that their customers looked at while on the website. Anytime a customer rolled over an image, the database would add a new record. Everywhere customers went and every link they hit was collected by this system of tags. The system grew into thousands of tags, and each tag had millions of transactions. Only a few people in the company understood what data the tags captured, which made it difficult to create interesting reports. They could capture how many people rolled over a tag, but only a few people knew what the tag meant. They used the same tagging system with their unstructured data.
They started collecting advertisements and Flash videos. They wanted to connect the tag to the image and then to the transaction. That way, they could see the image that the customer clicked on. Then there was a tag that said where the image was located on the page. It all went into their growing cluster.
Some people on the team argued that much of the data was obsolete. Only a few people knew the tagging system well enough to understand the data, and because the advertisements were constantly changing, the team had started to rename the tags. So much of the data was already obsolete. Others argued that this was just a very small amount of data compared to what could be stored in a Hadoop cluster. Who cared if you had a couple of extra gigabytes of obsolete data? It wasn't worth the effort to clean up.
These are the kinds of challenges you'll deal with as well. There are a few things to keep in mind. The first is that there really isn't a right answer. Your data science team just needs to figure out what works best for them. If you decide to keep everything, then you'll probably have to work a little harder when you're creating interesting reports. There'll be a little bit more filtering and a little bit more noise in your data. If you decide to throw away your big garbage, then you'll have a cleaner cluster. Yet there's some chance you'll inadvertently throw away something and one day regret it. It's almost like when you clean out your closets. You never know if that wide-collared jacket will come back in style. But if you keep too many jackets, then you may forget what you already have.
The most important thing is to make sure that your team makes one decision or the other. You don't want to have a data policy that changes every few months. Decide at the beginning whether you plan on keeping everything or throwing some things away. If you do throw things away, work with the team to make sure that everyone agrees about what can be thrown away.
If you don't have a set data retention policy, then you're in some danger of corrupting your data. If you don't know what you've thrown away, then it's going to be very difficult to make sense of reports. Try to decide early what works best for your organization.
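One way to keep track of what you've thrown away is to log it as part of the retention step itself. Here's a minimal sketch in Python, assuming a simple age-based rule. The Record class, the tag names, the apply_retention function, and the 365-day limit are all hypothetical illustrations, not the system the company in the story used.

```python
from dataclasses import dataclass
from datetime import date, timedelta


@dataclass
class Record:
    tag: str           # hypothetical tag name
    recorded_on: date  # when the transaction was captured


def apply_retention(records, max_age_days, today):
    """Keep records newer than the cutoff and log what was discarded."""
    cutoff = today - timedelta(days=max_age_days)
    kept = [r for r in records if r.recorded_on >= cutoff]
    discarded = [r for r in records if r.recorded_on < cutoff]

    # Count discarded records per tag so later reports can state what is missing.
    discard_log = {}
    for r in discarded:
        discard_log[r.tag] = discard_log.get(r.tag, 0) + 1
    return kept, discard_log


# Made-up sample data, for illustration only.
records = [
    Record("banner_rollover", date(2020, 1, 15)),
    Record("banner_rollover", date(2023, 6, 1)),
    Record("dealer_link_click", date(2019, 3, 9)),
]
kept, discard_log = apply_retention(
    records, max_age_days=365, today=date(2023, 12, 31)
)
```

The point isn't the specific rule. It's that the team agrees on one rule, applies it the same way every time, and keeps a record of what was removed so anyone reading a report knows what the data no longer contains.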
