From the course: Learning Data Science: Understanding the Basics

Collect unstructured data

- We've gone through a lot, so let's recap a little. In general, your data science teams will work with three different data types. There's your structured data. That's the data that's most like the data in your spreadsheet. It has a set order and a consistent format. It's usually stored in a relational database. Then there's your semistructured data. That's the data with some structure, but there's added flexibility to change some of the field names. Finally, there's the most common type of data. It's everything else: the unstructured data. Some analysts estimate that 80% of your data is unstructured.
When you think about it, this makes a lot of sense. Think about the data you create every day. Every time you leave a voice mail. Every picture you upload to Facebook. The Microsoft Word memo you created at work or the PowerPoint presentation. Even when you search the web, what you find is mostly unstructured. That search for cats will bring up videos, books, and even music.
So what does all this data have in common? Well, that's one of the key challenges. The short answer is not much. It's schemaless. Remember that schemas are like a map that shows the data's fields, tables, and relationships. You won't have that with unstructured data. With unstructured data, the format depends on the file. A Microsoft Word document might have a set format, yet that format is only used by that application. It's not the format for all text documents. That's why you typically can't edit Microsoft Word documents in another program.
That also means that there's no set data model. There's no consistent place to look for field names and data. How could you figure out the title and contents of dozens of different types of files? What if some of them were PDFs, Microsoft Word documents, and PowerPoint presentations? Each one of them has its own format. There's no field to look at that says document title.
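To make the "no document title field" problem concrete, here is a minimal Python sketch. It is only an illustration, not how any search engine works. The function name `guess_title` and the per-format rules are assumptions made up for this example, and it uses simple text-based formats (JSON, HTML, Markdown, plain text) as stand-ins for the PDFs, Word documents, and PowerPoint files mentioned above.

```python
import json
import re
from typing import Optional


def guess_title(filename: str, content: str) -> Optional[str]:
    """Guess a document's title using a different rule for each file type.

    Hypothetical helper for illustration only: the rules below are
    assumptions, not a standard.
    """
    ext = filename.rsplit(".", 1)[-1].lower() if "." in filename else ""

    if ext == "json":
        # Semistructured: a "title" key may or may not be present.
        data = json.loads(content)
        return data.get("title") if isinstance(data, dict) else None

    if ext in ("html", "htm"):
        # HTML keeps its title inside a <title> tag.
        match = re.search(r"<title[^>]*>(.*?)</title>", content,
                          re.IGNORECASE | re.DOTALL)
        return match.group(1).strip() if match else None

    if ext == "md":
        # Markdown has no title field; assume the first "# " heading.
        for line in content.splitlines():
            if line.startswith("# "):
                return line[2:].strip()
        return None

    # Plain text or unknown: fall back to the first non-empty line.
    for line in content.splitlines():
        if line.strip():
            return line.strip()
    return None
```

Every new file type needs another branch, and binary formats such as PDF or Word would each need their own parser.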
This is a challenge that search companies like Google and Bing have been working on for years. How do you work with data that has no set format and no consistent data model? Every time you search these engines, you'll see the fruits of their labor. If you search for a term like cat, you'll see all the ways they found text, videos, pictures, and audio.
Working with unstructured data is one of the most interesting areas in data science. Newer NoSQL databases allow you to capture and store large files. It's much easier to store it all in one place. All those audio, video, picture, or text files can go into a NoSQL cluster. You can scale out your servers and use similar tools and software. If you want to capture everything, there are new tools for that as well. You can use big data technology like Hadoop for processing data in a cluster. Then you can work with that data using MapReduce or Apache Spark.
So let's go back to your running shoe website. The business has grown a little, and now you're part of a new data science team. You work with marketing and management to come up with your first interesting question: Who's the best running shoe customer? You gather up some basic biographical information. It was pretty easy to find in your customer database. You have their email address and the city and state where they live. You take that information and start crawling through your customers' social network posts. You start to gather all the unstructured data. Maybe your customer posted a video of finishing a marathon. You can send out a congratulatory tweet. You might also decide to start crawling through your customers' friends' posts. Maybe a customer's friend posted an image of running with a group of people. You can use unstructured data to identify these people and send them special promotions. These projects are typically called a 360-degree view of your customer. You're trying to find out everything you can about what motivates them.
You can then use that information to find your best customers and send promotions. You also may find that you have a few customers who are referring a lot of their friends. You may want to offer them special incentives and discounts. As time goes on, you can capture more and more of their unstructured data. Then you can ask more sophisticated questions. Are your customers more likely to travel? Are they more competitive? How often do they go to restaurants? Each of these questions, in its own way, can help you connect with your customer and sell more products.
Unstructured data is a resource that increases every day. Think about the things you did today that might be interesting to a company. Did you send a tweet about your long walk to work? Maybe you need better shoes. Did you complain about a rainy day? You should buy an umbrella. Unstructured data allows these companies to offer that level of interaction.
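The tweet examples above (a long walk suggests shoes, a rainy day suggests an umbrella) can be sketched as simple keyword rules over free text. This is a toy illustration under stated assumptions: the `RULES` table and the `suggest_products` function are hypothetical, and a real system would likely use far richer text analysis than word matching.

```python
import re
from typing import List

# Hypothetical rules: product -> words that hint a customer might want it.
RULES = {
    "running shoes": {"walk", "walking", "run", "running", "marathon"},
    "umbrella": {"rain", "rainy", "raining"},
}


def suggest_products(post: str) -> List[str]:
    """Return products whose hint words appear in a free-text post."""
    # Lowercase the post and split it into a set of words.
    words = set(re.findall(r"[a-z']+", post.lower()))
    # Keep every product with at least one matching hint word.
    return [product for product, hints in RULES.items() if words & hints]


print(suggest_products("Long walk to work today, my feet hurt"))
print(suggest_products("Ugh, another rainy day"))
```

The post itself has no fields to query, so the sketch has to pull meaning out of raw text before it can suggest anything.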
