From the course: Learning Data Science: Understanding the Basics

Keep things simple with structured data

- When you're on a data science team, you'll often deal with many different types of data. These different types will be a key factor in determining how you want to store your data. Technologies like NoSQL will give you a lot of flexibility to store different data types. Relational databases give up some flexibility, but they're often easier to work with. When you think about how you want to store your data, you need to understand the different data types. The same is true with any storage. Certain databases are optimized for certain types of data. Just as you wouldn't want to store a sandwich in a water jug, you wouldn't want to set up a relational database to hold the wrong type of data. There are three types of data that your team should consider: structured, semistructured, and unstructured. The first type is usually the simplest. It's commonly referred to as structured data. It's the data that follows a specific format in a specific order. Structured data is like the bricks and mortar of the database world. It's cheap, inflexible, and requires a lot of upfront design. A good example of structured data is your typical office spreadsheet. Imagine that you're filling your rows with data. You have to stick to a pretty rigid structure. Let's say you create a column called Purchase Date. Each field in that column needs to follow a strict guideline. You can't put Tuesday in one row and then March in the next. You have to follow the correct format, and each row has to use the same format. Maybe you'd use a standard format like a numerical month followed by a slash and then a year. This structure is called the data model. Structured data relies on this data model. A data model is similar to a data schema in a relational database, except the schema works to define the whole database structure. Remember, a schema shows you how to organize your relational database. It includes the tables, the relationships, and how they interconnect.
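To make the idea of a schema a little more concrete, here is a minimal sketch using Python's built-in sqlite3 module. The table names, column names, and sample values are hypothetical and chosen only for illustration; the course does not prescribe any particular database or layout. The sketch shows a schema that defines two tables and the relationship between them.

```python
import sqlite3

# An in-memory database, so nothing is written to disk.
conn = sqlite3.connect(":memory:")
# SQLite leaves foreign-key checks off by default, so switch them on.
conn.execute("PRAGMA foreign_keys = ON")

# The schema: which tables exist and how they relate to each other.
# All names here are hypothetical examples.
conn.executescript("""
CREATE TABLE customers (
    customer_id INTEGER PRIMARY KEY,
    name        TEXT NOT NULL
);
CREATE TABLE purchases (
    purchase_id   INTEGER PRIMARY KEY,
    customer_id   INTEGER NOT NULL REFERENCES customers(customer_id),
    purchase_date TEXT NOT NULL,   -- month/year, for example 03/2024
    amount        REAL NOT NULL
);
""")

# List the tables that the schema defined.
table_names = [
    row[0]
    for row in conn.execute(
        "SELECT name FROM sqlite_master WHERE type = 'table' ORDER BY name"
    )
]
print(table_names)

# A purchase that points at an existing customer is accepted.
conn.execute("INSERT INTO customers (customer_id, name) VALUES (1, 'Example Customer')")
conn.execute(
    "INSERT INTO purchases (customer_id, purchase_date, amount) VALUES (1, '03/2024', 19.99)"
)

# A purchase that points at a customer who does not exist breaks the relationship.
try:
    conn.execute(
        "INSERT INTO purchases (customer_id, purchase_date, amount) VALUES (99, '03/2024', 5.00)"
    )
    orphan_rejected = False
except sqlite3.IntegrityError:
    orphan_rejected = True
print(orphan_rejected)
```

The upfront design work is visible here: the tables and their relationship have to be declared before any data goes in, and the database then holds every row to that design.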
A data model defines the structure in the individual fields. It's how you define what goes into each data field. It's there that you decide whether or not the field will contain text, numbers, or dates. If you think about your spreadsheet example, you can probably see the problem with ignoring your data model. Imagine you did just put Tuesday into the date field. Most spreadsheets will let you do this without any problem. Then on the row below, you put March. It seems easy enough, and it doesn't even feel like you're doing anything wrong. The problem happens later. Now imagine you want to create a report that displays all of the purchases in March. How would you do that? Would you use the number three or would you use just the word March? You certainly wouldn't put in the word Tuesday. If you did, then your spreadsheet would have a lot of data garbage. Every time you'd try to sort the data or create a report, there'd be a bunch of rows with invalid data. Then you'd have to go back and clean them up or just delete them from the report. That's why many spreadsheets create formatting rules. These are the rules that force you to follow the model when you're entering data. The same is true with databases. Many databases will reject the data that doesn't follow the model. Often a website or whatever middleware you use to collect the data will be set up for a specific type and format. Relational databases excel at collecting structured data. That means that there's a lot of structured data out there. A lot of the data that you access on websites or mobile apps will come from structured data. Your bank statements, your flight information, a bus schedule, and even your address book are forms of structured data. That doesn't mean that most data is structured. Actually, most data does not follow a specific format and structure. In fact, some of the most interesting data doesn't follow any structure at all. Data like videos, pictures, and audio have no defined structure.
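The Purchase Date problem can be sketched in a few lines of Python. This is only an illustration, assuming the month/year format mentioned in the spreadsheet example; the sample rows, the pattern, and the function name are hypothetical. It plays the role of a formatting rule: entries that follow the model are kept, and everything else is set aside before the March report is built.

```python
import re

# Assumed data model for the Purchase Date field:
# a numerical month (1-12, optional leading zero), a slash, a four-digit year.
PURCHASE_DATE_PATTERN = re.compile(r"(0?[1-9]|1[0-2])/(\d{4})")


def parse_purchase_date(value):
    """Return (month, year) if the value follows the model, otherwise None."""
    match = PURCHASE_DATE_PATTERN.fullmatch(value.strip())
    if match is None:
        return None
    return int(match.group(1)), int(match.group(2))


# Hypothetical spreadsheet rows: (customer, purchase date as typed).
rows = [
    ("Alice", "03/2024"),
    ("Bob", "Tuesday"),
    ("Cara", "March"),
    ("Dev", "3/2024"),
    ("Eve", "11/2023"),
]

valid, rejected = [], []
for customer, raw_date in rows:
    parsed = parse_purchase_date(raw_date)
    if parsed is None:
        rejected.append((customer, raw_date))  # does not follow the model
    else:
        valid.append((customer, parsed))

# The March report only works on rows that followed the model.
march_purchases = [customer for customer, (month, year) in valid if month == 3]

print(march_purchases)
print(rejected)
```

Note that the row containing the word March is set aside along with Tuesday: without a rule at entry time, someone has to decide later whether to clean such rows up or drop them from the report.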
Think about when you upload a picture from your mobile phone. You could be taking a picture of anything and you can be anywhere. It could be terrific quality or just a grainy mess. It could be a large file or a small file. There's no structure that's included in the data that will help the database store the file. As part of the data science team, you need to combine this type of data with the method of collection. If you use a relational database, then you're going to be limited to mostly structured data. If you use a NoSQL cluster, then you'll be able to work with all the data types, but it will be more difficult to create reports. These are decisions that your team needs to think about. Remember that data science is about applying a scientific method to your data. The data will be the raw material that allows you to ask certain questions. As a team, decide what material you need, that way you can get the most interesting insights.
