From the course: Learning Data Science: Understanding the Basics

Share semistructured data

- Data science teams work with many different data types. Relational databases are often the best choice if you have structured data. Structured data needs a strict data model so that it fits into a predefined database schema. It's like a spreadsheet with fixed columns and rows. With structured data, it's usually pretty straightforward to create reports. You can use Structured Query Language, or SQL, to pull data from your database and display it in a standard format. When your structured data is nestled in your relational database, everything in the world seems organized. It's like having all your spices in their spice jars. You know where everything is, and you know exactly where to find it.
The problem is that few applications ever stay that simple. Let's go back to our running shoe website. Imagine you're using a relational database. You have four tables: one each for your shoes, your customers, their addresses, and your shipping options. All of your structured data fits into a data model. The dates are standard. The zip codes are standard. Things are running smoothly. Everything seems right in the world.
Then you get an email from your shipping carrier. The carrier says that you can dramatically lower your costs by adding information directly into their database. You just need to query their database, download one of the regional shipping codes, add it to the order, and create a new record. It should be easy because their database is like yours. It's all structured data, and it's all in a relational database. The problem is that their schema is not the same as your schema. You called your zip code data ZIPCode. They called theirs PostalCode. You don't care whether the shoes are shipped to a business or a residence. They do. You don't specify whether it's a house or an apartment. They have different rates for each.
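To make the schema mismatch concrete, here is a minimal Python sketch. It assumes a simplified, hypothetical addresses table on the shoe site and a hypothetical carrier record. Only the field names ZIPCode and PostalCode come from the example above. The table layout, the sample address, and the carrier's AddressType and DwellingType fields are made up for illustration.

```python
import sqlite3

# Our side: a simplified, hypothetical addresses table in a relational database.
conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # lets us read rows by column name
conn.execute(
    "CREATE TABLE addresses (customer_id INTEGER, street TEXT, ZIPCode TEXT)"
)
conn.execute("INSERT INTO addresses VALUES (?, ?, ?)", (1, "12 Elm St", "30301"))

# Pull one address with SQL and turn it into a plain dictionary.
row = conn.execute(
    "SELECT street, ZIPCode FROM addresses WHERE customer_id = ?", (1,)
).fetchone()
our_record = dict(row)

# Carrier side: a hypothetical record that follows the carrier's schema.
carrier_record = {
    "street": "12 Elm St",
    "PostalCode": "30301",        # same information as our ZIPCode
    "AddressType": "residence",   # business or residence (assumed field name)
    "DwellingType": "apartment",  # house or apartment (assumed field name)
}

# Compare the field names to see where the two schemas disagree.
only_ours = sorted(set(our_record) - set(carrier_record))
only_theirs = sorted(set(carrier_record) - set(our_record))
```

Both records describe the same address, but only the street field lines up by name. Everything else has to be translated before the two databases can exchange data.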
Now you need to find a way to exchange your structured data with their structured data, even though each has a different schema. To solve that, you need to download the carrier's data along with its schema. When they send you an address, it needs to include the field names from their data model. When a customer orders a shoe, your database will send the zip code to their database. Their database will send back a set of data that includes their version of the address, with their field names. Remember that they use the field name PostalCode for zip codes. That will be included in the new data.
The carrier data has some qualities of structured data. It will be well organized. It also has a standard format. The text fields will always be text. The date fields will always be dates. But the data will carry its own schema. The carrier can use whatever names they want. That's why this type of data is called semistructured data.
Semistructured data is even more common than structured data. It has structure, but that structure depends on the source. You work with semistructured data all the time. Your email is semistructured data. It has a pretty consistent structure. You always have a sender and a recipient, but the names and contents of your fields might vary. Data science teams will typically work with more semistructured data than structured data. Think of the volumes of email, weblogs, and social network content that can be analyzed.
There are a few common ways to work with semistructured data. One of them is XML. This is an older format that's used to exchange semistructured information. Then there's JSON, or JavaScript Object Notation, which is a newer way to exchange semistructured data. This is often the preferred format for web services. That means that your running shoe site is more likely to get JSON data back from the shipping carrier.
Including semistructured data is a good way to ask more interesting questions. Suppose we're interested in customer feedback. Are your running shoe customers happy with their orders? You may download semistructured data from some of the largest social media sites. Then you have to combine that semistructured data with the structured data that you have on your customers. If a customer isn't happy with their shoes, you can send them an apologetic coupon. These are the types of questions you can ask when you combine structured and semistructured data.
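As a rough sketch of that last step, the Python below combines structured customer records with semistructured JSON feedback. Everything in it is hypothetical: the customer records, the feedback posts, the field names such as customer_id and rating, and the rule that a rating of 2 or lower earns a coupon. Real social media data would look different, so treat this as an illustration of the idea and not as a recipe.

```python
import json

# Structured data: hypothetical customer records, as they might come
# from a relational table keyed by customer id.
customers = {
    101: {"name": "Ana", "email": "ana@example.com"},
    102: {"name": "Ben", "email": "ben@example.com"},
}

# Semistructured data: hypothetical JSON feedback. Every post has some
# structure, but the fields differ from post to post.
feedback_json = """
[
  {"customer_id": 101, "text": "Love my new shoes!", "rating": 5},
  {"customer_id": 102, "text": "Sole came apart.", "rating": 1, "tags": ["returns"]},
  {"customer_id": 103, "text": "They're fine."}
]
"""

def emails_for_coupons(customers, raw_feedback, max_rating=2):
    """Return emails of known customers whose rating is at or below max_rating."""
    emails = []
    for post in json.loads(raw_feedback):
        customer = customers.get(post.get("customer_id"))
        rating = post.get("rating")  # some posts carry no rating at all
        if customer is not None and rating is not None and rating <= max_rating:
            emails.append(customer["email"])
    return emails

coupon_emails = emails_for_coupons(customers, feedback_json)
```

The lookups use get because semistructured data does not promise that every field is present. A post without a rating, or from someone who is not in the customer table, is simply skipped.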
