From the course: Learning Data Science: Understanding the Basics

Get data into warehouses using ETL

From the course: Learning Data Science: Understanding the Basics

Get data into warehouses using ETL

- Many data science concepts build on previous work with relational databases. Companies have been capturing and analyzing data for decades. Even today, the relational database management system is the cornerstone of enterprise data. You need to understand our DBMS terms on your data science projects. One place you'll likely run into them is when you're focused on data warehouses. An Enterprise Data Warehouse or EDW is a special type of relational database that focuses on analyzing data. Traditional databases are optimized for online transaction processing or OLTP. A data warehouse is used for analytical processsing or OLAP. Think of it this way. The typical database is focused on working with data in real-time. A data warehouse is focused on analyzing what already happened. Imagine you have a website that sells running shoes. You've hired an engineer to create your database. They've created dozens of different tables and relationships. There's a table for customer address, a table for shoes, a table for shipping options and so on. Your web server would use a SQL statement to search the data. When a customer buys a shoe, it would join their shipping address to the shoe then it would give them their shipping options. You'd want this database to be fast and efficient. It's a transactional OLTP database. It should be optimized so that when a customer finds their shoe, joins it to their address and ships it to their house, you want your database to focus on speedy returns. Your customer wants to do this in real-time. You may ask your database engineer to create a script that uploads the data each day to your warehouse. Your data warehouse is optimized for analytical processing. It's an OLAP database focused on creating reports. You might have a data analyst create a report to see if there's any connection between a customer's address and the types of shoes they buy. You might find that people in warmer areas are more likely to buy brightly colored shoes then you'd take this information and update the way your website presents the data. You can change your website so that customers from warmer climates see lighter shoes at the top of the page. Now, say that your website has been very successful. It was bought by a company that sells sports clothing. That company has one warehouse for all their websites. They want to take the data from your website and combine it with all their other websites. The company will do something called ETL which stands for extract, transform and load. They'll pull the data from all their websites then they'll load the data into their Enterprise Data Warehouse. The company will extract the data from your website in a standard format then they'll transform the data into something that works well with their data warehouse. Their warehouse could have a different schema than yours. The data analyst will spend most their time scrubbing and joining the data so it will fit. Finally, they'll load the transform data into the data warehouse. A typical data science team will have similar challenges. They'll also very likely use the same language. One thing you might hear is we need to ETL the data in the warehouse so we can move it into the Hadoop cluster. What this means is that the data analysta need to transform your data so they can move it into your cluster. Another area where this may come up is that many organizations see Hadoop as a replacement for data warehouses. You see a lot of companies rewriting their ETLs so they can load data into a Hadoop cluster then they faze out or shut down their warehouse. Organizations that are hoping to save money by storing data on inexpensive hardware. Data warehouses are usually a costly appliance. In either case, you'll certainly run into these terms and concepts as part of your data science team. Try not to get overwhelmed by the language. Unfortunately, much of data science is still about gathering the data. You'll spend many hours in ETL meetings before you can ask any interesting questions. If you understand the terms and challenges then you're more likely to get the data that you need.

Contents