From the course: Learning Data Science: Understanding the Basics

Make connections with relational databases

From the course: Learning Data Science: Understanding the Basics

Make connections with relational databases

- Data scientists will work with data in many different forms. They take data from a legacy database or even old spreadsheets. They'll also work with pictures and videos. You should be familiar with common ways that organizations store their data. Most organizations have a wide range of ways to store their data. Some of them are very modern others less so. The best way to understand these different technologies is to start from the beginning. Even the most modern databases often build on technology which is over 50 years old. Modern databases really got started with the Apollo Space Mission in the late 1960s. NASA worked with IBM to create an information management system or IMS. The rockets that would go on to the moon required millions of parts, the space agency had these early manifests which looked very much like a modern spreadsheet. It was a computer file with a series of columns and a long list of rows. A table with a million rows is difficult to manage. Imagine looking at a million row spreadsheet on a small black and white terminal. IBM later commercialized the IMS that they created for NASA. In the mid-1970s they developed a Structured Query Language to help their customers pull data from the system. Right around the same time the first relational databases were being developed these databases divided the data into groups of tables. Each of these tables still look like a spreadsheet but contain a smaller chunk of the data. Then they created relationships between those tables. Instead of having one long list of a million parts they could create 50 tables with 20,000 parts. That's why these are called relational databases. The database is based on groups of tables that have a relationship with each other. Even the early database engineers struggled to learn an efficient way to group database tables. They created maps that showed how the tables related to each other, they called these maps schemas. Even with these early databases you can see how the engineers might have struggled with creating schemas. Should they create tables around the largest parts? Maybe make a table for just the thrusters, then another table for the fuel tank. The problem here is that if you change the design of the rocket then you have to change the design of the database. Maybe you could create tables based on the manufacturer of the part, the problem with that is that you might have a manufacturer that produces thousands of parts and another manufacturer that only produces a few dozen. This continues to be a challenge today. Relational databases require a good amount of upfront design, you have to know a lot about what your data will look like before you start collecting. If you're wrong then it takes a lot of effort to redesign your database. SQL is an elegant language that can pull data from several different relational tables. It can reconnect different tables and present the data in a virtual table called a view. SQL is so popular that it's still one of the most widely used query languages. If you search on LinkedIn then you'll see it's still one of the most popular skills. Over the years a lot of functionality has been added to relational databases. That gave rise to the relational database management system or RDBMS. Some of the largest software companies like IBM, Microsoft, and Oracle still support and develop relational database management systems.

Contents