Join Dan Sullivan for an in-depth discussion in this video Relational databases in data science, part of Advanced SQL for Data Scientists.
- [Instructor] Relational databases are the workhorses of business data management. They are used to collect, store, and manage transaction data like sales transactions, as well to analyze large volumes of data like those found in data warehouses. It is safe to say the bulk of data used in businesses and other organizations is stored in relational databases. This makes relational databases an important source for data scientists. Now, relational databases are more than just a storehouse for data.
They are data management systems that can be used to perform common transformation and analysis operations like linking data across data stores, filtering and reformatting data depending on how it needs to be used, aggregating data to provide kind of a big picture summary, as well as answering specific questions about business operations. Let's consider what goes into a production data science environment. To start, we have data sources that can include relational and nonrelational databases. Those nonrelational databases, by the way, are often called NoSQL databases.
Here, we'll refer to both of these as data stores. Application and web logs are another source of data for analysis. These are typically less structured than relational data and they may require significant filtering, formatting, and data extraction operations. Also, it's not uncommon to also need small manually curated data sets such as reference data that may be maintained in a spreadsheet. Now, we also have tools for extracting, transforming, and loading data. Data scientists might need data from multiple sources.
In these cases, a common practice is to collect data into a single repository prior to analysis. For example, an analyst might extract data from a transaction processing database, an application log, and a personally managed spreadsheet, and then store that all in a relational database. Now, if it's a really large amount of data, the data scientists may use a big data platform such as Hadoop or Spark. Transformations are operations performed on data to make the data more suitable for analysis. Now, we often need to reformat dates, replace missing values with zeroes, and strip out trailing blanks from strings.
Some analysis like building machine learning models can require additional transformations such as changing the scale of numeric values or using even more involved calculations to estimate missing values. Loading an ETL process occurs when data is moved into a data store for analysis. Now, this sounds simple, but large data sets can require specialized techniques like parallel data loads to ensure that the data is loaded in a reasonable period of time. Analysis and modeling tools are used by data scientists for a range of activities like identifying useful information, building predictive models, detecting anomalies, and to make any recommendations.
These tools read data from data stores and may write results back to databases as well. The final result, of course, is insight into business problems and relational databases play key roles throughout the data science process, that ultimately lead to those insights.
The course begins with a brief overview of SQL. Then the five major topics a data scientist should understand when working with relational databases: basic statistics in SQL, data preparation in SQL, advanced filtering and data aggregation, window functions, and preparing data for use with analytics tools.
- Data manipulation
- ANSI standards
- SQL and variations
- Statistical functions in SQL
- String, numeric, and regular expression functions in SQL
- Advanced filtering techniques
- Advanced aggregation techniques
- Windowing functions for working with ordered data sets