Join Dan Sullivan for an in-depth discussion in this video Introduction to DataFrames, part of Introduction to Spark SQL and DataFrames.
- [Instructor] In this course we'll be using Spark, which is a platform for distributed data processing. It's particularly well suited for dealing with very large data sets, such as data sets so large they don't readily fit within the memory or storage capacity of a single server. Spark has a modular architecture. There's the core platform, which is called Apache Spark Core, and then there are a number of modules that run on top of it. We're going to talk mostly about Spark SQL. In the last section of the course, we'll also look at a couple of machine learning libraries. Spark Streaming and GraphX are other modules in the Spark architecture, but they're outside the scope of this course.

Now, Spark supports multiple languages, including Scala, Java, Python, and R. We'll be using Python here. We are particularly interested in a data structure called the DataFrame. DataFrames are basically sets of data that are organized into columns. The columns have names, and the rows conform to a schema. In this way, they're very similar, or analogous, to tables in relational databases.

Here's an example of some data in a DataFrame. In this case, we have time series data, which means we have a date time associated with each row, along with certain measurements that were taken at that particular time. This time series data shows some basic performance monitoring data. For example, if you were monitoring a server, you might want to know its CPU utilization, the amount of free memory, and the number of sessions connected to that particular server. You'd also want a server ID. That's the kind of data we have in this example, and we'll be seeing more of this data in upcoming videos.

Now, another thing to keep in mind is that DataFrames have a specific structure. So again, like relational database tables, there's a formal structure. And here is the structure, or schema, for the time series data we just saw.
And as with database tables, we can have a mix of data types. In this case, we have doubles, which are a type of floating point. We have strings for the date time. And we have some longs, for example the server ID and the counts.
- Installing Spark and PySpark
- Setting up a Jupyter notebook
- Loading data into DataFrames
- Filtering, aggregating, and saving data
- Querying and modifying DataFrames with SQL
- Exploratory data analysis
- Basic machine learning