Join Dan Sullivan for an in-depth discussion in this video, Eliminating duplicates in DataFrames, part of Introduction to Spark SQL and DataFrames.
- [Instructor] When we're working with DataFrames, Spark provides some ways to de-duplicate data, so let's take a look at how to do that. The data files that we've been working with, the location and temperature data in our utilization files, don't have any duplicate data, so we'll take this as an opportunity to also look at how we can create small data sets to work with within our Jupyter Notebook session. The first thing I want to do is import some code that we'll need from the PySpark SQL package, so I'll specify from pyspark.sql import Row, and we have that. Now I'm going to create a DataFrame by entering data manually here in the notebook, and I'm going to call this DataFrame dup because it's going to have duplicate data in it. To do that, I specify sc, which stands for Spark Context. It's a global variable that gives us access to the Spark Context, and what I want to do …
- Installing Spark and PySpark
- Setting up a Jupyter notebook
- Loading data into DataFrames
- Filtering, aggregating, and saving data
- Querying and modifying DataFrames with SQL
- Exploratory data analysis
- Basic machine learning
Skill Level Intermediate
1. Introduction to Spark DataFrames
2. Installing Spark
3. Getting Started with Spark DataFrames
4. SQL for DataFrames
5. Data Analysis with Spark