From the course: Spark for Machine Learning & AI

Organizing data in DataFrames

- [Narrator] Before we start discussing MLlib, let's take a look at a commonly used data structure called DataFrames. Now I'll start Spark. So first I'll show you where I am. I'm in the Spark bin directory. So I will issue the pyspark command. And while that's starting, I just want to mention that DataFrames are a table-like data structure. They have named columns. DataFrames are used in R and in the Python pandas library. They're also used in Spark, and they're similar to what's available in both Python and R. Okay, looks like our PySpark interpreter is ready. Now I'm going to clear the screen by using Control-L. And that will give us a fresh screen to start with. This is a Mac and Linux command, but it does not work in Windows. The first thing I want to do is load a text file, and this text file is available in the exercise files, so if you have access to the exercise files, you can go ahead and follow along and load this file. And this is a file of employee data. So I'm going to…
