Join Dan Sullivan for an in-depth discussion in this video, Install Spark, part of Introduction to Spark SQL and DataFrames.
- Now let's install Apache Spark. Before installing Apache Spark, I just want to remind you that you need to have Python installed. If you don't have Python installed already, I suggest you go to www.anaconda.com/distribution and download the Anaconda Distribution for Python. I'm running on a Mac, so when I go to anaconda.com/distribution I'm shown a couple of options for installing Python on macOS. If you're installing on Windows or Linux you'll see something different. Regardless of which operating system you're using, choose the Python 3 option to download. Now I'm not going to go through the installation of Python, so assuming you have that installed, we need to go to the Apache Spark website. That website is spark.apache.org/downloads.html. Here you can choose a particular version of Spark. I'm going to choose the latest version, which is 2.4, and I'm going to choose the package that includes Apache Hadoop. The combination of those two choices generates a download link here; just click that link and it will bring you to an Apache Software Foundation download site, where you can choose one of the mirrors and start downloading. There, we have it downloaded now. So I'm going to show this in Finder, and you'll see that it's a compressed file. On the Mac, if I double-click, that will decompress the file. Now I have a directory called spark-2.4.0-bin-hadoop2.7. I'm going to rename that, and I'm going to simply call it Spark. I want to run Spark from my home directory, so I'm going to drag this folder into my home directory. That should be in my home directory now. Next I'm going to start a terminal window, so I'll just type terminal. I'll print the working directory, and I'm in my home directory, so I should be able to cd to Spark. Let's just do a listing. Okay, this looks correct, so we'll cd into the bin directory, and you'll see we have the Spark commands here. So we're in the right place.
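In case it helps to see the steps above as terminal commands instead of Finder actions, here is a minimal sketch. The file and folder names are assumptions based on choosing Spark 2.4.0 with Hadoop 2.7 and saving the archive to ~/Downloads; they will differ if you pick other versions on the download page.

```shell
# Post-download steps from the video, as terminal commands.
# Names below assume the Spark 2.4.0 / Hadoop 2.7 package in ~/Downloads.
cd "$HOME/Downloads"
tar xzf spark-2.4.0-bin-hadoop2.7.tgz        # decompress (double-clicking also works on macOS)
mv spark-2.4.0-bin-hadoop2.7 "$HOME/Spark"   # rename to Spark and move it to the home directory
cd "$HOME/Spark/bin"
ls                                           # the Spark commands: pyspark, spark-shell, spark-submit, ...
```

Either route, double-clicking in Finder or running tar, leaves you with the same directory layout.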
Now there's one other thing we want to do just to make it easier to use Spark, and that is to set up some environment variables. So I'm going to clear the screen. I'm going to navigate to my home directory using cd, and on the Mac I can just use the tilde as shorthand for my home directory. So I print my home directory. Now I'm going to use the nano editor to edit a file called .bash_profile. What you'll see here is a number of lines that are related to Spark. You will need to add these lines to your .bash_profile file. Basically, what we're doing is exporting environment variables: there's one environment variable for Spark home, and there are two different PySpark environment variables we want to set as well. I'll include these export commands in the exercise files so you can simply copy them into your .bash_profile. So we'll exit from the editor. Once you're done editing, you can apply your .bash_profile by typing source .bash_profile, which will execute the commands in the file. And that's all you need to do to install Spark.
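The exact export lines ship with the exercise files, but a typical setup looks like the sketch below. The SPARK_HOME path, the PATH addition, and the two PySpark variables are assumptions based on a common Spark-with-Jupyter configuration, so prefer the lines from the exercise files if they differ.

```shell
# Sketch of the Spark-related lines for ~/.bash_profile.
# Values are assumptions; copy the exact lines from the exercise files.
export SPARK_HOME="$HOME/Spark"                # where the renamed Spark folder lives
export PATH="$SPARK_HOME/bin:$PATH"            # puts pyspark, spark-shell, etc. on the PATH
export PYSPARK_DRIVER_PYTHON=jupyter           # launch PySpark inside Jupyter
export PYSPARK_DRIVER_PYTHON_OPTS='notebook'   # ...in notebook mode
```

After saving, run `source .bash_profile` (or open a new terminal window) so the variables take effect in your current session.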
- Installing Spark and PySpark
- Setting up a Jupyter notebook
- Loading data into DataFrames
- Filtering, aggregating, and saving data
- Querying and modifying DataFrames with SQL
- Exploratory data analysis
- Basic machine learning