Learn how to use the Spark shell.
- [Voiceover] One of the most important aspects of Spark is its use of Resilient Distributed Datasets, or RDDs, to achieve fault tolerance. Once created, an RDD can be transformed into another RDD, or you can take an action on an RDD. Let's create our first RDD from the README file stored in our Spark directory. You can see the README file in the /usr/local/spark directory.
So let's quit Spark for now and check out the README file. Type ls, and the README.md file is there. Let's start the Spark shell again. Let's call our first RDD textFile. Type val textFile = spark.read.textFile("README.md").
Press Enter. Looks like it worked. Now let's take some actions on the newly created RDD. Type textFile.first(). Press Enter. This action returns the first item in the dataset, which is # Apache Spark.
So that line appears first in the README.md file. Let's take another action. Type textFile.count(). Press Enter. This action counts the number of items in the dataset, which is our README.md file. There are 103 items, or lines, in the README.md file.
Now let's try a transformation. Our goal here is to transform our RDD into a dataset that only has the lines containing the word "Spark". Type linesWithSpark = textFile.filter(line => line.contains("Spark")).
I forgot the val statement. V-A-L. Press Enter. Now linesWithSpark is our new RDD. Let's take a count action on linesWithSpark by typing linesWithSpark.count().
Press Enter. As you can see here, there are 20 lines containing the word Spark in the README.md file. Here we've gone through a complete RDD lifecycle: creating it from a data file, taking an action on it, transforming it, and taking another action on the transformed dataset in Spark.
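The session above can be sketched end to end. Since it needs a running Spark shell, the sketch below illustrates the same first/count/filter logic with a plain Scala collection standing in for the dataset; the sample lines are invented for illustration and are not the actual README.md contents.

```scala
// Stand-in for the README.md dataset: a sequence of lines.
// (Sample lines are hypothetical, except the real first line of the file.)
val textFile: Seq[String] = Seq(
  "# Apache Spark",
  "",
  "Spark is a unified analytics engine for large-scale data processing.",
  "It provides high-level APIs."
)

// Action: return the first item in the dataset.
println(textFile.head) // # Apache Spark

// Action: count the items (lines) in the dataset.
println(textFile.length) // 4 in this sample; 103 for the real README.md

// Transformation: keep only the lines containing the word "Spark".
val linesWithSpark = textFile.filter(line => line.contains("Spark"))

// Action on the transformed dataset.
println(linesWithSpark.length) // 2 in this sample; 20 for the real README.md
```

In the actual shell, textFile is the dataset returned by spark.read.textFile("README.md"), and the same first(), count(), and filter() calls produce the results shown in the video.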
- Enabling technologies in data science
- Cloud computing and virtualization
- Installing and working with Proxmox, Hadoop, Spark, and Weka
- Managing virtual machines on Proxmox
- Distributed processing with Spark
- Fundamental applications of machine learning
- Distributed systems and distributed processing
- How Hadoop, Spark, and Weka can work together