From the course: Big Data Analytics with Hadoop and Apache Spark

Reading external files into Spark

- In this chapter, I will demonstrate the options available to ingest data into HDFS with Spark. We will be using the Zeppelin notebook titled Code_03_XX Data Ingestion with Spark and HDFS. Navigate to this notebook at sandbox-hdp:9995. On opening the notebook, you will find that Zeppelin is similar to Jupyter notebooks in many ways. We can create paragraphs, each with a different interpreter. The code can be executed by clicking the Run button, and the results display immediately below the paragraph. In this video, we will focus on reading external data into Spark. Spark provides connectors to a number of external data sources, including a local file, a file in HDFS, or even a Kafka topic. The first paragraph here tests whether Spark is successfully installed and running. The %spark2 in the first line indicates the interpreter to use. We can run this paragraph with the Run button, and the results show up at the bottom. We see that the current version of Spark is displayed correctly, so we are good to proceed with the other exercises. In the next paragraph, we read a CSV file. Since Spark is running under YARN in the sandbox, it uses HDFS as its disk. We will load the sales_orders.csv file that we uploaded earlier in the course into a DataFrame called rawSalesData. We set the header option to tell Spark to treat the first line of the file as the header. We also set inferSchema to true, so Spark will examine the first few lines of the file to infer the data type of each column. It will also use the header line to name the individual columns. We then print the schema of the DataFrame as well as the first five rows to make sure that the data is read correctly. Let's run this code now and review the results. We can see that the schema as well as the data shows up as desired. In the next few videos, I will show you several ways of partitioning this data and storing it in HDFS.
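A minimal sketch of that first sanity-check paragraph, assuming the %spark2 Scala interpreter with its built-in spark session, might look like this:

%spark2
// Confirm the Spark interpreter is wired up by printing the running Spark version.
println(s"Spark version: ${spark.version}")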

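And a minimal sketch of the CSV-reading paragraph, again under the %spark2 Scala interpreter; the HDFS path below is an assumption and should match wherever sales_orders.csv was uploaded earlier in the course:

%spark2
// Read the CSV from HDFS; the path is a placeholder, adjust as needed.
val rawSalesData = spark.read
  .option("header", "true")       // treat the first line as column names
  .option("inferSchema", "true")  // sample the file to infer column data types
  .csv("/user/zeppelin/sales_orders.csv")

rawSalesData.printSchema()  // verify the inferred column names and types
rawSalesData.show(5)        // preview the first five rows

Note that inferSchema requires an extra pass over the data, so for very large files it can be faster to declare the schema explicitly instead.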