Learn how to display data in a DataFrame and explore it using shell commands.
- [Narrator] Let's now explore some of the things we can do with DataFrames. First we're going to read in some data, then we'll inspect it a little bit, we'll do some aggregations, and lastly we'll perform a filter. Here in my Databricks environment I've loaded the 2.4 example from the exercise files. A lot of this is what we did in the previous clip, but I'll run through it again in case you didn't see that. First we need to find some data. We have a CSV file here under the Databricks datasets, online retail data-001/data.csv.
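A minimal sketch of that load step in PySpark might look like the following; the exact dataset path and options follow the narration, but they are assumptions, so adjust them to match your workspace.

```python
# Read the online retail CSV into a DataFrame (Databricks notebook, PySpark).
# The path follows the narration but may differ in your environment.
df = (spark.read
          .option("header", "true")       # treat the first row as column names
          .option("inferSchema", "true")  # let Spark infer the column data types
          .csv("/databricks-datasets/online_retail/data-001/data.csv"))
```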
We read that data in, we infer the schema, and we use the first row as the header row, which gives us a nice DataFrame to work with. Before I run this I need to attach the notebook to my cluster, and then hit the play button. Okay, so now we have our DF, our DataFrame, with the schema and everything loaded correctly. Let's take a look at that schema. If I hit play on this one you can see that we have a function here called printSchema, and it shows the data type of every column and whether or not it is nullable.
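For reference, the call looks like this; the column names in the comment are illustrative of the online retail dataset rather than copied from the video.

```python
df.printSchema()
# Typical output: one line per column with its inferred type and nullability, e.g.
# root
#  |-- InvoiceNo: string (nullable = true)
#  |-- Quantity: integer (nullable = true)
#  |-- UnitPrice: double (nullable = true)
```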
So this is a good way to look at your schema and make sure Spark didn't mess anything up when it loaded that CSV file and inferred the schema. Okay, if I just want to take a look at one column here, rather than the entire DataFrame, I can use the select function with .show at the end of it, and it will show me just that one column. Select is an operator similar to what we have in SQL: we pass in a list of columns from our DataFrame and it returns just those. If I want to remove duplicates and sort, I can run this operation here, as we did in a previous clip, where we select just the country, use the distinct function, which again is the same as in SQL, and then use an orderBy.
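A sketch of those two calls, assuming the dataset's country column is named Country:

```python
# Show a single column, then a de-duplicated, sorted list of countries.
df.select("Country").show()
df.select("Country").distinct().orderBy("Country").show()
```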
I hit play, run this, and we now have a distinct list of countries from our dataset. Okay, now let's create an aggregation. Here what I want to do is calculate the order totals, so I'm using display to see the results. I have my DataFrame there, then I do a select. There are many different ways you can reference columns, but here I give it the DataFrame name, DF, and then in brackets I put the column name I want: invoice no for invoice number, and unit price. Then I multiply unit price times quantity to get the order total. So for each invoice, I multiply unit price and quantity, then I group by invoice number and perform a sum operation.
So I'd have multiple results here, and I'm going to group all of those together so I have a distinct list, only one result per invoice number, with this calculation applied. I hit play, and you can see my results. Good to go. Now, if I want to filter this a little bit, let's take a look; this is essentially like doing a where clause in SQL. I have my DataFrame, so DF.filter, and then I specify that the invoice number equals an invoice number I'm interested in, because I want to take a look and just make sure that these actually added up.
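The aggregation and the follow-up filter might look roughly like this; the column names and the specific invoice number are placeholders, since the video doesn't spell them out exactly.

```python
# Order totals: unit price times quantity per line item, summed per invoice.
order_totals = (df.select(df["InvoiceNo"],
                          (df["UnitPrice"] * df["Quantity"]).alias("Total"))
                  .groupBy("InvoiceNo")
                  .sum("Total"))
display(order_totals)  # display() is a Databricks notebook helper

# Spot-check a single invoice, like a SQL WHERE clause.
# "536365" is a placeholder invoice number, not necessarily the one in the video.
display(df.filter(df["InvoiceNo"] == "536365"))
```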
So when I hit play here, what you'll see is essentially everything from the DataFrame, the full set of columns, for just that invoice number, and you can verify that in the first column there. Okay, so I have quantity times unit price, and if I were to multiply those out manually and then verify it up above, it should add up to $38.09. Alright, let's take a look at another type of filter, where I just want to see the top 10 products in the UK. What I have here is a little bit more advanced.
I select the country and the description, then I have a calculation here, my unit price times my quantity, and I give it an alias of total so I have a nice name I can use. Then I use a group by of country and description, so it's going to group by multiple fields and perform a sum operation, meaning that the third column, total, is actually going to be summed. Then I filter where country equals United Kingdom, and make sure you use the double equals here; it may trip some people up, especially if you're coming from a SQL background. In this context you need double equals, which actually tests whether one thing equals another.
If you only use one equals sign, you're actually assigning a value, which doesn't work in this context and may give you an error. I'm going to sort by the total, with ascending equals false so it goes in descending order, highest to lowest, and limit it to just 10 results. So there's a lot going on in that one little expression. Let me hit play. Our Spark job runs, and you can see our country, the description, and the total, and you can verify that yes, indeed, it is in descending order from highest to lowest. You can see Dotcom Postage is the highest-selling product in the UK for our data.
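Here is a sketch of that query; the column names follow the usual online retail schema, and the sum column name (sum(Total)) is what Spark produces by default, but verify both against your own DataFrame.

```python
from pyspark.sql.functions import col

# Top 10 products by total sales in the United Kingdom.
top_uk = (df.select(df["Country"], df["Description"],
                    (df["UnitPrice"] * df["Quantity"]).alias("Total"))
            .groupBy("Country", "Description")
            .sum("Total")
            .filter(col("Country") == "United Kingdom")  # == tests equality; = would assign
            .sort("sum(Total)", ascending=False)         # highest totals first
            .limit(10))
display(top_uk)
```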