See a Databricks notebook run a job on a managed Apache Spark cluster running on Azure.
- [Instructor] At this point in the course, I want to show you Databricks in action. Now, I've done a number of steps to get us set up, which I'll cover in subsequent parts of this course, but I think it's really interesting to see what the end result looks like. Here I am in the Microsoft Azure portal for Databricks, and I've already created an Apache Spark cluster on Databricks. I set that up by just clicking the blue "Create Cluster" button, and you can see here is the runtime version, which is 4.3: it includes Spark 2.3.1, Scala 2.11, and Python version 2.
And you can see that I have a Standard_DS3_v2 driver with 14 gigs of memory and four cores, and I have a number of workers that can scale from two to eight. It takes about five minutes for a cluster to spin up. To interact with this cluster, I've uploaded a notebook that I created for demonstrations, so let me go to that. I uploaded it by just going here, saying Import, and then importing the file. The first thing I did once I uploaded this notebook was attach it to the cluster: the cluster was available with the green dot, and I just clicked Attach.
So inside of here, we have a Databricks notebook, and you can think of this as a smart IDE, or an alternative to using terminal commands. It has a number of different sections. The first sections here are just Markdown. If I click in here, you can see this is just standard Markdown for documentation; the cell says it's a Databricks Quick Start, and then in here are instructions on how to set up a cluster, which I've already done. And again, you can see this is Markdown that just renders, and there's a little plus here if I wanted to add a new cell.
What I'm going to do is interact with my cluster. This is a SQL runtime notebook; you can see SQL up at the top here. I'm going to scroll down and use Spark SQL to create a table from a Databricks dataset. So I'm going to click inside of here and run this cell. Now, if I wanted to run all the runnable cells in the notebook, I would just click "Run All," but I just want to run this one. This is Spark SQL, so you can see the command is being sent to the cluster, and it says we should create a table from a CSV file that has a header.
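The cell in the quickstart notebook looks roughly like this. This is a sketch, not runnable outside a Spark environment, and the exact dataset path is my assumption based on the standard Databricks datasets mount:

```sql
-- Sketch of the quickstart cell: register a table over a CSV file.
-- The path below is an assumption (the standard /databricks-datasets mount).
DROP TABLE IF EXISTS diamonds;

CREATE TABLE diamonds
USING csv
OPTIONS (
  path "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
  header "true"
)
```

Because the `USING csv` table points at the file with `header "true"`, the first row of the CSV supplies the column names.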
You can see at the bottom here that we have information: it took two seconds, who the user was, what time it ran, and that it worked okay. And then here we have a view into the Spark logs. Again, this is managed Spark, so as we drill into this, we'll be working with these Spark logs, which let you see how long the various activities took, because oftentimes when you're working with this, you're optimizing distributed data workloads. That worked okay; now we're going to manipulate the data and display the results, and this is just good old SQL.
Select color and the average of price as price from diamonds, group by color, and order by color. We're going to click that, and that gives us not only a result but a visualization, which again is a great part of working with these notebooks: you have the documentation, you have the code runtime, and you have not only the data view like this but also the visualization view, and the visualizations are built in and native as part of the offering.
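The aggregation that query performs is easy to sketch outside of Spark. Here's a plain-Python equivalent over a few hypothetical rows; the sample values are made up for illustration, not taken from the diamonds dataset:

```python
from collections import defaultdict

# Hypothetical (color, price) rows standing in for the diamonds table.
rows = [
    ("D", 350.0), ("D", 450.0),
    ("E", 400.0), ("E", 600.0), ("E", 500.0),
    ("F", 700.0),
]

# GROUP BY color, AVG(price): bucket the prices by color, then average each bucket.
buckets = defaultdict(list)
for color, price in rows:
    buckets[color].append(price)

# ORDER BY color: iterate the group keys in sorted order.
result = {color: sum(prices) / len(prices)
          for color, prices in sorted(buckets.items())}
print(result)  # {'D': 400.0, 'E': 500.0, 'F': 700.0}
```

Spark does the same thing, except the rows are partitioned across the workers and the per-group sums and counts are combined at the end.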
If you click here, you can see there are various types of visualizations, and you have plot options, all without using any external libraries; it's just really useful when you're working with various types of data. If I were to scroll down here, you can see that you can also use the other runtimes that are available, and you'll remember from the beginning of this that I showed you the Python runtime was available. This notebook's default language is SQL, so the way you invoke another supported runtime is with the percent sign.
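For example, in a SQL-default notebook, a cell can switch language with a magic command on its first line. A minimal sketch (the print statement is just a placeholder):

```
%python
# This cell runs as Python even though the notebook's default language is SQL.
print("hello from Python")
```

Databricks also supports other magics along the same lines, such as %scala, %r, and %md for Markdown cells.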
Here we're doing the same thing, which is basically creating a DataFrame from a Databricks dataset. You can see we're using diamonds, and we're saying spark.read.csv, we have headers, and here we're inferring the schema. Now, before I run this command, I'll scroll back up here, and you can see that when we set up this table, we didn't say anything about the schema. Among other conveniences, there is a table browser here: if I click Data, you can see here is the diamonds table, and I have the schema and sample data.
So that's, again, another useful visualization. Now, if I go back to my notebook and go down to the Python representation, this is not loading the data into a table, but rather into a DataFrame. And a DataFrame, you might remember, is an abstraction on top of an RDD, which is the core unit of working with data in Spark. This is going to load it in memory. I'm going to go ahead and do that, and you'll see in the output here we have some Spark jobs loading this into our workers, and you can see we have a nice visualization again.
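That Python cell is likely along these lines. Again, this is a sketch: the dataset path is my assumption, and it only runs inside a Spark environment where a SparkSession named `spark` already exists, as it does in a Databricks notebook:

```python
# PySpark sketch: build a DataFrame from the CSV, inferring column types.
# Assumes a Databricks notebook, where `spark` (a SparkSession) is predefined,
# and the standard /databricks-datasets mount (path is an assumption).
diamonds = spark.read.csv(
    "/databricks-datasets/Rdatasets/data-001/csv/ggplot2/diamonds.csv",
    header=True,        # first row holds the column names
    inferSchema=True,   # sample the data to guess each column's type
)
```

Note that inferSchema triggers an extra pass over the data, which is why you see Spark jobs run even though nothing has been displayed yet.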
This is a common theme. Databricks has really put a lot of work into the visualizations in their notebooks, and although they're similar to open-source Jupyter notebooks, I really benefit from some of these enhanced visualizations, such as displaying the inferred schema. It's the thing you want to see, basically. Again, we have integration with the Spark jobs, so you can see here that we had a task take 0.3 seconds, we have the event timeline, which shows the various activities on our executors, and we have the DAG, or Directed Acyclic Graph, visualization.
Now, this is not the most exciting thing in the world, but as we work with more complex data operations, this will become super useful. Just to complete looking at this, in addition to loading the data, we can also display the results. Here we're using Python again, so it's similar to what we did with SQL: we're calling display on PySpark functions, selecting the color and price, grouping by the color, and aggregating the average price. You can see that we run that there, it runs Spark jobs, and again we get the numerical output, which we can quickly change to a graphical output as well.
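A sketch of what that cell does, assuming the `diamonds` DataFrame from the earlier cell; `display` is a Databricks notebook function, so this won't run outside that environment:

```python
# PySpark sketch: the same aggregation as the SQL version, expressed
# with DataFrame operations (assumes `diamonds` was loaded in a prior cell).
from pyspark.sql.functions import avg

display(diamonds.select("color", "price")
                .groupBy("color")
                .agg(avg("price")))
```

The chained select/groupBy/agg calls mirror the SELECT, GROUP BY, and AVG clauses of the SQL query one-to-one.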
So there's lots more for us to look at, but this gives you an introduction. A couple of other things I want to show you inside of here: you can schedule the execution of these notebooks. You can also have comments associated with cells, which is great for collaborative work. And over here, if you make revisions (let me just make some kind of revision in here), they'll show up. Then we can click here, and you can see the revision history, so you basically have an undo, and these notebooks can be checked into GitHub or whatever source control you're using.
This is a really useful and performant interface to working with your Databricks Spark clusters.