Understand Databricks cluster sizing options. Review defaults including worker configuration and auto-scaling.
- We're going to start looking at sizing workloads, digging in a little bit more to the Cluster Configuration. Now, as we're working with Azure Databricks for our Apache Spark implementation, Databricks is really a layer sitting around Spark. Spark itself, of course, is distributed compute, and it works with a driver and then n number of worker nodes. The driver hosts the SparkContext, there's a Cluster Manager, and each node will have, importantly, an executor, which is designed to hold all of the data in memory for very fast distributed compute.
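To make the driver/executor/task split concrete, here is a plain-Python analogy (not Spark itself, just an illustration): the main script plays the "driver", a worker pool stands in for the "executors", and each partition of the data becomes one "task".

```python
# Plain-Python analogy for the Spark execution model (illustrative only).
from concurrent.futures import ThreadPoolExecutor

def run_task(partition):
    # One "task": process a single partition of the data.
    return sum(x * x for x in partition)

def run_job(data, num_partitions=4):
    # The "driver" splits the data into partitions...
    size = len(data) // num_partitions
    partitions = [data[i:i + size] for i in range(0, len(data), size)]
    # ...hands one task per partition to the "executors"...
    with ThreadPoolExecutor(max_workers=num_partitions) as executors:
        partial = executors.map(run_task, partitions)
    # ...and combines the partial results.
    return sum(partial)

print(run_job(list(range(100))))  # prints 328350
```

In real Spark the partitions live in executor memory across the cluster, which is why node count and memory per node (the settings we configure below) matter so much.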
And then we'll be able to execute n number of tasks. So we're going to go ahead and take a look at the cluster interface for Azure Databricks. Here we are inside of the Azure Databricks portal, and I've clicked the Clusters button. You can see that I've been doing a number of experiments as I've been working with this course. From a high level, there are two types of clusters: there are Job Clusters, which we're going to cover in the next section, and there are Interactive Clusters. Interactive Clusters stay running either all the time, or they will automatically turn off if they're inactive; that is the default setting.
You'll notice there are different types of clusters and they're in different states. I have a couple of clusters here that I've turned off, I have a cluster that's running, and then I have a cluster with a different type of icon that's also running. You'll also notice they're running different runtimes, and you can see the number of nodes shown here, who set them up, which libraries are attached (this one is the variance Spark library), and how many notebooks are associated with the cluster.
Now if I wanted to create a new cluster, I would click "Create Cluster" and I would call it "New Demo," and I have two modes. Notice by default I can work with Standard; this is for single-user clusters running SQL, R, and Scala. But there's a new type, High Concurrency, which is designed to run concurrent SQL, Python, and R. It doesn't, as of this recording, support Scala, and it was previously called Serverless. Then, in the runtime version dropdown, this is where I select the runtime version of Spark and Scala.
And if I click the dropdown here, you can see the default is 4.3 as of this recording, and that's Spark 2.3.1; it's very important to pick the appropriate Spark version. Now you'll notice some of these choices have associated GPU resources, and again, your underlying Azure account would have to have GPUs set up. If I go ahead and select this one, you'll see that for this particular Azure account, it tells me the Spark version is incompatible with the selected driver, and if I try to select GPUs, they're not going to be available.
That's set at the level of Azure, because GPUs are more expensive. So again, Azure is the outside wrapper which is hosting the VMs in the environment for Databricks. That needs to be set up either by you, if you're the administrator, or by your administrator if you need GPUs for your particular workload. So you select your runtime version (I'll select this version here), you select your Python version, and you select your driver. Here's the default driver type, with the amount of memory, the cores, and the Databricks Units (DBUs), which are how you pay for this service, and then the workers.
Notice autoscaling is turned on by default, two to eight workers, and notice auto termination is turned on by default as well. You also have the ability to set up a Spark configuration, you can add tags, and you can configure logging and init scripts here. Once you click the create button, that will create a new cluster, and that takes from two to ten minutes depending on resource allocation, where you are, and so forth. Now if you click High Concurrency, you'll see that the list of runtime versions is smaller, you don't select GPUs here for your driver type, and you have a different allocation of memory and cores for the workers as well. And then notice here, I have a warning, and this is again associated with Azure, because I have many other clusters running.
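As a sketch of what goes into those boxes: the Spark Config section takes one "key value" pair per line. The keys below are real Spark settings, but the values and the tag names are examples only, not recommendations.

```python
# Illustrative Spark Config entries as you might type them into the
# cluster's Spark Config box. Values are examples, not tuning advice.
spark_conf = {
    "spark.sql.shuffle.partitions": "64",  # fewer shuffle partitions for small data
    "spark.serializer": "org.apache.spark.serializer.KryoSerializer",
}

# Custom tags propagate to the underlying Azure VMs for cost tracking.
custom_tags = {"team": "data-eng", "project": "demo"}  # hypothetical tag names

# In the UI these render as one entry per line:
for key, value in spark_conf.items():
    print(f"{key} {value}")
```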
This account may not have enough CPU cores to satisfy the request, and then you'd have to get more CPU quota. This will come up when you're running multiple workloads, either on multiple clusters or with both batch and streaming workloads running. So again, this is a first level of tuning: if you want scalability, you have to have the appropriate resources available in your Azure account to be able to scale. Once you've set your clusters up, you can see them over here; for example, here's my High Concurrency cluster, and you can see that I have no notebooks, no libraries, and so on.
I have an Event Log, a Spark UI, Driver Logs, and I can have associated apps in the Spark Cluster UI Master. These are the logs that give you visibility into the overhead of running workloads on this particular cluster, and these are the ones we'll be drilling into in subsequent movies. Notice you can edit the cluster, and you can clone it, restart it, terminate it, or delete it. In addition to working with this interface, for the premium SKU of Azure Databricks there is also an API, so as you move into production, pretty much anything you can do in this interface, you can also script through the Databricks API.
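As a rough sketch of that scripted path, the Databricks Clusters REST API exposes a `POST /api/2.0/clusters/create` endpoint that takes the same settings we just walked through in the UI. The workspace URL, token, and VM node type below are placeholders you'd replace with your own values.

```python
# Hedged sketch: creating a cluster via the Databricks Clusters REST API.
# Host, token, and node_type_id are placeholders, not real values.
import json
import urllib.request

DATABRICKS_HOST = "https://<your-workspace>.azuredatabricks.net"  # placeholder
TOKEN = "<personal-access-token>"  # placeholder

def build_cluster_spec():
    # Mirror the UI defaults shown above: autoscaling two to eight
    # workers, plus auto termination after a period of inactivity.
    return {
        "cluster_name": "New Demo",
        "spark_version": "4.3.x-scala2.11",   # runtime string, as of this recording
        "node_type_id": "Standard_DS3_v2",    # example Azure VM type
        "autoscale": {"min_workers": 2, "max_workers": 8},
        "autotermination_minutes": 120,
    }

def create_cluster(spec):
    # Sends the spec to the workspace; the response contains the new
    # cluster's ID, e.g. {"cluster_id": "..."}.
    req = urllib.request.Request(
        f"{DATABRICKS_HOST}/api/2.0/clusters/create",
        data=json.dumps(spec).encode(),
        headers={
            "Authorization": f"Bearer {TOKEN}",
            "Content-Type": "application/json",
        },
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

The same API family also covers edit, restart, terminate, and delete, matching the buttons in the cluster interface.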