Learn how to set up a Databricks job to run a Databricks notebook on a schedule. See how role-based permissions for jobs work.
- [Instructor] In this section, we're going to take a look at some of the advanced features of Azure Databricks, and I'm starting with the pricing page here. If I scroll down, we are working with Databricks for data analytics workloads, and we have been looking at the standard features. Now we're going to look at features available in the premium plan. You'll notice that those are role-based access control for notebooks, clusters, jobs, and tables; JDBC/ODBC endpoint authentication; RStudio integration; Delta (public preview); and audit logs, which as of this recording are marked as coming soon.
So we're first going to take a look at jobs. You'll notice that regular jobs are in the standard edition, but role-based access control is in the premium edition, so let's jump over to our cluster. The first place we see jobs in Databricks is in the type of cluster. You can see that we've been working to date with interactive clusters: those that stay running, and they can be set to terminate on inactivity, or they can be set to be long running. There is a second type of cluster, called a job cluster.
Now, for the purposes of testing, I set up some job clusters in advance, so your screen might look a little bit different. As we're working with jobs, we can set role-based access control because it has been set up in this premium demo account, and I'll show you what that looks like. I click over here on my user, go over to the admin console, and go into Access Control. You can see that workspace access control has been enabled for this account and, importantly, that cluster and job access control has been enabled.
So let's take a look at jobs per se, and then we'll look at access control as well. Over in Jobs, we have one job running right now, and this is just basically a quick test. If I click into it, you can see that I set up the run of a notebook, and here I can edit the parameters (I don't have any custom parameters), and I can edit the dependent libraries. Here is the cluster configuration, which I can also edit. I can edit the schedule; this one is set to run hourly. And importantly, the permissions are available in this section here, if they're enabled for the account.
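As a sketch of how those job permissions could also be set programmatically rather than through the UI, the Databricks Permissions REST API accepts a JSON body like the one below. The host, token, job ID, and user name are illustrative placeholders, and the endpoint path is assumed from the documented Permissions API, not something shown in this demo:

```python
# Sketch: granting a user "Can Manage Run" on a job via the
# Databricks Permissions API. Host, token, job ID, and user
# name are placeholders, not real values from this demo.

def job_permissions_payload(user_name, permission_level):
    """Build the request body for PUT /api/2.0/permissions/jobs/{job_id}."""
    return {
        "access_control_list": [
            {"user_name": user_name, "permission_level": permission_level}
        ]
    }

payload = job_permissions_payload("analyst@example.com", "CAN_MANAGE_RUN")

# The actual call, commented out so the sketch stays self-contained:
# import requests
# requests.put(
#     "https://<workspace-host>/api/2.0/permissions/jobs/123",
#     headers={"Authorization": "Bearer <personal-access-token>"},
#     json=payload,
# )
```

Building the body in a small helper like this keeps the permission grant easy to test before any network call is made.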
So I click Edit, and you can see that I have the ability to set permissions on this object: Is Owner, Can View, and Can Manage Run. And of course, this could be integrated with Azure Active Directory for an Azure subscription as well. I'll say Cancel, and here we can view the runs if we want to. Again, this is similar to what we were doing in the previous section, looking at the performance, but now we're moving to more of a production situation. So let's set up another job to show how this works. We click Create Job, and then we can select a notebook, a JAR file, or a spark-submit command, which is an alternative to using a notebook. We'll select a notebook, go into our demos, select our quick start again, and say OK. Then, on our schedule, we can edit the schedule, which is basically just generating a cron job, or we can just run it manually, which we'll do in a minute.
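Under the hood, the schedule editor is generating a cron expression in the job's settings. A minimal sketch of what that looks like in the Jobs API format, assuming a Quartz-style cron expression (the job name, notebook path, and timezone below are illustrative):

```python
# Sketch: an hourly job schedule as it appears in Jobs API job
# settings. Databricks uses Quartz cron syntax, which has a
# seconds field and a "?" day-of-week placeholder.

hourly_schedule = {
    "quartz_cron_expression": "0 0 * * * ?",  # fire at the top of every hour
    "timezone_id": "UTC",                     # illustrative choice
}

job_settings = {
    "name": "quick-start-hourly",  # hypothetical job name
    "notebook_task": {"notebook_path": "/demos/quick-start"},  # hypothetical path
    "schedule": hourly_schedule,
}
```

Note that Quartz cron differs from Unix cron: it has six (or seven) fields, starting with seconds, so `0 0 * * * ?` means second 0 of minute 0 of every hour.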
So we're going to say Cancel, change this to manual just so we can see what a job run looks like, and then go back to the Jobs page, click on the job, and run it. Now notice it turns green, and if we go back to our cluster page and refresh, you'll see that we have a new job run pending. What happens with jobs is that the compute resources are spun up just for the job: they live only for the duration of the run, and that's how they're different from interactive clusters.
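That distinction between job clusters and interactive clusters shows up directly in how a run is specified. A rough sketch, assuming the Jobs API run-submission format, where an inline `new_cluster` spec gets an ephemeral job cluster while `existing_cluster_id` reuses a long-lived interactive one (the Spark version, node type, and cluster ID are placeholders):

```python
# Sketch: two ways to target compute for a notebook run.
# A "new_cluster" spec is spun up for this run only and
# terminated when it finishes; "existing_cluster_id" points
# at an already-running interactive cluster.

job_cluster_run = {
    "run_name": "quick-start-on-job-cluster",
    "new_cluster": {                        # ephemeral job cluster
        "spark_version": "5.2.x-scala2.11",  # placeholder runtime version
        "node_type_id": "Standard_DS3_v2",   # placeholder Azure VM size
        "num_workers": 2,
    },
    "notebook_task": {"notebook_path": "/demos/quick-start"},
}

interactive_run = {
    "run_name": "quick-start-on-interactive-cluster",
    "existing_cluster_id": "0123-456789-abcde123",  # placeholder cluster ID
    "notebook_task": {"notebook_path": "/demos/quick-start"},
}
```

Ephemeral job clusters are usually the better fit for scheduled production work, since you only pay for compute while the job is actually running.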
So you can see this is pending: it's scheduled to be spun up, and once it is, then just like with an interactive cluster, while the job is running we can go ahead and look at the Spark UI, the logs, and so on. Now, in addition to using the web interface to work with jobs, more commonly my customers will move to the Databricks API at this point. Again, this is a premium feature: you would enable the API, and then you can script the running of jobs. This job will take a couple of minutes to run, and I'll come back when it's completed.
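To give a feel for scripting jobs rather than clicking through the UI, here is a minimal sketch of triggering an existing job with the Jobs API `run-now` endpoint. The workspace host, token, job ID, and notebook parameter are all placeholders; the request shape follows the documented API, not anything captured in this demo:

```python
# Sketch: triggering a job run via POST /api/2.0/jobs/run-now.
# Host, token, job ID, and parameters are illustrative placeholders.

def run_now_payload(job_id, notebook_params=None):
    """Build the request body for the jobs/run-now endpoint."""
    body = {"job_id": job_id}
    if notebook_params:
        # Passed through to the notebook as widget parameters
        body["notebook_params"] = notebook_params
    return body

payload = run_now_payload(42, {"input_path": "/mnt/demo"})

# The actual call, commented out so the sketch stays self-contained:
# import requests
# resp = requests.post(
#     "https://<workspace-host>/api/2.0/jobs/run-now",
#     headers={"Authorization": "Bearer <personal-access-token>"},
#     json=payload,
# )
# The response body contains the new run's ID, e.g. {"run_id": ...}
```

From there, a script can poll the runs endpoints for status, which is the usual pattern when wiring Databricks jobs into an external scheduler.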
Now we can see that this scheduled job run is running, and we have nine nodes. We can click on those and see that here is our quick start: it took six minutes and succeeded, and here's the job ID, the run ID, and other information about the size of the cluster and the parameters. We can also see in the Jobs interface that this succeeded, and if we click on it, we have access to the Spark UI and the logs if we want to drill in.
Again, you can see how nicely these interfaces are integrated into the Azure Databricks interface. We can see that this is the historical Spark UI, and it has all of our job ID information, the duration, the description, the resources, and so on.
- Business scenarios for Apache Spark
- Setting up a cluster
- Using Python, R, and Scala notebooks
- Scaling Azure Databricks workflows
- Data pipelines with Azure Databricks
- Machine learning architectures
- Using Azure Databricks for data warehousing