Run a demo notebook which uses a third party machine learning library for bioinformatics, variant-spark.
- [Instructor] As we continue looking at how to work with clusters, I'm going to pull an example from the real world. So this is some work I've done with a team out in Australia at CSIRO Bioinformatics, and the use case is processing huge amounts of genomic data. So here is an article that talks about the use case for our next example. Now the sample notebook that we're going to be working with demonstrates a number of capabilities available in Databricks.
So the first capability is the ability to use an external library. In this case it's a JAR file. So for this particular example the team at CSIRO wrote a custom algorithm, an implementation of the random forest machine learning algorithm specifically for their data, for genomic data. So it's a wide random forest and it's called VariantSpark. It sits on top of Spark. What's really important to understand is if you're going to work with an external library you need to go get that library and bring it into your cluster.
So I'm going to show you how that process works for this particular case. Now you notice that this notebook is written with Scala as the default runtime. So the description here talks about VariantSpark and what is the Bioinformatics use case. And rather than analyzing a real disease condition, 'cause this is medical data, and so there's privacy around it, what the team did kind of humorously to show researchers how this algorithm can be used, is they created what they call a synthetic phenotype.
Now in English that means a fake disease. The disease in command two here is being a hipster. Now even though it's a fake disease it does have a basis in genetics. So here are references to the papers. So they basically came up with four attributes that would define somebody as a hipster and then they created this diagnosis, if you will, this hipster index. So what their algorithm is going to do is in the associated data it's going to find samples that have these genomic variants, and then this is used as a validation of the algorithm.
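To make the idea of a synthetic phenotype concrete, here's a minimal sketch of how a label like the HipsterIndex could be constructed from a few variant attributes. The attribute names, scoring, and threshold below are hypothetical, not the team's actual index.

```python
# Illustrative only: deriving a synthetic "hipster" phenotype label from a
# handful of genomic variant attributes. Names and threshold are made up.

def hipster_label(variants):
    """Return 1 ("hipster") if enough of the marker variants are present."""
    markers = ["monobrow", "beard", "checked_shirt_preference", "coffee_consumption"]
    score = sum(1 for m in markers if variants.get(m, 0) > 0)
    return 1 if score >= 2 else 0

sample = {"monobrow": 1, "beard": 0, "checked_shirt_preference": 1, "coffee_consumption": 0}
print(hipster_label(sample))  # 1: two markers present
```

Because the label is constructed directly from known variants, the algorithm's output can be checked against a known right answer, which is the whole point of the exercise.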
As it says, we demonstrate the usage of VariantSpark to reverse-engineer the association of the selected SNPs, those are the data points, the variants, to the phenotype of interest, or being a hipster. Now that aside, sort of more generally, what's interesting about this notebook is that it uses this external library. Now in order to work with this we had to use a specific version of Spark because VariantSpark is compiled on Spark 2.2. So if I scroll down you can see here's the cluster setup instructions.
Now I set the cluster up in advance because it does take a couple of minutes, but I went over to Clusters. I created a cluster for this particular use case, and you can see that the runtime here is a different runtime. So if I click Create Cluster, you can see that the default runtime is Spark 2.3. And we have support already up for Spark 2.4. So this is an important point when you're creating your cluster. You need to figure out which version of Spark your particular team needs to support. In my case, for the VariantSpark workload it's 2.2.
So once I did that, the second thing I had to do is associate this external library. Now the way this works in Azure Databricks is, in the workspace you go to your particular location and you say Create Library. And then you have a number of places that you can get your library files from, depending on the source programming language they're written in. So you can see you can upload JAR files, Python Eggs, or Python Wheel files into DBFS, the file system, or you can retrieve libraries of these types that are already stored at a location in the file system.
You can use the PyPI package repository to get Python packages. In our case we used the Maven repository because that's where you would find JAR files, so Java and Scala libraries. And CRAN of course is for R. So in our particular case we had to get the coordinates. And then here are the coordinates, and this is the Maven repository. And we simply had to take these coordinates, paste them into this location, and then this was uploaded as a set of libraries.
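For context, Maven coordinates follow a standard `groupId:artifactId:version` pattern. The line below shows only the shape of such a coordinate; the values are illustrative placeholders, not VariantSpark's actual coordinates (those are shown on screen in the video).

```
groupId:artifactId:version
au.csiro.example:variant-spark_2.11:0.2.0    <- illustrative placeholder, not the real coordinates
```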
Now I'll go ahead and show you what this looks like. You can see once you have the library uploaded, and it does take a couple minutes because we have dependencies here. Then what you do is you select the cluster that you want this library to be associated with. In our case it was the variant-spark-demo cluster that was set up on Spark 2.2. And then you're set up to run your notebook having properly set up the dependencies. Now I've gone ahead and I've run this notebook already because it does take a couple minutes to run 'cause it does work with quite a lot of data, even in the sample.
So the way that this works is once you set the cluster up then you load the data, and they're using Python here. And we're loading not csv, but genomic-specific vcf files, which have been compressed using bz2. And so we've loaded our data and then we're loading our variants. So this is implementing the algorithm. And we have a deprecation warning here. This is another useful feature of working with Databricks notebooks. This'll still run, but it surfaces a recommendation to update some of the interface code, because the Databricks team updates the code as they update the underlying Spark versions.
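As a minimal sketch of what that loading step involves at the file level, here's how a bz2-compressed, VCF-style file can be read with the Python standard library. The fragment below is a tiny made-up example, not real genomic data, and real pipelines would use a proper VCF parser rather than splitting lines by hand.

```python
import bz2

# Minimal sketch: reading a bz2-compressed, VCF-style fragment with the
# standard library. The content is a made-up example, not real genomic data.
raw = b"##fileformat=VCFv4.2\n#CHROM\tPOS\tID\tREF\tALT\n2\t109511398\trs123\tA\tG\n"
compressed = bz2.compress(raw)

# Decompress, drop header lines (which start with '#'), split data records.
lines = bz2.decompress(compressed).decode().splitlines()
records = [l.split("\t") for l in lines if not l.startswith("#")]
print(records)  # [['2', '109511398', 'rs123', 'A', 'G']]
```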
So what this does is this uses Scala to import the VSContext, which is a wrapper around the SparkContext, and the ImportanceAnalysis, which is a wrapper around the wide random forest. And then we're passing in the Spark instance to the VSContext. Then we're passing in the feature file. And these are the samples prelabeled as hipsters. And we're looking at the output here, HG0096 and so on and so forth. Those are the genomic samples that would be coded as hipsters.
So in other words we have the right answer that we're going to verify against when we run our custom machine learning algorithm in this particular case. Now we're going to load our labels. So here's our labels, 0, 1, 0, 1. So this is telling us whether or not somebody's a hipster. And then we're going to configure our analysis. So we're going to pass to our ImportanceAnalysis the featureSource, the labelSource, and then importantly the number of trees. And so this is a random forest algorithm. So this is how many trees we're going to be using.
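To give a feel for what a variable-importance analysis produces, here's a tiny self-contained analog. This is emphatically not VariantSpark's wide random forest; it just ranks features by a crude proxy (how far apart their class means are) to illustrate the shape of the input, the 0/1 labels, and the ranked output.

```python
# Rough, self-contained analog of a variable-importance analysis. NOT
# VariantSpark's algorithm; it only illustrates ranking features by how
# strongly they separate the 0/1 labels.

def importance(features, labels):
    """features: list of per-sample feature dicts; labels: parallel list of 0/1."""
    scores = {}
    for name in features[0]:
        pos = [f[name] for f, y in zip(features, labels) if y == 1]
        neg = [f[name] for f, y in zip(features, labels) if y == 0]
        # Importance proxy: absolute difference of the class means.
        scores[name] = abs(sum(pos) / len(pos) - sum(neg) / len(neg))
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)

feats = [{"v1": 1, "v2": 0}, {"v1": 1, "v2": 1}, {"v1": 0, "v2": 1}, {"v1": 0, "v2": 0}]
labs = [1, 1, 0, 0]
print(importance(feats, labs))  # v1 separates the classes perfectly, v2 not at all
```

In the real notebook the number of trees is a key knob: more trees generally stabilizes the importance estimates at the cost of more distributed compute.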
Then we're going to run our analysis. And this is sort of the heavy weight lifting inside of this particular workload, because what we're doing is we're distributing this across our worker nodes. Now for this demonstration example we have a small number of worker nodes. We have three worker nodes. But, again, just to give context to this, when I run this in production with partial or full GWAS or genome-wide samples we can have 50, 100, 200 worker nodes. And that is again why you use Spark because of the sizes of data you're going to compute against.
Then this is showing you which variables are of most importance. And again, you can use the great graphing tools inside of here to quickly graph that, either as lines or as a bar chart, or any other way that's meaningful to you. And this tells a little bit of detail about the algorithm itself. Then we're showcasing some of the other visualization methods in this notebook. You might remember these from some of the earlier videos, but here we're actually implementing this in a production use case: we're using SQL, we're loading this information into a SQL table, and then we can easily visualize it with the built-in visualization tools.
Now if we want to use custom tools, as some of the people on the team did, some people were more comfortable with matplotlib, so they were using Python here, and basically producing a similar type of plot. These are the different characteristics that define someone as a hipster, and here they're plotting them. Again, you can see them across the bottom, using Python plotting tools. You might remember from an earlier video you can also use R. And indeed they have a person on their team who's more comfortable using R. So again, this is very reflective of the real world.
So that person wanted to load a Spark R package and wanted to take those same variables, and they wanted to use ggplot. And here they're taking the same information, these are the variants of interest, and they decided to flip the axes horizontally and vertically. The team has a sense of humor so they created an infographic showing the hipster characteristics. So you can actually do this if you sequence yourself genomically and you can run your results through and see your proclivity towards hipsterdom by having variants that reflect monobrow, fabulous hair or beard, characteristic of your retina that makes you prefer checked shirts, and coffee consumption.
Then, as is commonly done in a lot of machine learning experiments, the team wanted to compare the results of VariantSpark, the wide random forest, with another popular tool that's specific to their domain, and that's Hail. And Hail does logistic regression. So the method of doing the machine learning analysis is different here. And so they go through, they load data again, and then they get the results. And then they do a plot, and they're using R here.
And this is the result of the plot. These are the variants. And the variants are shown differently. VariantSpark results are in the lighter color, the salmon color, and Hail results are shown in the green color. So you can see, although there is overlap of the results, for example in this variant, Hail produces more results. So the point of this particular experiment is to distribute a high volume of compute across multiple machine learning libraries and to visualize the output so that they can understand the usefulness of the various machine learning libraries.
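The comparison in that plot boils down to set overlap between the variants each method flags. Here's a small sketch of that idea; the variant IDs below are made up for illustration, not results from the notebook.

```python
# Sketch of comparing variants flagged by two different methods (e.g., a wide
# random forest vs. logistic regression). The variant IDs are made up.

variantspark_hits = {"2_109511398", "2_109511454", "8_10467"}
hail_hits = {"2_109511398", "8_10467", "11_5602", "17_4432"}

overlap = variantspark_hits & hail_hits      # flagged by both methods
only_hail = hail_hits - variantspark_hits    # extra hits from logistic regression
print(sorted(overlap), sorted(only_hail))
```

In the real comparison, Hail flagging more variants than VariantSpark (with partial overlap) is exactly the pattern the plot shows.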
Again, it really showcases the power of both Spark and Databricks pulling together in this demonstration notebook.