Review an AWS ML architecture using a custom machine learning algorithm, VariantSpark, to create a genomic-scale data processing application.
- For the last use case that I want to discuss in this course, we need a little bit more background. This is one for which a custom machine learning algorithm is required. Here we're talking about genomic-scale data. So this is an algorithm that was built by the team at CSIRO Bioinformatics in Sydney, Australia. It's called VariantSpark. It's an implementation of a custom algorithm that's designed to process the amount of data that we get from sequencing human genomes.
In case you're not aware, each sequenced genome provides three billion, with a B, data points. So the idea is that by getting this information and comparing the differences to reference genomes, we can provide information that can be used in the creation of personalized treatments for medical conditions. What kind of medical conditions? Conditions like cancer. Again, I know this is a little bit of an aside, but I think it's very important for us to consider the possibilities that we have around the usages of machine learning and big data, and to me, this is the most important: to reduce the amount of suffering from cancer in the world.
So I encourage you, if you're interested in this domain, to get involved. I know I have been; I've started to do some work with the team at CSIRO. There's a really pressing need in the industry for more people who have technical skills in all aspects of software product life cycles, particularly in machine learning, to work with our bioinformatics professionals so that we can take the data that we're getting off the sequencers and put it to good use. Because the technical implementation of VariantSpark is important to the architecture that we select, let's take a minute or two and talk about how the algorithm was implemented.
It's an implementation of random forests, which are sets of decision trees. The key difference is that the input data is too large: it's both too wide in terms of the number of features or columns, because remember the three billion data points, and also too deep, because we need a large population to compare against. So what the team did is implement a custom random forest, which they call a wide random forest, where they split the data both vertically and horizontally and then recombine it, so that the classification, as you can see at the bottom here, is generated from a consensus over all the trees, and then the association is produced.
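To make the split-and-recombine idea concrete, here is a minimal, hypothetical Python sketch. This is not VariantSpark's actual code, which is a distributed Spark implementation; deterministic row chunks and feature groups stand in for its randomized horizontal and vertical splits, each data slice trains a tiny one-feature tree (a decision stump), and the final classification is a consensus, a majority vote, over all the trees. All names and the toy data are illustrative assumptions.

```python
from itertools import product

def train_stump(rows, labels, feature_ids):
    """Fit the best single-feature threshold classifier on one data slice."""
    best = None  # (accuracy, feature, threshold)
    for f in feature_ids:
        for t in sorted({r[f] for r in rows}):
            preds = [1 if r[f] >= t else 0 for r in rows]
            acc = sum(p == y for p, y in zip(preds, labels)) / len(labels)
            if best is None or acc > best[0]:
                best = (acc, f, t)
    _, f, t = best
    return lambda row: 1 if row[f] >= t else 0

def wide_forest(X, y, n_row_chunks=2, n_feat_groups=2):
    """Split the data horizontally (row chunks) and vertically (feature
    groups), train one small tree per combination, and classify by
    majority vote (consensus) over all the trees."""
    n, m = len(X), len(X[0])
    row_chunks = [range(i, n, n_row_chunks) for i in range(n_row_chunks)]
    feat_groups = [range(j, m, n_feat_groups) for j in range(n_feat_groups)]
    trees = []
    for chunk, feats in product(row_chunks, feat_groups):
        rows = [X[i] for i in chunk]
        labels = [y[i] for i in chunk]
        trees.append(train_stump(rows, labels, list(feats)))
    def predict(row):
        votes = sum(tree(row) for tree in trees)  # consensus over all trees
        return 1 if votes * 2 > len(trees) else 0
    return predict

# Toy "genotype" matrix: 6 samples x 4 variant features (illustrative only).
y = [0, 1, 1, 0, 0, 1]
X = [[label] * 4 for label in y]
predict = wide_forest(X, y)
```

In the real system, each of these per-slice trees would be trained in parallel on Spark workers, which is what makes the approach tractable at three billion features.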
The association means which genomic variants seem to be of greatest interest. And the reason this is so important is that optimizing a run of this algorithm gives researchers working with CRISPR-Cas9 and technologies like it faster and more useful feedback, so that they can work to find solutions around personalized medicine. So let's tie this back to technology implementation.
This would be a use case for an EMR cluster, and the reason for this is the amount of data coming through. Genomic-scale data warrants using EMR clusters so that teams can take advantage of Spot pricing, because they need to be able to run very large inputs in a very economical way. So you can see in this architectural diagram there are two different clients: one client is using a Jupyter notebook, the other one is using the command line.
Once they issue their commands, the analysis runs. The data persistence is in S3 buckets, and the compute is being run on a Hadoop cluster with VariantSpark running on the worker nodes.
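As a hedged sketch of what this architecture looks like in configuration terms, the dictionary below has the shape of an EMR cluster definition where the worker (core) nodes are requested from the Spot market, as described above, while the master node stays on On-Demand capacity. The bucket name, instance types, counts, and release label are illustrative assumptions, not values from the course; in practice a request like this would be submitted with boto3's EMR client.

```python
# Illustrative only: an EMR cluster definition of the kind described above,
# with core (worker) nodes on cheaper, interruptible Spot capacity to keep
# large VariantSpark runs economical. All names/values are assumptions.
cluster_request = {
    "Name": "variantspark-analysis",
    "ReleaseLabel": "emr-6.15.0",              # assumed EMR release
    "Applications": [{"Name": "Spark"}],       # VariantSpark runs on Spark
    "LogUri": "s3://example-bucket/emr-logs/", # hypothetical S3 bucket
    "Instances": {
        "InstanceGroups": [
            {   # master node on stable On-Demand capacity
                "Name": "master",
                "InstanceRole": "MASTER",
                "Market": "ON_DEMAND",
                "InstanceType": "m5.xlarge",
                "InstanceCount": 1,
            },
            {   # worker nodes on the Spot market
                "Name": "core-spot",
                "InstanceRole": "CORE",
                "Market": "SPOT",
                "InstanceType": "r5.2xlarge",
                "InstanceCount": 8,
            },
        ],
        "KeepJobFlowAliveWhenNoSteps": False,  # tear down after the run
    },
    "JobFlowRole": "EMR_EC2_DefaultRole",
    "ServiceRole": "EMR_DefaultRole",
}

# With boto3, this would be submitted as:
#   boto3.client("emr").run_job_flow(**cluster_request)
```

Keeping only the workers on Spot is the usual trade-off here: interrupted worker tasks can be rescheduled, while losing the master would end the whole job.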