In this video, explore a scenario about variant processing pipelines using Google Genomics.
- [Instructor] Some of the largest-volume and most complex data pipelines I've ever worked with are around bioinformatics or genomics use cases. Now, this architecture is a little bit dated for many of my customers. The aspect I want to focus on is the processing aspect, and that's the elastic cluster, which is an HPC cluster running on Compute Engine. This is a lift-and-shift scenario where an on-premises cluster was simply moved up onto Google Cloud, and it reflects that type of thinking: you move the virtual machines, and administrators are still responsible for setting up, managing, monitoring, and sizing the cluster, and so on. Although that scenario does have usefulness for some of my customers, it's really interesting to see how Google has taken some of the services we've just been looking at and added extensions to them so they can address this use case in a more cloud-native way. If you were thinking, "Well, wouldn't it make sense to use a serverless type of approach, such as we just looked at with Dataproc or Dataflow or Composer, rather than having a serverful, or server-based, HPC cluster?" then you're thinking the way Google is thinking.

Initially, Google's architectures did use Beam, or Cloud Dataflow, to replace the HPC cluster. What's been exciting to see is how Google has partnered with the research community and added APIs on top of Cloud Dataflow that are specialized for the type of workflow that is genomics. In fact, the API was originally called the Genomics API and has since been renamed the Pipelines API. The idea is that this API provides a higher level of abstraction for working with the batch-style, computationally complex processing workloads that are part of genomics, but also part of other verticals these days. So, as of this recording, the Genomics API has been renamed the Pipelines API, and the genomic variation of it is called the Variant Transforms tool. Notice in the documentation that it's an open-source tool used with Cloud Genomics. It's based on Apache Beam and uses Cloud Dataflow, so you can think of it as an abstraction layer that sits on top of Dataflow to make it easier for researchers to construct pipelines, so they can ask questions of the massive amounts of genomic data they're working with. You can use the tool to transform and load hundreds of thousands of files, millions of samples, and billions of records in a scalable manner. It also includes a preprocessor, which you can use to validate your VCF files and identify inconsistencies before loading. This is really important, because previously researchers had to resubmit jobs that had data errors in them, and that really slowed down their work.

So, the typical workflow comprises the following: store raw VCF files in Cloud Storage, then use the Variant Transforms tool to load the VCF files from Cloud Storage into BigQuery; I'll sketch example commands just below. Now, if you want to try this out on a free-tier GCP account, you'll have to request a quota increase, because even the quickstart, which uses the Cloud Genomics Pipelines API to create an index file, or BAI file, from a large binary file containing DNA sequences, or BAM file, requires more CPU quota and more disk quota than the defaults, even for that small example. When I tell you that these workloads are huge, I really do mean huge.
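To make that workflow concrete, here is a hedged sketch of the preprocess-then-load steps, paraphrased from the open-source gcp-variant-transforms README around the time of this recording. The module names and flags may have changed since, and "my-project", "my-bucket", and "genomics_dataset" are placeholders.

```bash
# Sketch only: module names and flags paraphrased from the
# gcp-variant-transforms README; placeholders throughout.

# 1) Preprocess: validate the raw VCF files and write a report of any
#    inconsistencies (for example, conflicting headers) before loading.
python -m gcp_variant_transforms.vcf_to_bq_preprocess \
  --input_pattern "gs://my-bucket/vcf/*.vcf" \
  --report_path "gs://my-bucket/reports/report.tsv"

# 2) Load: transform the validated VCFs and load them into BigQuery.
#    This launches an Apache Beam pipeline on Cloud Dataflow.
python -m gcp_variant_transforms.vcf_to_bq \
  --input_pattern "gs://my-bucket/vcf/*.vcf" \
  --output_table "my-project:genomics_dataset.variants" \
  --project "my-project" \
  --temp_location "gs://my-bucket/temp" \
  --job_name "vcf-to-bigquery" \
  --runner DataflowRunner
```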
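Once the variants are loaded, "asking questions" of the data is just SQL in BigQuery. A minimal illustration, assuming the placeholder table above and the reference_name column that the tool's default schema produces:

```bash
# Count variant records per chromosome across all loaded samples.
bq query --use_legacy_sql=false '
  SELECT reference_name, COUNT(*) AS variant_count
  FROM `my-project.genomics_dataset.variants`
  GROUP BY reference_name
  ORDER BY variant_count DESC'
```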
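And for reference, the quickstart command we're about to look at runs roughly like this. This is a sketch of the alpha-stage command, so the exact flags may have changed, and the output bucket is a placeholder:

```bash
# Index a public BAM file with samtools, producing a BAI index file.
# The Pipelines API pulls a samtools Docker image, reads the BAM from a
# Google-hosted public bucket, and writes the index to your own bucket.
gcloud alpha genomics pipelines run \
  --regions us-east1 \
  --docker-image "gcr.io/genomics-tools/samtools" \
  --command-line 'samtools index ${BAM} ${BAI}' \
  --inputs BAM=gs://genomics-public-data/NA12878.chr20.sample.bam \
  --outputs BAI=gs://my-bucket/NA12878.chr20.sample.bam.bai
```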
So, to take a look at what this would look like: if we go to step three here, you can see it's still in alpha as of this recording, so the API will probably mature as the industry gives feedback to Google. We run gcloud alpha genomics pipelines run in the east region, and here we're running samtools, which is a bioinformatics tool, from a Docker image of samtools, and we're getting the data out of a public bucket. Again, this is another trend in big data pipelines that it's important to pay attention to, and genomics is kind of leading the way here: Google, along with other cloud vendors, is increasingly hosting the reference data on its cloud so that you can get to work faster, because you don't have to pull in the reference data yourself. Then it shows the output. Basically, it's a single command that creates a series of Beam jobs so that this data can be processed in a time- and cost-sensitive manner. It's really changing the way genomic researchers do their work, and I'm excited to be a part of it.

In addition to the extensions Google has made to Beam with the Genomics API, they've got more teams working on this problem. At the cutting edge is something called DeepVariant: highly accurate genomes with deep neural networks. The idea here is to apply machine learning using distributed compute. It really is some of the most complex work I've seen out there, and it's exciting that it's being applied to human health. You can see that this article talks about the open-source release of DeepVariant, which is "a deep learning technology that is designed to reconstruct the true genome sequence from the HTS sequencer data with significantly greater accuracy than previous classical methods." This work is the product of more than two years of research by Google Brain in combination with Verily Life Sciences, and it uses deep neural networks along with distributed compute to get the most accurate representation of human genomic information. This diagram gives you a partial window into the complexity of this data pipeline.

This type of problem space is some of the most exciting, most interesting, and most compelling that I've ever had the opportunity to work with. It's really interesting to see how machine learning and enterprise data pipelines are starting to converge around human genomics.