In this video, learn about the Cloud Genomics Pipelines API and how to work with BigQuery for genomics.
- [Narrator] As much as I love getting a complex data pipeline working properly, what I love even more is building an even simpler pipeline. Simple, so long as it's functional, is almost always better. And there's an elegance to it that is so important when solving these types of problems. Google's vision for genomics is expressed in this diagram, and it's quite beautiful. The idea is that three serverless services will facilitate the research work. The first is Google Cloud Storage, so a true data lake architecture. The processing will be handled by Google Genomics, which is a layer of abstractions sitting on top of the distributed compute delivered through Beam and other services. And the most fascinating aspect to me is the vision for querying. It's via my absolute favorite GCP service, BigQuery. I like to say that SQL is the most pervasive programming language. Some people would argue that SQL isn't a programming language, but I would argue that here we're using it to solve one of the most important problems. Let's take a look.
Within the past 12 months, Google has made enhancements to BigQuery to support usage in this type of pipelining. The first enhancement was the variants schema. As it says here, the Variant Transforms pipeline provides the ability to transform and load VCF files from Genomics directly into BigQuery. You can use BigQuery to run ad hoc interactive queries over genomic variants using hundreds or thousands of computers in parallel. And you can also browse the published reference data sets already exported from Cloud Genomics to BigQuery, or publicly available data. Now, you can see in the workflow on Cloud Genomics, on the left side here, we have analyzing variants, and the preferred tool is BigQuery. Why I think this is so exciting is that it truly offers the promise of cloud native data pipelining for, as I said, what I believe to be one of the most important computational problems we're working on.
And that's personalized medicine through genomic research. And you can actually try it out now. I was working with this advanced guide to analyzing variants using BigQuery. Notice the data in the tutorial comes from the Illumina Platinum Genomes project. It was loaded into a BigQuery table that uses the BigQuery variants schema. This is publicly available data. Now, do be aware that you can incur some query cost charges. So you're going to want to look at the amount of data scanned and see if this is in your budget for learning.
Now, what I did is I saved some of the queries from this guide, just to make it quicker. The first one counts the rows in the genomics table. I've run these queries previously as well. So I'm going to click run, and you can see that I have a hundred and eighty-two million calls in this particular table. And I'm just going to skip to a more complex query, counting variants called by each sample. I'll let you take a look at the query. What I love about this is the familiar SQL syntax. Notice it tells you how much data will be processed, and then you can run it. And then you can see the results nearly instantaneously. This type of near-interactive query ability over massive amounts of genomic data with serverless GCP services really embodies what cloud native application architectures enable.
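If you'd like to try the two queries from the demo outside the console, here is a minimal Python sketch of what they might look like. Treat it as an illustration under assumptions, not the exact text of the guide: the table name is a hypothetical placeholder you would replace with the public Platinum Genomes table named in the tutorial, and the `call` and `name` columns are assumed from the BigQuery variants schema. The optional `estimate_bytes` helper uses a dry run from the `google-cloud-bigquery` client library, so you can check how much data a query would scan before paying for it.

```python
# Hypothetical placeholder; replace with the public table named in the tutorial.
PLACEHOLDER_TABLE = "my-project.my_dataset.platinum_genomes_variants"


def count_rows_sql(table: str = PLACEHOLDER_TABLE) -> str:
    """Build the simple row-count query shown first in the demo."""
    return f"SELECT COUNT(1) AS number_of_rows FROM `{table}`"


def calls_per_sample_sql(table: str = PLACEHOLDER_TABLE) -> str:
    """Build a query that counts variant calls per sample.

    Assumes the variants schema stores calls in a repeated `call` record
    whose `name` field holds the sample name.
    """
    return (
        "SELECT c.name AS call_name, COUNT(c.name) AS call_count\n"
        f"FROM `{table}` AS v, UNNEST(v.call) AS c\n"
        "GROUP BY call_name\n"
        "ORDER BY call_name"
    )


def estimate_bytes(sql: str) -> int:
    """Dry-run a query and return the bytes it would scan (nothing is billed).

    Requires the google-cloud-bigquery package and GCP credentials.
    """
    from google.cloud import bigquery  # imported lazily so the sketch loads without it

    client = bigquery.Client()
    config = bigquery.QueryJobConfig(dry_run=True, use_query_cache=False)
    job = client.query(sql, job_config=config)
    return job.total_bytes_processed


if __name__ == "__main__":
    # Printing only; call estimate_bytes(...) yourself once credentials are set up.
    print(count_rows_sql())
    print(calls_per_sample_sql())
```

The dry run mirrors what the console does when it tells you how much data will be processed, which is the number to check against your learning budget.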