From the course: Cloud Hadoop: Scaling Apache Spark


Spark architecture for genomics


- [Instructor] In this next scenario, we're going to look at genomic variant pipelining that includes Hadoop and Spark. In earlier movies in this course, we talked about augmenting Hadoop libraries, such as Spark, with additional open source or commercial libraries, and I showed and talked a little bit about ADAM for genomic processing. You may remember that the ADAM set of libraries, which wrap around Spark, include domain-specific implementations of items such as schemas for the incoming files, which come in specific formats. You can see SAM, BAM, or VCF. These files would be coming in from genomic sequencing machines, such as those made by Illumina. This is a simplified pipeline. You see the source files coming directly into Amazon S3 (this is an Amazon implementation), and the focus here is showing that the ADAM libraries run on top of an Amazon EMR cluster, which is running Spark. In…
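
To make the idea of a schema for an incoming file a little more concrete, here is a minimal sketch in plain Python of turning VCF data lines into typed records. It is not ADAM's API and it does not run on Spark; the names `Variant`, `parse_vcf_line`, and `parse_vcf` are hypothetical, chosen only for illustration, and the sample record is made up. It assumes the standard tab-separated VCF column order (CHROM, POS, ID, REF, ALT, QUAL, FILTER, INFO). In the pipeline described above, this kind of work would be handled by the ADAM libraries, distributed across the EMR cluster, with the files read from S3.

```python
from dataclasses import dataclass
from typing import Dict, Iterable, List


@dataclass
class Variant:
    """Hypothetical, simplified record for one VCF data line."""
    chrom: str
    pos: int
    ref: str
    alt: List[str]
    info: Dict[str, str]


def parse_vcf_line(line: str) -> Variant:
    """Parse one tab-separated VCF data line into a Variant."""
    fields = line.rstrip("\n").split("\t")
    # The first eight VCF columns are fixed; any sample columns are ignored here.
    chrom, pos, _id, ref, alt, _qual, _filter, info = fields[:8]
    info_map: Dict[str, str] = {}
    for entry in info.split(";"):
        if entry and entry != ".":
            # Flag-style entries (no "=") are stored with an empty value.
            key, _, value = entry.partition("=")
            info_map[key] = value
    return Variant(chrom, int(pos), ref, alt.split(","), info_map)


def parse_vcf(lines: Iterable[str]) -> List[Variant]:
    """Skip blank lines and header lines (starting with '#'), parse the rest."""
    return [
        parse_vcf_line(line)
        for line in lines
        if line.strip() and not line.startswith("#")
    ]


# Made-up sample input: two header lines and one data line.
sample = [
    "##fileformat=VCFv4.2\n",
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n",
    "chr1\t12345\trs1\tA\tG,T\t50\tPASS\tDP=14;DB\n",
]
variants = parse_vcf(sample)
```

The point of the typed record is the same one the schema makes in the pipeline: once every line has a known shape, downstream steps can filter or group by fields such as `chrom` and `pos` without re-parsing text.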
