Modern Hadoop and Spark - Apache Spark Tutorial

From the course: Cloud Hadoop: Scaling Apache Spark

- [Instructor] As we begin our journey looking at advanced Hadoop, we're going to start with what I call modern Hadoop. You might be surprised to know that Hadoop, both open source and commercial, is over 10 years old. What I'm finding, though, is that innovations are driving the adoption velocity: innovations in public cloud services around Hadoop, such as those offered not only by Amazon but also by Microsoft and Google, and also innovations in new devices and domains. Ones that I've had experience with over the last 12 to 18 months are IoT, or Internet of Things, and genomics, around bioinformatics for personalized cancer treatment. We'll talk about these domains, but I'll also generalize, because the maturity of the Hadoop ecosystem gives it more and more applicability.

Let's look at the core components, kind of as a review: storage, compute, and management. In the area of storage, we have the Hadoop Distributed File System, or HDFS. Then we have some vendor optimizations. One we'll be using in this course comes from the vendor Databricks, which offers a commercial version of Hadoop with the Apache Spark library; its storage layer is the Databricks File System, or DBFS. We'll also look at using cloud-based file systems, such as S3 from Amazon and Google Cloud Storage. These are often called data lakes.

In the compute area, my expectation, as I mentioned in an earlier movie, is that you will be familiar with the core MapReduce paradigm that Hadoop was originally built on. We're going to be focusing on some of the newer compute libraries that are available. In particular, we're going to look at Apache Spark, which allows compute processes to be run in the memory of the worker nodes and significantly increases the processing speed of Hadoop jobs.
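
To ground the storage and compute pieces above, here is a minimal PySpark sketch (not from the course itself). The bucket names and file paths are hypothetical placeholders, and it assumes a Spark environment with the appropriate cloud-storage connectors and credentials already configured. It shows that the same DataFrame read call works against HDFS, DBFS, S3, or Google Cloud Storage by changing only the URI scheme, and how caching keeps data in worker-node memory.

from pyspark.sql import SparkSession

# Start (or reuse) a Spark session.
spark = SparkSession.builder.appName("modern-hadoop-demo").getOrCreate()

# The same DataFrame reader works against any of the storage layers
# mentioned above; only the URI scheme changes. The buckets and paths
# here are hypothetical.
events = spark.read.csv("s3a://example-bucket/events/*.csv", header=True)   # Amazon S3
# events = spark.read.csv("hdfs:///data/events/*.csv", header=True)         # HDFS
# events = spark.read.csv("dbfs:/data/events/*.csv", header=True)           # Databricks DBFS
# events = spark.read.csv("gs://example-bucket/events/*.csv", header=True)  # Google Cloud Storage

# cache() marks the DataFrame to be kept in worker-node memory.
events.cache()

print(events.count())  # first action: reads from storage and fills the cache
print(events.count())  # second action: served from memory, no storage reread

The second count() is served from the workers' memory rather than rereading from storage, which is the in-memory speedup over disk-based MapReduce described above.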
