From the course: Cloud Hadoop: Scaling Apache Spark


Review batch architecture for ETL on AWS


- [Instructor] The first Hadoop pipeline architecture we're going to examine is a fairly traditional one. It uses batch extract, transform, and load, so it's not using streaming and it's not using Just-in-Time. However, it does leverage some services and processes in the cloud. This one happens to be running on the Amazon cloud, and it's about augmenting and cleaning healthcare data. You can follow the associated link if you want to read the underlying use case in more detail, but let's look at the architecture. We start with a manifest file, which is stored in S3 file storage. That's going to trigger compute, or processing, and this is run in a microservices, or Lambda, architecture. Now, that's separate from the whole data world. It's interesting to note that Lambdas are being used more and more in these data pipelines because they're efficient, more efficient than using a virtual machine.…
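To make the trigger step concrete, here is a minimal Python sketch of a Lambda-style handler that reacts to a manifest file landing in S3. It is an illustration only, not the pipeline from the course: the `manifest.json` suffix, the function names, and the returned shape are all assumptions. The only thing taken as given is the standard S3 event notification layout (`Records[].s3.bucket.name` and `Records[].s3.object.key`, with the key URL-encoded).

```python
from urllib.parse import unquote_plus

# Hypothetical naming convention for manifest files (an assumption for this sketch).
MANIFEST_SUFFIX = "manifest.json"


def extract_manifests(event):
    """Return (bucket, key) pairs for manifest objects named in an S3 event."""
    manifests = []
    for record in event.get("Records", []):
        s3_info = record.get("s3", {})
        bucket = s3_info.get("bucket", {}).get("name")
        # Object keys arrive URL-encoded in S3 event notifications.
        key = unquote_plus(s3_info.get("object", {}).get("key", ""))
        if bucket and key.endswith(MANIFEST_SUFFIX):
            manifests.append((bucket, key))
    return manifests


def handler(event, context):
    """Lambda entry point: report which manifests would start a batch ETL run."""
    manifests = extract_manifests(event)
    # A real pipeline would start the batch job here; this sketch only
    # returns what it would have processed.
    return {"manifests": [{"bucket": b, "key": k} for b, k in manifests]}
```

Because the handler only runs when an object arrives, no virtual machine has to sit idle waiting for the next batch, which is the efficiency point made in the transcript.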
