From the course: Data Science Tools of the Trade: First Steps


Distributed processing with MapReduce


- HDFS stands for Hadoop Distributed File System. To conduct distributed processing on data stored in HDFS, we need MapReduce. MapReduce is a batch processing solution. Batch processing works on data sets that don't need to be processed immediately when a transaction occurs. Let's say that an e-commerce company stores all its online customer transactions in a database, and an executive wants a weekly sales report. You can run a batch job every Sunday that processes all the purchase data for that week to produce the necessary information, such as market trends. Since batch jobs don't require real-time processing, MapReduce has the luxury of spending time splitting a big data set into smaller, more manageable chunks. It can then move these newly created data sets across multiple computers until they are ready to be collapsed into a desired result. The splitting and shuffling part of MapReduce is handled by map tasks, while the collapsing part is handled by reduce tasks. To help you…
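The map, shuffle, and reduce phases described above can be sketched in plain Python on a single machine. This is only a minimal illustration of the idea, not Hadoop's actual API: the transaction records and the function names (`map_task`, `shuffle`, `reduce_task`) are hypothetical, and the weekly sales report scenario follows the example in the transcript. In a real cluster, the shuffle step would move grouped data between nodes.

```python
from collections import defaultdict

# Hypothetical weekly transaction records: (product, sale amount) pairs.
transactions = [
    ("laptop", 1200.0), ("phone", 800.0),
    ("laptop", 1150.0), ("tablet", 300.0),
]

def map_task(record):
    # Map task: emit a (key, value) pair for each input record.
    product, amount = record
    return (product, amount)

def shuffle(pairs):
    # Shuffle: group mapped values by key. In a real cluster this step
    # moves data between machines so each reducer sees one key's values.
    groups = defaultdict(list)
    for key, value in pairs:
        groups[key].append(value)
    return groups

def reduce_task(key, values):
    # Reduce task: collapse each group of values into a single result.
    return (key, sum(values))

mapped = [map_task(r) for r in transactions]
grouped = shuffle(mapped)
weekly_totals = dict(reduce_task(k, v) for k, v in grouped.items())
print(weekly_totals)  # {'laptop': 2350.0, 'phone': 800.0, 'tablet': 300.0}
```

Because each map task and each reduce task only needs its own chunk of data, a framework like Hadoop can run many of them in parallel across a cluster, which is what makes the batch job scale.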
