From the course: Data Science Tools of the Trade: First Steps
Distributed processing with MapReduce
- HDFS stands for Hadoop Distributed File System. To conduct distributed processing on HDFS, we need MapReduce. MapReduce is a batch processing solution. Batch processing involves a data set that doesn't need to be processed immediately when a transaction occurs. Let's say that an e-commerce company stores all its online customer transactions in a database. Imagine that an executive wants a weekly sales report. You can run a batch job every Sunday to process all the purchase data for that particular week and produce the necessary information, such as market trends. Since batch jobs don't require real-time processing, MapReduce has the luxury of spending time splitting a big data set into smaller, more manageable chunks. It can then move these newly created data sets across multiple computers until they are ready to be collapsed into a desired result. The splitting and shuffling parts of MapReduce are called map tasks, while the collapsing part is referred to as reduce tasks. To help you…
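The map, shuffle, and reduce phases described above can be sketched in plain Python, using the weekly sales report as the example. This is a single-machine illustration of the idea, not real Hadoop code; the product categories and sale amounts below are made up for illustration.

```python
from collections import defaultdict

# Hypothetical weekly purchase records: (product_category, sale_amount)
transactions = [
    ("books", 12.99), ("electronics", 199.00),
    ("books", 8.50), ("toys", 24.99), ("electronics", 49.99),
]

# Map task: emit a (key, value) pair for each input record
mapped = [(category, amount) for category, amount in transactions]

# Shuffle: group all values by key, as MapReduce does between the
# map and reduce phases (in Hadoop this happens across machines)
shuffled = defaultdict(list)
for key, value in mapped:
    shuffled[key].append(value)

# Reduce task: collapse each group into a single result,
# here the total sales per category
totals = {key: sum(values) for key, values in shuffled.items()}
print(totals)
```

In a real cluster, the map tasks run in parallel on the machines holding each chunk of data, and the shuffle moves intermediate pairs over the network so that all values for one key land on the same reducer.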