From the course: Data Science Tools of the Trade: First Steps

Distributed processing with Spark

- Unlike MapReduce, Spark is capable of stream processing. Stream processing refers to real-time handling of data. It's ideal when you need instant feedback from a data analytics tool. Let's say that you are developing an anomaly detection tool. In this scenario, you cannot afford to wait until the end of the week when your batch job is supposed to run. You need to respond to an anomaly immediately.

What makes Spark fast is its in-memory processing. The processing is done in main memory, which is much faster than storage devices, but the drawback is the cost: memory chips are more expensive than hard drives.

MapReduce is the default distributed processing solution for Hadoop, but Hadoop allows you to use Spark instead. In fact, there are multiple ways you can use Spark with Hadoop. The first option is using Spark in standalone mode. In this mode, you can run Spark alongside an existing Hadoop installation. To access HDFS from Spark, you just need an HDFS URL. The…
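As a rough illustration of accessing HDFS from Spark through a URL, here is a minimal Python sketch. The host name, port 8020, and file path are placeholders rather than values from the course, and `hdfs_url`, `count_lines`, and `main` are hypothetical helper names. The only Spark calls used are `SparkSession.builder`, `spark.read.text`, and `DataFrame.count` from PySpark.

```python
def hdfs_url(host: str, port: int, path: str) -> str:
    """Build an hdfs:// URL for a file or directory (hypothetical helper)."""
    if not path.startswith("/"):
        path = "/" + path
    return f"hdfs://{host}:{port}{path}"


def count_lines(spark, url: str) -> int:
    """Count the lines of a text file; `spark` is a SparkSession."""
    # spark.read.text yields a DataFrame with one row per line of input.
    return spark.read.text(url).count()


def main() -> None:
    # Imported here so the helpers above can be used without PySpark installed.
    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("hdfs-read-sketch").getOrCreate()
    try:
        # Assumed NameNode address and path; replace with your cluster's values.
        url = hdfs_url("namenode.example.com", 8020, "/data/events.log")
        print(count_lines(spark, url))
    finally:
        spark.stop()
```

Calling `main()` requires PySpark and a reachable HDFS NameNode. The URL helper works on its own.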
