From the course: Data Science Tools of the Trade: First Steps

Distributed file systems

- Distributed file systems are becoming important in data science because of the sheer volume of data that needs to be processed these days. Beyond volume, we also need to address the velocity and variety of data. Think about it: we generate data every passing moment from highly diverse sources, such as social media and various sensors. The conventional ways of storing and processing data are no longer adequate for this kind of big data, which is why we need a specialized solution like a distributed file system.

The Hadoop Distributed File System, or HDFS, is one of the most widely used distributed file system technologies. One of Hadoop's strengths lies in its use of commodity computers, which keeps costs down significantly. You don't need any special hardware and can even use your home PCs to build a Hadoop cluster. Here, the term cluster simply refers to a group of computers connected through a communication network to work on a given task. Since it isn't feasible to handle big data on a single computer, HDFS breaks the data into smaller, manageable chunks to be stored and processed across a cluster of many ordinary computers. To help with this, Hadoop has a mechanism called MapReduce that manages the process of distributing the workload and collecting the results of the processing done on each individual machine.

In addition to MapReduce, a number of software tools are available to support the Hadoop ecosystem. For example, Spark is a flexible alternative to MapReduce that can be used side by side with Hadoop, as the sketch below illustrates. If you need data warehouse software that can leverage a Hadoop cluster, you can use Hive.

The nice thing about Hadoop is that it is an open-source solution, so anybody can use it for free. The challenge, however, is the total cost of ownership: it takes extensive expertise to install and maintain a Hadoop system. As always, there are pros and cons to every situation.
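
To make the MapReduce idea more concrete, here is a minimal word-count sketch written in PySpark, illustrating Spark working side by side with data stored in HDFS. This example is an assumption added for illustration, not part of the course: the HDFS URI, the namenode address, and the file name are hypothetical placeholders, and it presumes a working Spark installation (for example via `pip install pyspark`).

```python
# Minimal word-count sketch in PySpark (hypothetical paths and cluster address).
# The "map" step emits (word, 1) pairs; the "reduce" step sums counts per word,
# mirroring how MapReduce distributes work across a cluster and gathers results.
from pyspark.sql import SparkSession

# Start (or reuse) a Spark session; "local[*]" runs on all local cores.
spark = (
    SparkSession.builder
    .appName("WordCountSketch")
    .master("local[*]")
    .getOrCreate()
)

# Hypothetical HDFS location; on a real cluster the namenode host and port
# would come from your Hadoop configuration.
input_path = "hdfs://namenode:9000/data/sample.txt"

lines = spark.sparkContext.textFile(input_path)

counts = (
    lines.flatMap(lambda line: line.split())   # map: split each line into words
         .map(lambda word: (word, 1))          # emit (word, 1) pairs
         .reduceByKey(lambda a, b: a + b)      # reduce: sum counts per word
)

# Bring a small sample of results back to the driver and print them.
for word, count in counts.take(10):
    print(word, count)

spark.stop()
```

The same logic could be written as a classic Hadoop MapReduce job in Java, but Spark keeps intermediate results in memory rather than writing them back to disk between stages, which is one reason it is often preferred as an alternative to MapReduce.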