Learn about file systems and their strengths and weaknesses.
- [Instructor] Let us take a look at various technology options available for data storage, starting with HDFS, or Hadoop Distributed File System. HDFS is a massively scalable, distributed file system. It stores files in directories. Being distributed means it can span across hundreds of nodes. Each file is stored in a redundant fashion across the network. HDFS can store any kind of file, including audio and video files.
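To make the redundancy point concrete, here is a minimal sketch of how much raw disk a single file consumes once it is split into blocks and replicated. The 128 MiB block size and replication factor of 3 are assumed values (common HDFS defaults, not figures given in this talk), and `storage_footprint` is a hypothetical helper written for illustration, not part of any Hadoop API.

```python
import math

# Assumed values, not stated in the talk: 128 MiB blocks, 3 copies of each block.
BLOCK_SIZE = 128 * 1024 * 1024
REPLICATION = 3


def storage_footprint(file_size_bytes, block_size=BLOCK_SIZE, replication=REPLICATION):
    """Return (logical_blocks, stored_replicas, raw_bytes) for one file."""
    # A file is split into fixed-size blocks; the last block may be partial.
    blocks = math.ceil(file_size_bytes / block_size)
    # Every block is stored `replication` times on different nodes.
    replicas = blocks * replication
    # Partial blocks only occupy their actual size, so raw usage is size * copies.
    raw_bytes = file_size_bytes * replication
    return blocks, replicas, raw_bytes


one_gib = 1024 ** 3
blocks, replicas, raw = storage_footprint(one_gib)
print(blocks, replicas, raw / one_gib)  # 8 24 3.0
```

Under these assumptions, a 1 GiB file becomes 8 blocks, 24 stored replicas, and 3 GiB of raw disk, which is the price paid for the built-in redundancy.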
HDFS can be deployed over commodity hardware and provides a cheap, online storage option for massive quantities of data. HDFS also provides streaming access to data, so reading files can be done quickly, even for very large files. Products like Hive and Impala are available to provide an SQL-like query interface to data stored in HDFS. What are the strengths of HDFS? HDFS provides linear scaling across hundreds of nodes, providing the ability to store terabytes and petabytes of data.
It provides built-in redundancy of data storage, so no backups of data are required. HDFS has a fairly good security framework that allows permissions to be controlled in a way similar to UNIX file systems. HDFS provides high availability of data, with very few cases of service failure. What are its shortcomings? Files in HDFS can only be added and deleted.
There are no in-place updates to files available, so it is not suited for data that requires frequent updates. Limited querying capabilities are available. MapReduce programs are needed to run queries. Hive and Impala overcome these limitations only partially. HDFS is not suited for small quantities of data, since there is always overhead for data storage and querying. HDFS is best used for raw dumps.
HDFS accommodates any format, which makes raw dumps easy. It is suited for storing media files, like audio and video. HDFS can also be used as an online backup alternative to offline backups. It provides this capability with cheap hardware, and provides query capabilities too.
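The earlier point that files can only be added and deleted, never updated in place, can be illustrated with a toy model. The sketch below is an in-memory stand-in, not the HDFS client API: `AddDeleteStore`, `rewrite`, and the file path are hypothetical names chosen for illustration. It shows why frequent updates are costly: changing one value means writing a full new copy of the file.

```python
class AddDeleteStore:
    """Toy in-memory store: files can be added, read, or deleted, never edited."""

    def __init__(self):
        self._files = {}

    def add(self, path, data):
        if path in self._files:
            # No overwrite: an existing file cannot be modified in place.
            raise FileExistsError(path)
        self._files[path] = bytes(data)

    def read(self, path):
        return self._files[path]

    def delete(self, path):
        del self._files[path]


def rewrite(store, path, transform):
    """Emulate an update by writing a transformed copy and replacing the original."""
    tmp = path + ".tmp"
    store.add(tmp, transform(store.read(path)))  # full copy of the file
    store.delete(path)
    store.add(path, store.read(tmp))
    store.delete(tmp)


store = AddDeleteStore()
store.add("/raw/events.csv", b"id,status\n1,new\n")
rewrite(store, "/raw/events.csv", lambda data: data.replace(b"new", b"done"))
print(store.read("/raw/events.csv"))  # b'id,status\n1,done\n'
```

Even a one-word change rewrites the whole file here, which is why this storage style fits raw dumps and media files better than frequently updated records.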
Kumaran Ponnambalam begins by discussing the roles of databases in data science, as well as the key feature and performance requirements for databases in this field. Next, Kumaran goes over different database types, sharing the strengths and weaknesses of each one. To wrap up, he walks through specific use cases and shows how to select the best database technology for each situation.