Storage formats

From the course: Big Data Analytics with Hadoop and Apache Spark

- [Instructor] In this chapter, I will review the options available and best practices for storing data in HDFS, starting with storage formats in this video. HDFS supports a variety of storage formats, each with its own advantages and use cases. The list includes raw text files, structured text files like CSV, XML, and JSON, native sequence files, Avro files, ORC files, and Parquet files. I will review the most popular ones for analytics now.

Text files carry the same format they have in a normal file system and are stored as a single physical file in HDFS. They perform poorly, as they do not support parallel operations, they require more storage, and they do not have a schema. In general, they are not recommended.

Avro files support language-neutral data serialization, so data written with one language can be read with another without problems. Data is stored row by row, like CSV files. Avro files carry a self-describing schema, which is used to enforce constraints on the data. They are compressible, and hence can optimize storage. They are splittable into partitions, and hence can help with parallel reads and writes. They are ideal for situations that require multi-language support.

Parquet files store data column by column, similar to columnar databases. This means each column can be read separately from disk without reading the other columns, which saves on I/O. They support schemas, and they are both compressible and splittable, so they are optimized for both performance and storage. They also support nested data structures. For these reasons, Parquet files are ideal for batch analytics jobs. Analytics applications typically store data as records and columns, similar to RDBMS tables, and Parquet provides better overall performance and flexibility for these applications. I will show later in the course how Parquet enables parallelization and I/O optimization.
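The transcript above compares the formats in words; the following is a minimal PySpark sketch of writing the same small DataFrame as CSV, Avro, and Parquet. The application name, the sample data, and the HDFS paths are hypothetical, and writing Avro assumes the external spark-avro package is on the classpath; treat this as an illustration of the format options rather than code from the course itself.

```python
from pyspark.sql import SparkSession

# Hypothetical session; the app name is a placeholder.
spark = (SparkSession.builder
         .appName("storage-format-demo")
         .getOrCreate())

# Sample records shaped like a typical analytics table.
df = spark.createDataFrame(
    [(1, "alice", 34.5), (2, "bob", 27.1), (3, "carol", 41.9)],
    ["id", "name", "score"],
)

# Structured text (CSV): row-oriented, no embedded schema.
df.write.mode("overwrite").csv("hdfs:///demo/users_csv", header=True)

# Avro: row-oriented with a self-describing schema; requires the
# external spark-avro package to be available to Spark.
df.write.mode("overwrite").format("avro").save("hdfs:///demo/users_avro")

# Parquet: column-oriented, compressible, and splittable.
df.write.mode("overwrite").parquet("hdfs:///demo/users_parquet")
```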

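To make the I/O claim about columnar storage concrete, here is a short follow-up sketch (reusing the hypothetical Parquet path from above): reading back only one column and printing the physical plan, whose ReadSchema shows that only that column is scanned from disk.

```python
# Column pruning: only the "score" column is requested, so Parquet
# lets Spark skip the other columns on disk entirely.
scores = spark.read.parquet("hdfs:///demo/users_parquet").select("score")

# The physical plan reports a ReadSchema containing only "score",
# which is where the I/O savings of a columnar format come from.
scores.explain()
scores.show()
```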