From the course: Data Ingestion with Python


Working in Parquet, Avro, and ORC


- [Instructor] Some big-data systems, such as Hadoop, Hive, and others, store data in files. They started out working with text files, mostly CSV. However, over time, processing these text files became a performance bottleneck, and new, more efficient file formats came to life. There are a few such formats, including Parquet, Avro, ORC, and others. We'll see an example using Parquet, but the idea is the same for all of them: find the library for the file format and load the data into Pandas. In our case, we're going to use the Apache Arrow library. Its development is led by Wes McKinney, the creator of Pandas. So, we import pyarrow.parquet as pq, and then we say table = pq.read_table('taxi.parquet'), and this table is a Parquet table. Now we need to convert it to a Pandas data frame: df = table.to_pandas(). And now if we look at the dtypes, we see that we have the right types, and we can look at the head of the data frame, and everything looks nice.
