From the course: Data Ingestion with Python
Working in Parquet, Avro, and ORC - Python Tutorial
- [Instructor] Some big-data systems, such as Hadoop, Hive, and others, store data in files. They started out working with text files, mostly CSV. Over time, however, processing these text files became a performance bottleneck, and new, more efficient file formats came to life. There are a few of these formats, such as Parquet, Avro, ORC, and others. We'll see an example using Parquet, but the idea is the same for all of them: find the library for the file format and load the data into pandas. In our case, we're going to use the Apache Arrow library, whose development is led by Wes McKinney, the creator of pandas. So, we import pyarrow.parquet as pq, and then we say table = pq.read_table('taxi.parquet'). This table is a Parquet table, so we need to convert it to a pandas DataFrame: df = table.to_pandas(). Now if we look at the dtypes, we see that we have the right types, and if we look at the head of the DataFrame, everything looks nice.