Learn how to load data onto a cluster's HDFS.
- [Voiceover] Data ingestion. You've got a cluster, you've got data. How do you get the two to line up? How do you get your data onto your cluster? That's what data ingestion is all about. Very often your data already exists on another cluster, which is commonly the case for people in a professional environment who use Hadoop for their job. In those cases, often all you need to do is perform a distcp. You can run hadoop distcp --help, and that will give you all the information you could possibly need about distcp. All you need is the path to where your data is, the NameNode address for the cluster you'd like your data copied to, and the directory path on that remote cluster where you'd like the data to end up.
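As a rough sketch, a cluster-to-cluster copy looks like the following. The NameNode hostnames, port, and directory paths here are placeholders, not values from the course; substitute your own.

```shell
# Hypothetical hosts and paths -- replace with your own NameNode addresses
# and directories before running.

# Copy /data/logs from the source cluster into /ingest/logs on the target
# cluster, identifying each cluster by its NameNode address.
hadoop distcp \
  hdfs://source-namenode:8020/data/logs \
  hdfs://target-namenode:8020/ingest/logs
```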
And so what distcp will do is launch a MapReduce job that will take all of this data and stream it from one cluster to another. So that's one way to get data to your cluster if it's already on another cluster. Very often, though, people just have data on a local file system and they want to put it on HDFS. In that case, you can use the hadoop command: you say hadoop fs -put, then give the path in the local file system to the data that you want to put on HDFS, followed by the path on HDFS where you would like this data to be copied.
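Here is a minimal sketch of that local-to-HDFS upload. The local and HDFS paths are hypothetical placeholders; adjust them to your own environment.

```shell
# Hypothetical paths -- adjust to your own environment before running.

# Create a destination directory on HDFS, then upload a local directory
# into it with -put.
hadoop fs -mkdir -p /user/demo/ingest
hadoop fs -put /tmp/local-data /user/demo/ingest

# List the destination to verify the files arrived.
hadoop fs -ls /user/demo/ingest/local-data
```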
In general, HDFS will hold any data, but the format of that data matters if you want to read it in MapReduce. So if you have binary files on HDFS, say, you'd better have some loading function that can read those binary files in a way that makes sense to Pig. In this course, we'll be using PigStorage, which just lets you specify a delimiter. We've chosen to use comma-separated value (CSV) files, but with PigStorage you can also do tab-delimited files; you can do all sorts of things.
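A short sketch of what that looks like, run here in Pig's local mode for illustration. The file name sales.csv and its schema are hypothetical placeholders, not from the course.

```shell
# Hypothetical file and schema -- sales.csv is a placeholder CSV on the
# local file system (Pig's -x local mode reads local files rather than HDFS).
pig -x local <<'EOF'
-- PigStorage(',') tells Pig to split each input line on commas;
-- PigStorage('\t') would read tab-delimited files instead.
sales = LOAD 'sales.csv' USING PigStorage(',')
        AS (id:int, item:chararray, amount:double);
DUMP sales;
EOF
```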
In this course, software engineer and data scientist Jack Dintruff goes beyond the basic capabilities of Hadoop. He demonstrates hands-on, project-based, practical skills for analyzing data, including how to use Pig to analyze large datasets and how to use Hive to manage large datasets in distributed storage. Learn how to configure the Hadoop distributed file system (HDFS), perform processing and ingestion using MapReduce, copy data from cluster to cluster, create data summarizations, and compose queries.
- Setting up and administering clusters
- Ingesting data
- Working with MapReduce, YARN, Pig, and Hive
- Selecting and aggregating large datasets
- Defining limits, unions, filters, and joins
- Writing custom user-defined functions (UDFs)
- Creating queries and lookups