Join Jack Dintruff for an in-depth discussion in this video, Next steps, part of Data Analysis on Hadoop.
- [Voiceover] So now you have literally all of the skills that you need to process data in Pig and Hive using all of the built-in functionality that exists. But there is so much more to explore now that you have all of these tools in your toolkit. So first of all, you can create a pipeline that will have multiple jobs in it, where one job outputs some data, and then the next job brings that data in and outputs more data, and eventually you get what you care about. This is what professional production pipelines look like: they have many, many steps and many tiers, and so those are things that people in the professional world use.
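The multi-job pipeline described above can be sketched in a few lines. This is a minimal illustration in plain Python rather than Pig or Hive, assuming two hypothetical stages (`filter_job` and `count_job`), made-up file names, and made-up sample rows; none of these come from the course. The point is only the shape: each job writes its output to storage, and the next job reads that output as its input.

```python
import csv
import tempfile
from collections import Counter
from pathlib import Path


def filter_job(src: Path, dst: Path, tag: str) -> None:
    """Stage 1 (hypothetical): keep only rows whose 'tag' column matches."""
    with src.open(newline="") as fin, dst.open("w", newline="") as fout:
        reader = csv.DictReader(fin)
        writer = csv.DictWriter(fout, fieldnames=reader.fieldnames)
        writer.writeheader()
        writer.writerows(row for row in reader if row["tag"] == tag)


def count_job(src: Path, dst: Path) -> None:
    """Stage 2 (hypothetical): read stage 1's output and count rows per user."""
    with src.open(newline="") as fin:
        counts = Counter(row["user"] for row in csv.DictReader(fin))
    with dst.open("w", newline="") as fout:
        writer = csv.writer(fout)
        writer.writerow(["user", "posts"])
        writer.writerows(sorted(counts.items()))


def run_pipeline(workdir: Path) -> list[list[str]]:
    """Chain the stages: raw -> filtered -> counted."""
    raw = workdir / "raw.csv"
    # Made-up sample input standing in for a real data set.
    raw.write_text("user,tag\nann,android\nbob,java\nann,android\ncat,android\n")
    filtered = workdir / "filtered.csv"
    counted = workdir / "counted.csv"
    filter_job(raw, filtered, "android")  # job 1 writes filtered.csv
    count_job(filtered, counted)          # job 2 consumes job 1's output
    with counted.open(newline="") as f:
        return list(csv.reader(f))


with tempfile.TemporaryDirectory() as tmp:
    result = run_pipeline(Path(tmp))
    print(result)
```

In a real Hadoop setting the intermediate files would typically live in HDFS and each stage would be a Pig script or Hive query, but the hand-off between stages follows the same pattern.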
Then, yeah, find data that you actually care about. I just sort of arbitrarily chose the Stack Overflow data set because I needed a DB with users and it was an easy one that was available. However, very often, people aren't exactly interested in the Android community on Stack Overflow. So what you can do is just go find data that you actually care about. There are a huge number of open source, public data sets…
In this course, software engineer and data scientist Jack Dintruff goes beyond the basic capabilities of Hadoop. He demonstrates hands-on, project-based, practical skills for analyzing data, including how to use Pig to analyze large datasets and how to use Hive to manage large datasets in distributed storage. Learn how to configure the Hadoop distributed file system (HDFS), perform processing and ingestion using MapReduce, copy data from cluster to cluster, create data summarizations, and compose queries.
- Setting up and administrating clusters
- Ingesting data
- Working with MapReduce, YARN, Pig, and Hive
- Selecting and aggregating large datasets
- Defining limits, unions, filters, and joins
- Writing custom user-defined functions (UDFs)
- Creating queries and lookups