Learn how to write a Python UDF.
- [Voiceover] So, all we're going to do in this script is register the UDF that we've already created. Then we're going to load up the users database and generate the ID for every single user, as well as a bag of names split on spaces, so that we get their first name and last name separately. What we're going to do next is a nested FOREACH, so that we can get the unique names by calling DISTINCT after grouping all the names together. Then we're just going to generate those unique names and store them out. Since this is likely to be a very large output, we're again going to use comma-separated values.
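The flow just described might look roughly like this in Pig Latin. This is a sketch, not the course's actual script: the file names, relation aliases, namespace, and the UDF name `split_names` are all illustrative.

```pig
-- Register the Python UDF (file name and namespace are hypothetical).
REGISTER 'udfs.py' USING jython AS udfs;

-- Load up the users database.
users = LOAD 'users.csv' USING PigStorage(',') AS (id:int, name:chararray);

-- Generate each user's ID plus a bag of names split on spaces.
user_names = FOREACH users GENERATE id, udfs.split_names(name) AS names;

-- Group all the names together, then use a nested FOREACH with
-- DISTINCT to keep only the unique names.
grouped = GROUP user_names ALL;
unique_names = FOREACH grouped {
    all_names = FOREACH user_names GENERATE FLATTEN(names);
    dn = DISTINCT all_names;
    GENERATE FLATTEN(dn);
};

-- Store the (potentially very large) output as comma-separated values.
STORE unique_names INTO 'unique_names_out' USING PigStorage(',');
```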
So here is the UDF itself; you can see it's quite simple. It is written in Python, and we had to designate an output schema, which is going to be a bag of tuples containing names. We are just passing this function a string, and it's going to generate a bag. If the string is empty, we just return that empty bag; but if it's not, we're going to go on and split up that string by spaces.
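A minimal sketch of a UDF along these lines, assuming Pig's Jython UDF conventions (the function name `split_names` and the schema field names are illustrative, not the course's actual code). Inside Pig, the `outputSchema` decorator is provided by the runtime; the small stub here only lets the sketch run standalone.

```python
# Stub of Pig's outputSchema decorator so this sketch runs outside Pig;
# in a real Jython UDF, Pig supplies this decorator itself.
def outputSchema(schema):
    def wrap(func):
        func.outputSchema = schema
        return func
    return wrap

@outputSchema("names:bag{t:tuple(name:chararray)}")
def split_names(s):
    # For a null or empty string, return an empty bag.
    if not s:
        return []
    # Otherwise split on spaces; each name becomes a
    # one-field tuple in the output bag.
    return [(name,) for name in s.split(' ')]
```

When Pig calls this function on a full name like `"Jane Doe"`, it receives a bag of two tuples, one per name part.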
In this course, software engineer and data scientist Jack Dintruff goes beyond the basic capabilities of Hadoop. He demonstrates hands-on, project-based, practical skills for analyzing data, including how to use Pig to analyze large datasets and how to use Hive to manage large datasets in distributed storage. Learn how to configure the Hadoop distributed file system (HDFS), perform processing and ingestion using MapReduce, copy data from cluster to cluster, create data summarizations, and compose queries.
- Setting up and administering clusters
- Ingesting data
- Working with MapReduce, YARN, Pig, and Hive
- Selecting and aggregating large datasets
- Defining limits, unions, filters, and joins
- Writing custom user-defined functions (UDFs)
- Creating queries and lookups