Join Jack Dintruff for an in-depth discussion in this video, Streaming UDF, part of Data Analysis on Hadoop.
- [Voiceover] So, there are two types of UDFs. The one that we've used before is a typical Pig UDF, where you give it an exact function and everything has to fit within this archetype. A streaming UDF, on the other hand, is much more generic, and very often you can write it in just about any language that you want. The reason is that it just reads from standard in and writes its output to standard out. So if you write a program with that functionality, you can use it as a streaming UDF, which allows you to use just about any language under the sun. One advantage of a streaming UDF is that there are a lot of built-in Bash functions that already read from standard in and write to standard out.
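To make that concrete, here's a minimal sketch in Pig Latin of the generic shape; my_filter.py and comments.csv are hypothetical stand-ins for any program that reads standard in and writes standard out, and for whatever file you're processing:

```
-- my_filter.py is a hypothetical script, in whatever language you like,
-- that reads lines from standard in and writes results to standard out
DEFINE my_filter `python my_filter.py` SHIP('my_filter.py');

comments = LOAD 'comments.csv' USING PigStorage(',');
cleaned  = STREAM comments THROUGH my_filter;
```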
So, if you can manipulate those, give them different parameters, and kind of finagle them into doing what you need, you can essentially create a UDF without writing any code, which is pretty cool. All you have is a little Bash one-liner. One of the disadvantages, however, is all of that I/O. If you're sending everything to standard in and then sending everything back out through standard out, that's significantly more I/O than a regular UDF would cause. For that reason, I don't want to say always, but I will say that I have yet to encounter a streaming UDF that is faster than a regular UDF.
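As an illustration (a hypothetical sketch, not the course's exact script), a stock Unix tool passed inline in backticks can serve as the entire UDF:

```
comments = LOAD 'comments.csv' USING PigStorage(',');

-- tr already reads standard in and writes standard out; this one-liner
-- uppercases every line without a single line of custom UDF code
shouting = STREAM comments THROUGH `tr a-z A-Z`;
```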
I think, in general, streaming UDFs are slower than if they were implemented as regular UDFs. Most UDFs are fastest when written in Java; probably the second fastest is Python, just because Python has to get translated and there's some overhead there. But for the most part, you wanna stick with a regular UDF. So what we're gonna do is load up the data set including the header line. That's gonna let us pass the whole file through either head or tail, to get either just the header or just the data.
In this case, we're going to just grab the header and dump that out to standard out. So, we're gonna go ahead and open up the streaming UDF script. It's very, very simple. We're just doing a load of the comments table, including the header, and this is where the magic comes in. We're going to stream the relation we just loaded through a command, giving it, with backticks, the Bash command that we would like everything to pass through. All this does is call head, which says, get me one line from the very top of this and then return.
And then we're just gonna dump that to standard out. Right, so we launched the job, and you can see again this is a map-only job, which makes sense because we're applying a UDF with a FOREACH; we're not doing any sort of reduce operation. The things that require reducers are groups, distincts, and order-bys. So we got as output the ID, post ID, and score, which is exactly what our header is for the comments file.
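Pieced together, the script being described would look something like the sketch below; the file name and delimiter are assumptions. One caveat: STREAM runs the command once per map task, so on a multi-split input, head -1 would emit one line per split.

```
-- load the comments file with its header line still in place
comments = LOAD 'comments.csv' USING PigStorage(',');

-- backticks hand each task's input to a Bash command; head -1 keeps the first line
header = STREAM comments THROUGH `head -1`;

-- tail -n +2 would instead keep everything except the header
-- data = STREAM comments THROUGH `tail -n +2`;

-- map-only: no GROUP, DISTINCT, or ORDER BY means no reducers
DUMP header;
```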