Learn how to execute a WordCount algorithm that operates on a data stream instead of bounded data.
- [Voiceover] So we've talked a lot about stream processing, but we haven't really seen it in action yet. So I thought now we could run the built-in WordCount algorithm on a data stream. The data stream here is going to come from a file that we create. I want to make sure that I'm in the directory where I extracted Kafka earlier, so I'll just take a look. Not there, so I'll go over to that directory. Now I'm going to create a file in here which just has a few lines in it, and I'm going to pipe that text out to a new file called file-input.
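A minimal sketch of that step: the exact lines of text don't matter, so the sample sentences below are just placeholders, and the file name `file-input.txt` is one common choice.

```shell
# Write a few lines of sample text to a file that will feed the stream.
# The content here is illustrative; any short text file works.
printf '%s\n' \
  "all streams lead to kafka" \
  "hello kafka streams" \
  "join kafka summit" > file-input.txt

# Sanity check: show what we just created.
cat file-input.txt
```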
From there, what I need to do is create a new topic that I'm going to use for my stream processing, and I'm going to do this using the kafka-topics script that comes with Kafka. I'll tell it to create the topic, point it at where ZooKeeper's running, and give it a replication factor of one and a single partition. I'm not really worried about replication here; I just want to illustrate how this WordCount algorithm works. For a name, I'll call it streams-file-input. So I'll just copy this into the command line here, and my new topic is created.
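The topic-creation command looks roughly like this, assuming you're in the Kafka installation directory and ZooKeeper is running locally on its default port (newer Kafka releases replace `--zookeeper` with `--bootstrap-server`):

```shell
# Create the input topic for the WordCount demo.
# Replication factor 1 and a single partition keep the example simple.
bin/kafka-topics.sh --create \
  --zookeeper localhost:2181 \
  --replication-factor 1 \
  --partitions 1 \
  --topic streams-file-input
```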
Okay, now we need to send the data into that topic. So we're going to run the console producer, and this time, instead of typing messages, I'm going to pipe in the data: I'll just use a less-than sign to redirect that text file into the producer. So I'll copy this onto our command line. All right, now the data is in and has been sent to the topic. Next we need to run our WordCount application. This is built into Kafka, and it expects all these parameters that I've set up above. Unlike other streaming jobs that will run forever, this one is going to terminate after a few seconds, because it's going to create a new topic and write its results into it.
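Those two steps can be sketched as follows, assuming a broker on localhost:9092 and the `file-input.txt` created earlier (newer Kafka releases use `--bootstrap-server` in place of `--broker-list`):

```shell
# Redirect the text file into the console producer, which sends each
# line as a message to the streams-file-input topic.
bin/kafka-console-producer.sh \
  --broker-list localhost:9092 \
  --topic streams-file-input < file-input.txt

# Run the bundled WordCount demo. It reads streams-file-input, counts
# word occurrences, writes results to streams-wordcount-output, and
# (in the release shown here) terminates on its own after a few seconds.
bin/kafka-run-class.sh org.apache.kafka.streams.examples.wordcount.WordCountDemo
```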
Then we can go set up a consumer and see those results. There were some warnings there, but don't worry, it did what it was supposed to do. Now all I need to do is fire up the consumer and take a look at what it created. The topic here is called streams-wordcount-output. Everything else is pretty much the same. We're passing in a formatter, which is just going to give us a nice table layout for the data that comes back, and we're going to print both the key and the value of each record.
And we have a key deserializer and a value deserializer to handle the conversion of that data into strings we can read on the screen. So I'm going to copy this and paste it into my terminal window. And you can see, there is the output of the WordCount algorithm that we ran. It processed all the data in that file and spit back each word with the number of times it appeared. This is the simplest and easiest way to demonstrate how you can apply different algorithms to data streams.
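The consumer command being described is roughly the following; the key (the word) is a String and the value (the count) is a Long, so each needs its own deserializer:

```shell
# Read the WordCount results from the output topic and print
# each record as "word<TAB>count" using the default formatter.
bin/kafka-console-consumer.sh \
  --bootstrap-server localhost:9092 \
  --topic streams-wordcount-output \
  --from-beginning \
  --formatter kafka.tools.DefaultMessageFormatter \
  --property print.key=true \
  --property print.value=true \
  --property key.deserializer=org.apache.kafka.common.serialization.StringDeserializer \
  --property value.deserializer=org.apache.kafka.common.serialization.LongDeserializer
```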