The Flink streaming environment is the workspace in which Flink jobs execute. Learn how to initialize and set up the streaming environment for a Flink job.
- [Instructor] In this video, I will set up a streaming file source for use in our later exercises. In the exercise project, the com.flintlearn.realtime.datasource package contains the stream source generators used in this course. I will first review the file stream data generator. The data generator produces audit-trail events, creating about a hundred files, one for each event, in the data/raw audit trail directory. Each event consists of a user, an operation, an entity, a current timestamp in epoch, a duration column, and a change count column. We use a random generator to pick these values from a list. The events are spaced between one and three seconds apart. Let's run this code to see the files that are created. Each time the script runs, it cleans out the destination directory and creates new files. As the files are being created, messages are also printed on the console. And we can inspect the files to see their contents.

Next, we look into how we can consume these files. The code for this is in the basic streaming operations class, under the chapter two package. I have commented out most of the code, and we'll uncomment it in future videos as we progress through the exercises. The first thing to do in any streaming job is to create a streaming environment. We do so using the getExecutionEnvironment method on the StreamExecutionEnvironment class. Depending upon whether it's running inside an IDE or as a job in a cluster, Flink automatically sets up the required pipelines for the tasks. We set the parallelism to one for now. This ensures that a single Flink task slot processes all events in the job, which makes the flow easier to follow because the events are processed in sequence. This setup code is common to all future examples. Next, we need to run the source generator to produce data while the streaming job runs. We do this by creating a new thread and running the file stream data generator in that thread.
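The event format described above can be sketched in plain Java. This is a minimal, illustrative version, not the course's actual generator: the class name, the value lists, and the CSV layout are assumptions; only the six columns (user, operation, entity, epoch timestamp, duration, change count) come from the video.

```java
import java.time.Instant;
import java.util.List;
import java.util.Random;

// Illustrative sketch of one audit-trail event record, assuming the
// six-column CSV layout described in the video. Names and value lists
// are hypothetical.
public class AuditEventSketch {

    static final List<String> USERS = List.of("Alice", "Bob", "Carol");
    static final List<String> OPERATIONS = List.of("Create", "Modify", "Delete");
    static final List<String> ENTITIES = List.of("Customer", "Order", "SalesRep");

    static final Random RANDOM = new Random();

    // Build one CSV record: user, operation, entity, epoch timestamp,
    // duration, and change count.
    static String generateEvent() {
        return String.join(",",
                USERS.get(RANDOM.nextInt(USERS.size())),
                OPERATIONS.get(RANDOM.nextInt(OPERATIONS.size())),
                ENTITIES.get(RANDOM.nextInt(ENTITIES.size())),
                String.valueOf(Instant.now().toEpochMilli()),
                String.valueOf(RANDOM.nextInt(10) + 1),   // duration in seconds
                String.valueOf(RANDOM.nextInt(4) + 1));   // change count
    }

    public static void main(String[] args) throws InterruptedException {
        // Emit a few events, pausing one to three seconds between them,
        // as the course's generator does (shortened to three events here).
        for (int i = 0; i < 3; i++) {
            System.out.println(generateEvent());
            Thread.sleep(1000L + RANDOM.nextInt(2000));
        }
    }
}
```

The actual generator in the exercise project writes each event to its own file in the destination directory rather than printing it; the record structure is the same.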
Flink uses lazy execution and does not run any code until an execute method or a data write is called. We call the execute method on the streaming environment object to trigger execution of the code. This part, again, is common to all future examples. In the next video, I will show you how to read from a file source stream.
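The setup steps above can be sketched as a job skeleton. The Flink calls here (getExecutionEnvironment, setParallelism, execute) are the real API; the class name, job name, and the commented-out generator thread are assumptions standing in for the exercise code. Note that this is only a frame: Flink will reject an execute() call until at least one source and transformation have been added, which later videos do.

```java
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Skeleton of the common boilerplate described in the video; class and
// generator names are hypothetical.
public class BasicStreamingJobSketch {

    public static void main(String[] args) throws Exception {
        // Flink picks a local environment when run inside an IDE and a
        // cluster environment when submitted as a job.
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // A single task slot processes all events, so they flow in
        // sequence and are easier to follow.
        env.setParallelism(1);

        // Run the file generator on its own thread so files appear while
        // the streaming job consumes them (generator class assumed).
        // new Thread(new FileStreamDataGenerator()).start();

        // ... sources and transformations are added in later videos ...

        // Nothing above runs until execute() is called: Flink builds the
        // job graph lazily and triggers it here. (execute() requires at
        // least one source to be defined.)
        env.execute("Basic Streaming Job");
    }
}
```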
- Streaming with Apache Flink
- Using the DataStream API for basic stream processing
- Working with process functions
- Windowing and joins
- Setting up event-time processing
- State management in Flink