Get an overview of Kafka connectors, their architecture, and why they make building pipelines easier.
- [Instructor] In this chapter, we will explore the concepts and architecture of Kafka Connect. Recall that Apache Kafka is a scalable, distributed pipeline for moving data. Typically, in any implementation, there is a data source like a database, file system or socket. That data needs to be captured and moved to a data sink, which can also be a database or HDFS or any persistent storage.
In order to achieve this, developers write a custom publisher that acquires data from the data source and publishes it to Kafka as topics. They also write a custom subscriber that reads data from the Kafka topics and saves it to the final destination. There is a lot of code that needs to be written, and things like scalability and fault tolerance need to be addressed. But if the source and sink are fairly standard and simple, like moving all new records in a database table to a corresponding sink, why should everyone rewrite the same code? Enter Kafka Connect.
Kafka Connect provides a pre-built, productized platform that can make moving data very simple. So, instead of writing custom code, you simply set up an instance of Kafka Connect to read data from the data source and publish it to the Kafka topics. Similarly, you would set up an instance of Kafka Connect to read data from the Kafka topics and save it to the data sink. There is no coding needed.
You simply configure Kafka Connect publishers and subscribers to provide information about the data source and the data sink, including connectivity and scalability information, and then Kafka Connect will do the magic for you. Now, let's explore more about the Kafka Connect concepts. Kafka Connect is an open-source platform that makes building publishers and subscribers very easy.
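To make the "configuration, not code" idea concrete, here is a minimal sketch of what driving Kafka Connect by configuration looks like. A standalone worker reads a properties file at startup; the file name, broker address, and paths below are illustrative assumptions, not values from the course:

```properties
# worker.properties -- hypothetical standalone worker configuration
bootstrap.servers=localhost:9092                    # Kafka cluster to talk to
key.converter=org.apache.kafka.connect.json.JsonConverter
value.converter=org.apache.kafka.connect.json.JsonConverter
offset.storage.file.filename=/tmp/connect.offsets   # standalone mode tracks its progress here
plugin.path=/opt/connect-plugins                    # where downloaded connector JARs live
```

Note that everything here is configuration; no publisher or subscriber code is written.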
Kafka Connect is totally configuration driven; there is no coding required. It can be set up by an operations engineer without any development effort. Additionally, Kafka Connect has scalability and parallelism features built in, and they can be controlled by configuration alone. Let us now explore some more concepts about Kafka Connect. The first one is the Kafka Connect connector.
Connectors are implementations that understand a specific source or sink type. Kafka Connect provides individual connectors for different source types like JDBC, HDFS, et cetera. Third parties can also build connectors for their own sources and sinks and publish them to the Kafka Connect library. Each connector is actually a JAR file that is linked to Kafka Connect.
You simply download and add the required connectors to the classpath of your Kafka Connect instance. Next come tasks. In order to use Kafka Connect, you will set up publisher tasks and subscriber tasks. Each specific publishing or subscribing activity is configured as a task. Each task configuration contains information about the specific connector to use, information on how to connect to the data source, including port, path, username, et cetera, and also about parallelism.
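As a sketch, a publisher task for the "new records in a database table" scenario mentioned earlier could be configured like this, assuming the Confluent JDBC source connector JAR is on the worker's classpath. The connection details, table name, and topic prefix are hypothetical:

```properties
# jdbc-source.properties -- hypothetical publisher (source) task configuration
name=orders-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector  # which connector to use
connection.url=jdbc:postgresql://localhost:5432/shop           # data source connectivity
connection.user=connect_user
connection.password=secret
table.whitelist=orders               # capture new rows from this table
mode=incrementing                    # detect new records via an ever-increasing column
incrementing.column.name=id
topic.prefix=db-                     # rows are published to the topic "db-orders"
tasks.max=3                          # parallelism: run up to three tasks
```

A matching subscriber task would use a sink connector (for example, an HDFS sink) configured the same way, just pointing at topics to read from and a destination to write to.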
One instance of Kafka Connect can have multiple connectors, and multiple tasks set up for each of the connectors. And finally, there are workers. Workers are instances and threads inside Kafka Connect that actually execute the configured tasks. Tasks are logical configuration elements; workers actually execute them. Workers can be standalone, in which case a single instance of Kafka Connect runs all the connectors and tasks.
Kafka Connect can also be set up in a distributed mode, where there are multiple worker instances that share the tasks. This provides redundancy and parallelism. Kafka Connect can scale horizontally with Kafka. As you keep adding more broker instances, partitions, and topics to Kafka, Kafka Connect can also be configured to take advantage of that. The diagram shows how Kafka Connect can be set up in a distributed mode.
You have three worker instances: worker one, worker two, and worker three. There are two connectors: connector one and connector two. Connector one has tasks one, two, and three, and connector two has tasks one and two. They work with different partitions in the Kafka topics. As you can see, the work is automatically distributed among the different workers by Kafka Connect, and it scales horizontally this way.
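In distributed mode, a task like the JDBC example above is not configured with a local file. Instead, its configuration is submitted as JSON to the REST API of any worker (by default via a POST to the /connectors endpoint on port 8083), and the cluster spreads the resulting tasks across the workers. A sketch of such a payload, with illustrative names and connection details:

```json
{
  "name": "orders-source",
  "config": {
    "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
    "connection.url": "jdbc:postgresql://db-host:5432/shop",
    "table.whitelist": "orders",
    "mode": "incrementing",
    "incrementing.column.name": "id",
    "topic.prefix": "db-",
    "tasks.max": "3"
  }
}
```

If one worker fails, the remaining workers take over its tasks, which is the redundancy described above.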