And today we will look at another feature of Kafka known as Kafka Connect.
What is Kafka Connect?
Kafka Connect is a tool for reliably streaming data between Kafka topics and other data systems. These other systems can be storage systems of various kinds, such as file systems and databases.
For this purpose, Kafka Connect uses components called connectors. A connector either watches an external system, such as a database, for changes, or watches a topic for newly published data, and then acts on what it sees.
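To make the idea of a connector more concrete, here is a minimal sketch of how one is usually described: a name plus a flat map of settings, sent as JSON to the Connect REST API (POST /connectors, port 8083 by default). The connector name, file path and topic below are made-up values for illustration, and the sketch only builds the request body; it does not contact a Connect worker. The FileStream connector ships with Kafka as an example and may need to be added to the plugin path on recent versions.

```python
import json


def build_connector_request(name: str, config: dict) -> str:
    """Return the JSON body for creating a connector via the Connect REST API."""
    # The documented examples pass every config value as a string,
    # so numbers are converted here.
    payload = {"name": name, "config": {k: str(v) for k, v in config.items()}}
    return json.dumps(payload, indent=2, sort_keys=True)


# Hypothetical connector: read lines from a local file and publish them to a topic.
file_source = build_connector_request(
    "demo-file-source",  # assumed connector name
    {
        "connector.class": "org.apache.kafka.connect.file.FileStreamSourceConnector",
        "tasks.max": 1,
        "file": "/tmp/demo-input.txt",  # assumed input path
        "topic": "demo-lines",          # assumed target topic
    },
)

if __name__ == "__main__":
    print(file_source)
```

The point to notice is that a connector is driven by configuration rather than by application code, which is what sets it apart from a hand-written client.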
There are two types of connectors: source connectors and sink connectors. But before we go further into them, you may be wondering: the job of a connector sounds a lot like the job of a consumer, so which one is better, and what exactly is the difference between them? Let’s address these questions first.
Differences between Kafka Connect and a consumer
With Kafka Connect you bridge the gap between topics and external systems that you might not have control over. This way you can persist all the records on a topic to durable storage. Even so, you still need certain privileges on the external system for the connector to work.
A simple consumer, on the other hand, is dead simple and uses consumer groups to consume data in parallel. With a normal consumer, however, you have to take care of a few things yourself, such as offset management and commits. This problem doesn’t exist with connectors: Kafka Connect commits the offsets automatically.
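To show what "offset management and commits" means in practice, here is a toy in-memory model. It is not a real Kafka client; ToyPartition, consume and run_demo are hypothetical names invented for this sketch. The idea it illustrates is the one a hand-written consumer has to get right: commit an offset only after the record has been processed, so that a crash leads to a retry instead of a lost record.

```python
from typing import Callable, List


class ToyPartition:
    """In-memory stand-in for one topic partition plus its committed offset."""

    def __init__(self, records: List[str]) -> None:
        self.records = records
        self.committed = 0  # offset of the next record to read after a restart


def consume(partition: ToyPartition, handle: Callable[[str], None]) -> None:
    """Process records from the committed offset, committing after each success."""
    offset = partition.committed
    while offset < len(partition.records):
        handle(partition.records[offset])  # may raise; the offset is then not committed
        offset += 1
        partition.committed = offset       # manual commit after successful processing


def run_demo() -> List[str]:
    """Simulate one crash in the middle of processing and a restart afterwards."""
    partition = ToyPartition(["a", "b", "c"])
    seen: List[str] = []
    failed_once = False

    def handler(record: str) -> None:
        nonlocal failed_once
        if record == "b" and not failed_once:
            failed_once = True
            raise RuntimeError("simulated crash while processing 'b'")
        seen.append(record)

    try:
        consume(partition, handler)
    except RuntimeError:
        pass  # the consumer "crashed"; the committed offset still points at 'b'
    consume(partition, handler)  # restart: resumes from the committed offset
    return seen


if __name__ == "__main__":
    print(run_demo())
```

With a connector, this bookkeeping is done by the Connect framework, so you do not write it yourself.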
It turns out that in terms of efficiency, Kafka Connect and a Kafka consumer are about the same. The difference is that Connect not only handles the commits on its own but, with a connector that supports it, can also provide exactly-once delivery. Achieving exactly-once with a normal consumer is very difficult.
Now let’s get back to connectors and explore both of them.
Kafka Source Connector
The main task of a source connector is to read changes from an external system, such as a database, and stream that data to a Kafka topic. Typically, each table in the database gets its own topic.
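As a rough sketch of the table-to-topic mapping, the example below uses Confluent's JDBC source connector as an assumed choice (it is a separate plugin, not part of Apache Kafka). With that connector, the topic name is the configured topic.prefix followed by the table name. The connection URL, table names, prefix and the id column are placeholders chosen for illustration.

```python
from typing import Dict, List


def jdbc_source_config(connection_url: str, tables: List[str], topic_prefix: str) -> Dict[str, str]:
    """Build a settings map for a JDBC source connector (Confluent plugin assumed)."""
    return {
        "connector.class": "io.confluent.connect.jdbc.JdbcSourceConnector",
        "connection.url": connection_url,
        "table.whitelist": ",".join(tables),  # tables to read from
        "mode": "incrementing",               # detect new rows via a growing column
        "incrementing.column.name": "id",     # assumed: every table has an 'id' column
        "topic.prefix": topic_prefix,
        "tasks.max": "1",
    }


def topics_for(config: Dict[str, str]) -> Dict[str, str]:
    """Map each configured table to the topic it is expected to feed."""
    prefix = config["topic.prefix"]
    return {table: prefix + table for table in config["table.whitelist"].split(",")}


# Placeholder database and tables, for illustration only.
source_config = jdbc_source_config(
    "jdbc:postgresql://localhost:5432/shop", ["users", "orders"], "pg-"
)

if __name__ == "__main__":
    for table, topic in topics_for(source_config).items():
        print(f"{table} -> {topic}")
```

Here the users table would feed the topic pg-users and the orders table the topic pg-orders.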
Kafka source connectors are built using the Connect source API, which in turn is built on top of the Producer API. Source connectors are easy to set up.
The only limitation of source connectors is that if you can’t find a suitable connector for the proprietary system you are using, you have to write your own. Writing a source connector is easy, but debugging it is very hard.
Kafka Sink Connector
A sink connector reads data from a topic and stores it in an external system such as a database. For each topic it reads, it creates a separate table. To store data from a topic in a table, you need a converter, such as the JSON converter or the Avro converter. The JSON converter is already enabled by default. If you are using Avro, make sure the Schema Registry is running alongside your Kafka cluster.
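The converter choice described above can be sketched as configuration. The example assumes Confluent's JDBC sink connector and Avro converter (both separate plugins) and a Schema Registry at http://localhost:8081; the helper names, topic and connection URL are hypothetical. The only logic it shows is that Avro cannot be configured without a Schema Registry URL, while JSON needs nothing extra.

```python
from typing import Dict, Optional


def value_converter_settings(fmt: str, schema_registry_url: Optional[str] = None) -> Dict[str, str]:
    """Return converter settings for 'json' or 'avro' record values."""
    if fmt == "json":
        return {
            "value.converter": "org.apache.kafka.connect.json.JsonConverter",
            # The sink needs a schema to create table columns from.
            "value.converter.schemas.enable": "true",
        }
    if fmt == "avro":
        if not schema_registry_url:
            raise ValueError("the Avro converter needs a Schema Registry URL")
        return {
            "value.converter": "io.confluent.connect.avro.AvroConverter",
            "value.converter.schema.registry.url": schema_registry_url,
        }
    raise ValueError(f"unsupported format: {fmt}")


def jdbc_sink_config(topic: str, connection_url: str, converter: Dict[str, str]) -> Dict[str, str]:
    """Build a settings map for a JDBC sink connector (Confluent plugin assumed)."""
    config = {
        "connector.class": "io.confluent.connect.jdbc.JdbcSinkConnector",
        "topics": topic,
        "connection.url": connection_url,
        "auto.create": "true",    # let the connector create the table if it is missing
        "insert.mode": "insert",
    }
    config.update(converter)
    return config


# Placeholder values, for illustration only.
sink = jdbc_sink_config(
    "pg-orders",
    "jdbc:postgresql://localhost:5432/warehouse",
    value_converter_settings("avro", "http://localhost:8081"),
)

if __name__ == "__main__":
    for key in sorted(sink):
        print(f"{key}={sink[key]}")
```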
It has the same limitation as source connectors.
But just running Kafka connectors is not enough. You also need to configure your external system so that the connector can work with it. In the next article, we will set up a sink connector with a PostgreSQL database server. I am sure you will be able to follow each part, but if you don’t, there are several articles that I will link at the end of the article to help you.
Before we wrap this up, there are a few points I want to emphasize.
You need to understand that Kafka Connect has a very limited scope. Its main focus is streaming data to and from Kafka topics. So if you are new to Kafka, or you don’t have control over the code of the external system, then using Connect is a good choice.
But if you have full control over your code and some experience with Kafka, then go with a Kafka client (consumer). Keep in mind, though, that writing all the consumer code needed to publish data streams is a very time-consuming job.
So if your use of Kafka is very limited and you process one message at a time, you can simply set enable.auto.commit to true and avoid the tricky parts of using a consumer.
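For reference, a minimal sketch of the consumer settings involved might look like the following. The property names are standard Kafka consumer settings; the broker address and group id are placeholders, and the two helper functions are hypothetical and only render the settings as text. No consumer is started.

```python
from typing import Dict


def consumer_properties(bootstrap_servers: str, group_id: str, auto_commit: bool = True) -> Dict[str, str]:
    """Return a minimal set of consumer settings with auto-commit switched on or off."""
    return {
        "bootstrap.servers": bootstrap_servers,
        "group.id": group_id,
        "enable.auto.commit": "true" if auto_commit else "false",
        "auto.commit.interval.ms": "5000",  # how often offsets are committed (Kafka's default)
    }


def to_properties_text(props: Dict[str, str]) -> str:
    """Render the settings in key=value form, one per line."""
    return "\n".join(f"{key}={value}" for key, value in sorted(props.items()))


# Placeholder broker address and group id, for illustration only.
simple_consumer = consumer_properties("localhost:9092", "demo-group")

if __name__ == "__main__":
    print(to_properties_text(simple_consumer))
```

With auto-commit on, offsets are committed periodically in the background, so a record can be processed twice or skipped if the consumer fails at the wrong moment. That is usually acceptable for the simple use case described here.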
Here is a blog post that gives insight into how to build large data pipelines with Kafka Connect: https://www.confluent.io/blog/announcing-kafka-connect-building-large-scale-low-latency-data-pipelines/