Learn how streaming analytics is the new path forward in data analytics and warehousing
- [Narrator] ETL is dead, long live streams. Or at least that's the rallying cry of a lot of the folks who are adopting Kafka for their organizations. Now to introduce this topic, I want to first take a look at the typical data pipeline used in data warehousing, known as Extract, Transform, and Load, the ETL process. So our process starts with what I like to classify as data providers. The data from these providers is typically loaded nightly, or sometimes more frequently, into a staging environment.
That staging environment could be a database like SQL Server or Oracle, or in more modern systems, something like HDFS, the Hadoop Distributed File System. This is your back-office location where the data can be prepared and shipped over to your data warehouse, where your analysts and data scientists will use it. Other applications can also leverage this data for things like marketing efforts, business monitoring, or even strategic decision making. Now in the Kafka world, we use streaming instead of this ETL batch process. The way streaming works is that all of your data from your data providers, or what we'll call producers in Kafka, comes into your streaming platform, and as the data is being written and ingested into the platform, operations are performed on it.
A simple operation may be to count the total number of orders, or, say, the total number of likes on a Facebook post. All of these things happen in real time, or with minimal latency. The output of the streaming operations is then sent to your applications; in Kafka terms, we call these consumers. Consumers are the things that listen for events that have occurred. So in the Kafka world, we can think of our data providers as producers, the things that write data to our cluster.
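As a rough illustration of that kind of real-time counting, here is a minimal Kafka Streams sketch in Java. The topic names ("orders", "order-counts") and the string key/value types are assumptions made for the example, not something defined in the course.

```java
import java.util.Properties;

import org.apache.kafka.common.serialization.Serdes;
import org.apache.kafka.streams.KafkaStreams;
import org.apache.kafka.streams.StreamsBuilder;
import org.apache.kafka.streams.StreamsConfig;
import org.apache.kafka.streams.kstream.KStream;
import org.apache.kafka.streams.kstream.KTable;
import org.apache.kafka.streams.kstream.Produced;

public class OrderCountStream {
    public static void main(String[] args) {
        // Basic configuration: where the cluster lives and how keys/values are serialized.
        Properties props = new Properties();
        props.put(StreamsConfig.APPLICATION_ID_CONFIG, "order-count-app");
        props.put(StreamsConfig.BOOTSTRAP_SERVERS_CONFIG, "localhost:9092");
        props.put(StreamsConfig.DEFAULT_KEY_SERDE_CLASS_CONFIG, Serdes.String().getClass());
        props.put(StreamsConfig.DEFAULT_VALUE_SERDE_CLASS_CONFIG, Serdes.String().getClass());

        StreamsBuilder builder = new StreamsBuilder();

        // Read each order event as it arrives on the hypothetical "orders" topic.
        KStream<String, String> orders = builder.stream("orders");

        // Group by key (for example, a customer or region ID) and keep a running count.
        KTable<String, Long> orderCounts = orders.groupByKey().count();

        // Continuously publish the updated counts to another hypothetical topic.
        orderCounts.toStream()
                .to("order-counts", Produced.with(Serdes.String(), Serdes.Long()));

        // Start processing; the counts update as new orders stream in.
        KafkaStreams streams = new KafkaStreams(builder.build(), props);
        streams.start();
    }
}
```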
They send data in, and on the other side, where we have the use cases, we have our consumers, the things that are actually using the data. One interesting note about how Kafka works is that these consumers can also write data back to another part of Kafka, so consumers can become producers as well. They can do things like aggregating certain elements, or sending new messages or events to a different Kafka stream, which would then trigger other consumers to take action.
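To make that consumer-becomes-producer idea concrete, here is a minimal sketch using the plain Java consumer and producer clients. The topics ("orders", "orders-enriched") and the trivial transformation are purely illustrative assumptions.

```java
import java.time.Duration;
import java.util.Collections;
import java.util.Properties;

import org.apache.kafka.clients.consumer.ConsumerRecord;
import org.apache.kafka.clients.consumer.ConsumerRecords;
import org.apache.kafka.clients.consumer.KafkaConsumer;
import org.apache.kafka.clients.producer.KafkaProducer;
import org.apache.kafka.clients.producer.ProducerRecord;

public class EnrichAndForward {
    public static void main(String[] args) {
        Properties consumerProps = new Properties();
        consumerProps.put("bootstrap.servers", "localhost:9092");
        consumerProps.put("group.id", "order-enricher");
        consumerProps.put("key.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");
        consumerProps.put("value.deserializer", "org.apache.kafka.common.serialization.StringDeserializer");

        Properties producerProps = new Properties();
        producerProps.put("bootstrap.servers", "localhost:9092");
        producerProps.put("key.serializer", "org.apache.kafka.common.serialization.StringSerializer");
        producerProps.put("value.serializer", "org.apache.kafka.common.serialization.StringSerializer");

        try (KafkaConsumer<String, String> consumer = new KafkaConsumer<>(consumerProps);
             KafkaProducer<String, String> producer = new KafkaProducer<>(producerProps)) {

            // Act as a consumer of the incoming events...
            consumer.subscribe(Collections.singletonList("orders"));

            while (true) {
                ConsumerRecords<String, String> records = consumer.poll(Duration.ofMillis(500));
                for (ConsumerRecord<String, String> record : records) {
                    // ...and as a producer of derived events, written to a second topic
                    // where other consumers can react to them.
                    String enriched = record.value().toUpperCase();
                    producer.send(new ProducerRecord<>("orders-enriched", record.key(), enriched));
                }
            }
        }
    }
}
```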
Some of the ways we can pull data in are by writing directly to our Kafka cluster from our apps, or by connecting to existing apps using a connector. This is nice because, in a corporate environment, whether it's a high-tech startup or a manufacturing company, you most likely have software and systems that help run your business, which may have been in place for a long time and can't easily be updated to work with Kafka. These connectors do the work for you by reaching out, finding events or changes in those systems, and pulling them into the pipeline.
Traditional data platforms such as relational databases are a good example. Imagine you have a customer address table in your database, which is updated by a legacy app that can't be changed. You can set up Kafka to look for any time there's a change made, and when those changes occur, they could get pulled into Kafka in realtime, then process them and treat them just like any of the other data that's coming into your streaming platform.
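As a rough sketch of what such a connector setup could look like, here is an illustrative Kafka Connect configuration, assuming the Confluent JDBC source connector is installed. The connection details, table name, column name, and topic prefix are hypothetical, and exact property names can vary by connector and version.

```properties
# Hypothetical source connector watching a legacy customer address table.
name=customer-address-source
connector.class=io.confluent.connect.jdbc.JdbcSourceConnector
connection.url=jdbc:sqlserver://legacy-db:1433;databaseName=crm
connection.user=kafka_connect
connection.password=********
# Only pull the one table the legacy app writes to.
table.whitelist=customer_address
# Detect changes via a last-updated timestamp column (assumed to exist).
mode=timestamp
timestamp.column.name=last_updated
# Changed rows land on a topic such as legacy-crm-customer_address.
topic.prefix=legacy-crm-
```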