Learn who is using Kafka and for what purposes.
- [Instructor] Kafka has become practically the default for streaming analytics, especially for high-tech companies or companies dealing with large volumes of data. One big company using Kafka today, surprisingly, is Walmart. Walmart, the biggest retailer in the United States, possibly the world, handles billions of transactions every single day. All of those transactions need to stream into a data platform; some of them need to be processed and handled immediately, while others are used for analytics later on.
So there are cases where very large retailers like Walmart have lots of data that they need to process in real time, and Kafka suits those needs. Cisco Systems is another large tech company that does everything from cameras to networking gear to software, and they also have lots of transactions happening across all the different systems they use to run their business. Kafka may not be central to everything at Cisco, but it is definitely helping them combine a lot of those different data streams and create a consistent, accurate understanding of how their business is running.
Netflix, of course, is a high-tech company that uses data to the fullest extent, and it uses Kafka pretty heavily. Think about every time you watch a movie on Netflix, or watch just to a certain point in that movie: they record all of these things as events through Kafka, and they use that data to improve their platform, whether that means making better recommendations or just providing a better user experience by remembering where you left off in that movie. Some other big companies out there, like PayPal, process millions, potentially billions, of transactions every month, and Kafka is helping them ensure that those transactions are consistent.
Meaning that they are accurate and timely when they are processed, which helps prevent fraud. This is really big for the financial industry. Spotify is a streaming music service that uses Kafka, and just like Netflix, as you're listening to songs, skipping through playlists or recommendations, or liking things, all of those events, those changes, are being captured in Kafka and then used to provide a better experience for you as a user of Spotify.
Now Uber is an interesting use case, in that their product is very real time: it's connecting riders and drivers, and part of what's happening there is finding out who is where, and where they're going, to optimize the match. All of that data comes in through Kafka streams, and it's then used in real time to find that answer, to connect the rider and the driver, so this is a critical part of their system. They've actually given quite a bit back to the open source community in the form of additional systems that work with Kafka and make it easier to manage at scale.
Now, I couldn't leave off without talking about LinkedIn, of course. There are many companies using Kafka, but LinkedIn is the one that actually invented it. They're the first ones that created this platform, which is now a top-level Apache project, meaning it's an open source project with a large community behind it. We'll take a look at some of the innovations that LinkedIn has offered to the Kafka community in recent years, but if you go back, they're the ones that actually created it to start with. There's a great article from Jay Kreps, one of Kafka's creators, about what every software engineer should know about real-time data. If you scroll through it you'll find lots of great info; in fact, we'll cover a lot of it in the course, so you don't have to jump off and read it now. But as a reference point, it's going to be really important and something to keep in your back pocket for understanding why we're using real-time streaming systems like Kafka.
But what do these companies use Kafka for? I mentioned a few use cases, but the way a lot of people first characterize Kafka is as a messaging system. In a messaging system, messages are sent out, and other applications listen for those messages. An application takes one of them and does something with it. It may also send out additional messages, like, "Hey, I'm done, I did what I was going to do, and now it's someone else's turn." This is known as a loosely coupled system, and it's a really popular architecture with microservices when you're building large-scale platforms.
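To make the loose coupling concrete, here is a minimal in-memory sketch of the publish/subscribe pattern that a messaging system like Kafka implements. This is not the real Kafka API; the `MiniBus` class, topic name, and message shape are all hypothetical stand-ins chosen for illustration.

```python
from collections import defaultdict

class MiniBus:
    """A tiny in-memory stand-in for a message bus: producers publish
    to named topics, and consumers subscribe to topics. Neither side
    knows about the other -- they only share the topic name."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, topic, handler):
        self.subscribers[topic].append(handler)

    def publish(self, topic, message):
        for handler in self.subscribers[topic]:
            handler(message)

bus = MiniBus()
received = []

# The consumer registers interest in "orders"; it never calls the producer.
bus.subscribe("orders", lambda msg: received.append(msg))

# The producer publishes without knowing who (if anyone) is listening.
bus.publish("orders", {"order_id": 1, "status": "placed"})
print(received)  # → [{'order_id': 1, 'status': 'placed'}]
```

The point of the sketch is the decoupling: you could add a second subscriber, or remove the first, without touching the publishing code at all, which is exactly the property that makes this architecture popular with microservices.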
Now, one of the key features here is that it's a good way to transfer data between systems without tying them directly to each other, so there are no interdependencies; the messaging platform acts as a sort of bus between the two systems. Another popular use case for streaming analytics is web analytics. When people visit your website, click on certain ads, or go from page to page, you'll want to capture all of those clickstream events. The path they took, the actions they performed, all of these can be streamed into your Kafka cluster, and you can even augment the streams in real time based on their activity inside the app.
For companies operating in a more traditional sense, like a manufacturing facility, you can use Kafka to monitor all the machines in your plant, or more likely the sensors on those machines, and process that data in real time to give you some understanding of how the line is operating. You can even apply machine learning algorithms here to figure out if there's going to be a problem downstream, so you can try to predict it and prevent it from occurring in the first place. Another common case is log collection, where different applications on different systems, which may be distributed all over the world, are generating the same types of log data.
Think access requests to your website or something like that. Instead of having all of these logs spread out all over the place, trying to combine certain ones and analyze them, and giving your data scientists headaches because they don't know which ones are the right ones to pull in, you can combine them all with Kafka into a single log collection mechanism. And the last one, as I mentioned, which we're going to talk a little bit about, is stream processing. Stream processing is the idea that instead of batching your data, taking chunks every night or every hour, you're handling these events and processing them as they occur.
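The log collection idea can be sketched in a few lines: several distributed servers each produce their own time-ordered access logs, and collecting them means merging those streams into a single time-ordered stream, which is what reading from one Kafka topic gives your analysts instead of many scattered files. The server names and log records below are made up for illustration.

```python
import heapq

# Hypothetical access-log records from three distributed web servers,
# each already sorted by timestamp locally: (timestamp, server, request).
server_a = [(1, "a", "GET /home"), (4, "a", "GET /cart")]
server_b = [(2, "b", "GET /home"), (3, "b", "POST /login")]
server_c = [(5, "c", "GET /home")]

# Merge the per-server streams into one globally time-ordered stream,
# the single collection point the transcript describes.
merged = list(heapq.merge(server_a, server_b, server_c))
for ts, server, request in merged:
    print(ts, server, request)
```

`heapq.merge` is a convenient stand-in here because, like a log aggregator, it consumes several already-sorted inputs incrementally rather than loading everything and re-sorting.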
Think of a time you've been traveling and tried to use your credit card. As you swipe it, attempting to take money out of an ATM or buy something from a merchant, there's a check going on: if that transaction is abnormal, if it falls outside your normal usage of the card, it may trigger an event which then declines the card, sends you a phone call, or pops an alert up on your phone. All of these things happen because that event, you swiping your card in an abnormal place like a different country, fires off an event that they were listening for.
Now, if you only process the data every hour or every day, you may not know what's going on, or those things may not be caught until the next time that job runs, and that could be really dangerous, especially for companies that process transactions. So sometimes, often even, the investment it takes to switch to a streaming platform like Kafka will be well worth it, saving you money in the long run by detecting these things up front and preventing them from happening in the first place.
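The credit-card scenario above can be sketched as a per-event rule that a stream processor runs the moment each swipe arrives, instead of waiting for a nightly batch. The function name, event fields, and the single "different country" rule are all simplified assumptions for illustration; real fraud detection uses far richer signals.

```python
def check_swipe(event, usual_country):
    """Flag a swipe as suspicious the instant it arrives if it comes
    from outside the cardholder's usual country -- the kind of
    per-event check a streaming pipeline can run, rather than a
    batch job that only catches it hours later."""
    if event["country"] != usual_country:
        return "alert"  # decline the card, call, or push a notification
    return "ok"

# Hypothetical stream of swipe events for one card.
events = [
    {"card": "1234", "amount": 40.0, "country": "US"},
    {"card": "1234", "amount": 500.0, "country": "RU"},  # abnormal location
]

results = [check_swipe(e, usual_country="US") for e in events]
print(results)  # → ['ok', 'alert']
```

The contrast with batch processing is that the second event is flagged while the cardholder is still standing at the terminal, not when a scheduled job runs the next morning.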