Streaming is a relatively new concept in building big data pipelines and big data solutions, and in this video Lynn Langit walks you through some of the newer services offered by Amazon. You will learn about the scope, size, speed, and analysis of streaming data through Amazon's Kinesis and Firehose data pipelines.
- [Voiceover] In this section, we're going to take a look at some of the newer services offered by Amazon, and these services facilitate a new big data scenario, that is, streaming. So streaming is a relatively new concept in building up big data pipelines and big data solutions, so I'm going to take a moment and explain it before we dive in and work with the offerings on Amazon. When we consider streaming, we want to think of it as opposed to what we're using now, which is batching. So let's start with the idea of scope. When we're batching, or putting in multiple records at one time, we're scoped to the batch.
Streaming allows us more flexibility. So when we stream, we can use data within a rolling time window, or even the most recent data record. So why would this be important or interesting for scenarios? Well, as more and more data providers come online, particularly IoT, they can be sending data at a volume that we've not had to deal with before. So we might want to just grab some of the telemetry or event records out from the rolling window so that we can get information about our device more quickly.
This goes to the next difference between streaming and batching, which is size. We can work with a single record or an important concept called micro-batching of a few records, and what that means is from our window, the stream of data, we can mark the beginning and the end, and we can grab out a small amount of records and then this goes to the next vector, which is speed. We can then pull these records into other services, whether it's EC2, to display this on a website, or we want to pass it through for more processing, and then we can look at it in terms of seconds or even milliseconds, near real time.
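The idea of micro-batching described above can be sketched in a few lines of plain Python. This is an illustration of the concept only, not Amazon's implementation: we mark off small groups of records from an incoming stream so each group can be handed on for further processing.

```python
def micro_batches(records, batch_size):
    """Yield small, fixed-size batches from a stream of records.

    Illustrative sketch: a real streaming service such as Kinesis marks
    a beginning and end within a rolling window of the stream; here we
    simply group any iterable of records into micro-batches.
    """
    batch = []
    for record in records:
        batch.append(record)
        if len(batch) == batch_size:
            yield batch
            batch = []
    if batch:  # flush any leftover records as a final, smaller batch
        yield batch

# Twelve hypothetical telemetry events, micro-batched five at a time
events = list(range(12))
print(list(micro_batches(events, 5)))  # [[0, 1, 2, 3, 4], [5, 6, 7, 8, 9], [10, 11]]
```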
And the last consideration around streaming is the type of analyses that are going to be done on this information. It's really common in these types of scenarios, particularly IoT, where I've been doing a lot of work over the past 12 months, where you just want the status of the device. For example, for your thermostat you want to know what is the temperature. For your sprinkler system, you want to know what is the moisture sensitivity in the ground and whether or not your sprinklers are turned on. You just want to know a yes or a no. You don't really need anything complex in terms of processing.
Now, that being said, you can do complex processing over streaming, but I'm kind of getting ahead of myself. So let's continue learning about what streaming is. So as we're thinking about this, I want to bring in the different offerings that Amazon has. The one that we're going to focus on is something called Kinesis and a new version of this, called Firehose. In addition to this, they've announced something called Kinesis Analytics that we'll talk briefly about. These are what we're going to focus on in this section of movies, because these are the core products and offerings from Amazon.
However, in addition to this, with my real-world customers I will sometimes build streamed and batch pipelines. And there is a product in Amazon called Data Pipeline that can be used to pull these two things together. And also, for completeness, some of the scenarios around streaming are true big data scenarios. They have data in the terabytes or petabytes, and those can be implemented with different types of streaming technologies to make better use of the volume for those products.
And those often work with Hadoop in terms of open-source libraries such as Apache Storm and Spark. We're not going to focus on these here because many of the solutions that I build are on what I call the medium data or the large data, rather than the huge data scenarios, 'cause they're newer to the companies, they just don't have that much data coming in, so the Amazon products that are offered for streaming, Kinesis and Firehose, work just great and they have reduced complexity as opposed to those that are designed for the huge volume, the Hadoop streaming, for example.
Now to get us thinking about this, I have a picture so that we can kind of visualize what this is going to look like. So we have some stream producers coming in, and that's on the left side of this picture. We can have just servers, we can have mobile clients, we can have desktop clients, we can have machine to machine, and that's done by the EC2 instances. The thing that I also work with quite frequently is IoT devices. What facilitates the use of this data in a faster, more near-real-time way is Amazon Kinesis streams.
Now I really like the way this is shown. It's this huge pipe that has smaller pipes inside of it and these are called shards. This is a key aspect of working with Amazon's Kinesis streaming. It's to properly allocate the number of shards. And we'll see that when we get into working with the product. Coming out of the Kinesis streams in the Amazon ecosystem, you create Kinesis applications and they're designed to be hosted on EC2 instances. And then, most commonly, you will pass the data on from the ingest that comes through the stream out to other Amazon services that are shown on the right.
Now, we've got a list of data services here, but we can also do further processing on them. For example, if we wanted to just look at telemetry information that was outside the range of normal. For example, if we had sprinkler systems and the watering was way outside the range of normal and we wanted to generate alerts, we could do more computation on these and that could be done with more EC2, or Elastic Beanstalk, or Lambda, if you're remembering back to when we talked about computation in an earlier movie. Now we'll see when we actually work with one of the Kinesis products that one of the key aspects is setting up the proper size of your big pipe, if you will, as we saw in the last picture there.
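The sprinkler example above, where we generate alerts on telemetry outside the normal range, amounts to a simple filter over each micro-batch. Here's a minimal sketch of that check in Python; the device names and moisture range are hypothetical, and in practice this logic would run inside a Kinesis application on EC2, Elastic Beanstalk, or Lambda.

```python
def out_of_range(readings, low, high):
    """Return readings whose value falls outside the normal range.

    Each reading is a (device_id, value) pair. This is the kind of
    lightweight check a Kinesis consumer application might apply to
    each micro-batch before raising alerts.
    """
    return [(device, value)
            for device, value in readings
            if value < low or value > high]

# Hypothetical soil-moisture telemetry; normal range is 20-60 percent
readings = [("sprinkler-1", 35), ("sprinkler-2", 72), ("sprinkler-3", 12)]
print(out_of_range(readings, low=20, high=60))  # [('sprinkler-2', 72), ('sprinkler-3', 12)]
```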
And to do that, you want to set up the right number of shards. Now Amazon realizes that these concepts are new, so they actually have a shard calculator that we'll be working with, but just so we can understand when we're getting into this, each shard allows for one megabyte per second of write capacity and two megabytes per second of read. So you're gonna need to do some data forecasting when you're working with streaming solutions, in terms of what is the size of each record or each data point that's coming in, and what is the velocity that you need in terms of a business requirement.
It's a different kind of requirement, and so I wanted to just share that with you. Also, Amazon has a formula that they recommend: the number of shards is the maximum of the incoming write bandwidth in kilobytes per second divided by 1,000, and the outgoing read bandwidth in kilobytes per second divided by 2,000.
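That shard formula can be worked through in a few lines. This sketch just translates the formula quoted above into Python, rounding up because you can only provision whole shards; the bandwidth numbers in the example are hypothetical forecasts.

```python
import math

def number_of_shards(incoming_write_kb, outgoing_read_kb):
    """Estimate the Kinesis shard count from forecast bandwidth.

    Each shard supports 1 MB/s (1,000 KB/s) of write capacity and
    2 MB/s (2,000 KB/s) of read capacity, so we take the maximum of
    the two ratios and round up to a whole shard.
    """
    return max(math.ceil(incoming_write_kb / 1000),
               math.ceil(outgoing_read_kb / 2000))

# Forecast: producers write 4,500 KB/s, consumers read 6,000 KB/s
print(number_of_shards(4500, 6000))  # 5 (write side dominates: 5 > 3)
```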
Starting with top-level categories of storage, data, compute, and services, Lynn guides you through planning your ideal AWS architecture, providing service demos using the AWS Console, command-line interface, and other tools. Learn when to use which service for which business case: Docker or Lambda? DynamoDB or Aurora? She shows how to script creation of services such as S3 buckets and EC2 instances, create and populate a managed data warehouse, and develop a data processing pipeline that works for you. Chapter 6 covers the AWS Internet of Things (IoT) services.
These exercises can help you build proof-of-concepts, minimum viable products, and deployable solutions to scale and support big data initiatives at your company.
- Setting up your AWS account
- Using AWS tools
- Defining your minimum viable products
- Choosing compute, storage, and data services
- Using S3, EC2, or Docker for website hosting
- Developing an AWS website
- Using a data warehouse
- Developing a data processing pipeline
- Developing an Internet of Things project with AWS