See the recently released serverless SQL-on-S3 services, S3 Select and Athena, in action. Preview serverless ETL with AWS Glue.
- [Instructor] It's a really interesting time in terms of planning data scalability in the Amazon ecosystem, particularly when you're going to have to design data systems that support SQL, or even ANSI SQL, queries. What Amazon is doing is recognizing the reality of the situation for customers. And that reality is, as I mentioned in a previous movie, that most of their customer base, or their potential customer base, is using relational database servers for operational workloads, and SQL-based servers for data warehousing.
There are exceptions to this of course, but that is the typical architecture. That being said, there are a large number of customers that are really interested in getting the cloud-scale economies of using a serverless pattern for scaling data, so Amazon is trying to meet both of those customer needs by providing a menu of products to pick from. This can be a little confusing, and I think what helps, to start to understand the possibilities, is to look at some of these new services. So we're gonna actually do that. So, just from a review perspective, the server-based services are, of course, RDS, which has been around for many years and, in its Aurora form, the scaled MySQL-compatible service, is the most popular in the data suite of Amazon services, because most customers are simply moving their data in.
Then there's Redshift, which is the data warehouse-optimized service, and on the EMR platform there is support for Spark SQL, if needed. Now, on the serverless front, very new as of this recording is the general availability of a service called S3 Select, which I'm actually gonna demo in a minute. Relatively new, less than a year old, is a service called Athena, as well as a service called Redshift Spectrum. All three of these are serverless, and they're designed to perform SQL queries on S3 data. There's also a new integrated service, which I call ETL-as-a-service.
That's called AWS Glue. So, because I'm assuming that for most of this audience these services will be new, we'll do a quick look at how they basically operate, so you can get an understanding of where they fit in your data-scaling possibilities. So, here we are in the console, and we're in an S3 bucket. Let's start with the new S3 Select service. The concept behind S3 Select, and I'm gonna open up this sample set of baseball data, is that it is useful for businesses to be able to do quick views of their data in S3.
And as companies start to test out moving towards a data lake, they'll be working with CSV files, or structured text files, in S3. There are other formats supported, but CSV is just one to start with. So I'm actually gonna select one of these files, then click on this drop-down by More, and this is also exposed via the API, but it's just simple to do it from the console, and I'm gonna click on this new option, Select From. So you can see, S3 Select allows you to extract records from a CSV or JSON file, as of this recording, using SQL expressions.
It does support gzip and server-side encrypted files. And you can use the console to take a quick look, basically to extract up to 40 megabytes of records from source files up to 128 megs. If you wanna work with larger files, you're gonna have to use the API for this. And if you wanna use more complex SQL expressions, you use a different service, Athena, which we'll look at next. So you can see that it's detected that this file is CSV and that it's comma delimited. Now, I happen to know it has a header row, it's not encrypted, and there's the total file size.
And then I could turn the flag on if the file were compressed. First, I'm going to show a file preview, and I can see it's previewed properly. Then I'm gonna click Next. Once I scroll down, here's a sample SQL expression in the editor, select * from s3object s limit 5, and I'll scroll down and run the SQL. And you can see here's my result. Now, if I click on the sample SQL expressions, these are designed to help customers understand what subset of the ANSI SQL language is supported in S3 Select.
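As mentioned, everything the console does here is also exposed via the API. Below is a minimal sketch of what that API call looks like with boto3; the bucket and key you pass in are whatever object you selected in the console (nothing here is the demo's actual bucket), and the function is just an illustration, not the course's code.

```python
def select_csv(bucket, key, expression):
    """Run an S3 Select SQL expression against a CSV object in S3.

    A minimal sketch of the S3 Select API call; bucket, key, and the
    expression are placeholders supplied by the caller.
    """
    import boto3  # assumed available in the environment where this runs

    s3 = boto3.client("s3")
    resp = s3.select_object_content(
        Bucket=bucket,
        Key=key,
        ExpressionType="SQL",
        Expression=expression,  # e.g. "select * from s3object s limit 5"
        InputSerialization={"CSV": {"FileHeaderInfo": "USE"}},  # file has a header row
        OutputSerialization={"CSV": {}},
    )
    # The response is an event stream; collect the Records payloads.
    chunks = []
    for event in resp["Payload"]:
        if "Records" in event:
            chunks.append(event["Records"]["Payload"].decode("utf-8"))
    return "".join(chunks)
```

The gzip and server-side encryption support mentioned above would be expressed through additional `InputSerialization` settings on the same call.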
So for example, this is if I wanna select a subset of my column names. In my column names I have playerID and yearID. So let me just click Append SQL, and then select playerID and yearID. I'm treating the column names as case sensitive; I don't actually know that they are, but I'm assuming so because of the environment, and it's always a safe bet. And click Run SQL. And there is our result. So why would you use this? What's the purpose of this, particularly as it relates to data scalability? The idea here is that if you want to work with some small subset of data that's stored in your data lake, or your S3, you can use S3 Select quickly and easily, to take a quick look at the data.
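The projection query above runs inside AWS, but what it does is easy to illustrate locally. Here's a sketch using Python's built-in sqlite3 as a stand-in for S3 Select: the playerID and yearID columns come from the demo, while the data rows are invented for illustration.

```python
import csv
import io
import sqlite3

# A few invented rows shaped like the baseball CSV from the demo
# (playerID and yearID are the columns selected in the console).
sample_csv = """playerID,yearID,teamID
aardsda01,2004,SFN
aaronha01,1954,ML1
"""

rows = list(csv.DictReader(io.StringIO(sample_csv)))

# Load into an in-memory table and run the same projection S3 Select ran.
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE s3object (playerID TEXT, yearID INTEGER, teamID TEXT)")
conn.executemany(
    "INSERT INTO s3object VALUES (?, ?, ?)",
    [(r["playerID"], r["yearID"], r["teamID"]) for r in rows],
)
result = conn.execute("SELECT playerID, yearID FROM s3object").fetchall()
print(result)  # [('aardsda01', 2004), ('aaronha01', 1954)]
```

The point is the same as in the console: a column subset means you pull back, and pay for, less data.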
Now, as it says here, if you want to analyze data with more complex expressions, you use Athena. Working with Athena will take you to a different console, so I've actually opened and pre-set-up a couple things here. Inside of Athena, what you have is ANSI SQL queryability, so serverless queries on top of information that's stored in files in S3, but in order to be able to run even this simplest possible query here, select star, you need to first define a tabular structure on top of the source files.
And you can see here I have this in the database. "Database" really is a little bit of a misnomer; all this is, is metadata. The files are actually stored in S3. You can see that I can create something manually here, so I clicked Manual, then I can create a database, and then add a table to the database, and so on and so forth. Or I could use another service called AWS Glue, and the reason that you would use that, rather than manually creating a table structure here, is because Glue would save the metadata of the tabular structure for use by services other than Athena.
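That manual "create a table" step boils down to a CREATE EXTERNAL TABLE statement pointing at an S3 location, which you can also submit through the Athena API. The sketch below uses hypothetical names throughout: the database, table, columns, and S3 paths are placeholders, not the ones from the demo.

```python
# Hypothetical DDL: a tabular structure defined on top of files in S3.
# The table holds only metadata; the data stays in the S3 location.
ddl = """
CREATE EXTERNAL TABLE IF NOT EXISTS demo_db.access_logs (
    request_timestamp string,
    elb_name string,
    backend_ip string
)
ROW FORMAT DELIMITED FIELDS TERMINATED BY ','
LOCATION 's3://my-demo-bucket/logs/'
"""

def run_athena_query(query, database, output_location):
    """Submit a query (DDL or SELECT) to Athena; results land in S3."""
    import boto3  # assumed available in the environment where this runs

    athena = boto3.client("athena")
    resp = athena.start_query_execution(
        QueryString=query,
        QueryExecutionContext={"Database": database},
        ResultConfiguration={"OutputLocation": output_location},
    )
    return resp["QueryExecutionId"]
```

Once the DDL has run, a plain `SELECT * FROM demo_db.access_logs` submitted the same way works with no servers to manage.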
Which services? Namely Redshift Spectrum. So now you start to see this idea of, you have the data lake, so the files stored in S3, and then you have the various query services: S3 Select, Athena, and Redshift Spectrum. The whole thing together is a set of serverless SQL queries that are applied to a data lake structure. So why is this so significant in terms of data scaling? Well, you completely eliminate any management of virtual machines and virtual server clusters, and you only pay for the query execution, so it's time sliced.
You can see, for example, inside here I did a little select star, and you can see the runtime is this, the data scanned is this, and this is how you pay. Now, if I wanted to run another query, and if I just put in request time stamp, that would be less data, and run the query, I can format of course, then again I just get that result, and there is the amount of data queried. Now, you can see there's a link here to the AWS Glue Data Catalog, and as I mentioned in the beginning of this movie, Glue is really the third part of this.
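The runtime and data-scanned figures shown in the console are also available programmatically, which is handy if you want to track what queries are costing you. A sketch, assuming boto3 is available; the function is illustrative, not the course's code.

```python
def query_cost_stats(query_execution_id):
    """Fetch the runtime and bytes scanned for a finished Athena query.

    Athena bills by data scanned, so the bytes-scanned figure is
    effectively the cost driver shown in the console.
    """
    import boto3  # assumed available in the environment where this runs

    athena = boto3.client("athena")
    execution = athena.get_query_execution(QueryExecutionId=query_execution_id)
    stats = execution["QueryExecution"]["Statistics"]
    return {
        "runtime_ms": stats["EngineExecutionTimeInMillis"],
        "bytes_scanned": stats["DataScannedInBytes"],
    }
```

Selecting fewer columns, as in the request-timestamp example above, directly shrinks the bytes-scanned number, and therefore the bill.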
Not only does it have the metadata, but it also has the ability to do serverless transforms. This was classically called extract, transform, and load, or extract, load, and transform, depending on the order in your particular pipeline. So, a couple different concepts here. Over in Glue, you can see that we have this idea of tables, which is very similar to what we have in Athena, it's just a tabular definition, in this case for those logs. And notice here we have connections, so if I make a new one here, and let's call this a demo, you can see that I can connect to Redshift, RDS, or JDBC.
Really what I'm seeing with these new services, is Amazon is providing choice for customers, and also providing a pathway, to go from server-based scaled data management and query, to serverless. It's really quite innovative, and really quite interesting. So let me close this. You can see they also have crawlers and classifiers. So a crawler basically, as it says here, connects to a data store, progresses through a prioritized list of classifiers to determine the schema for your data, and then creates metadata tables in the data catalog.
You can see inside of here, here's the crawler. Basically, the idea is that if the schema changes, the underlying schema of your CSV files, you can re-crawl and have a dynamic schema update. And this is really pretty exciting. I have quite a few customers for whom this is quite a compelling feature, because in the past, in a relational database, making a schema change was a non-trivial situation. Now, that being said, this is really a paradigm shift in terms of how you would manage this, and how you would then make the applications that connect to this information still function properly if there were schema changes.
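Setting up a crawler can also be done through the Glue API rather than the console. Here's a sketch of that flow with boto3; every argument value (crawler name, role, database, path) is a placeholder you'd supply yourself.

```python
def crawl_s3_path(crawler_name, role_arn, database, s3_path):
    """Create and start a Glue crawler over an S3 path.

    The crawler infers the schema from the files and writes table
    metadata into the Glue Data Catalog; re-running it later picks up
    schema changes, which is the dynamic schema update described above.
    """
    import boto3  # assumed available in the environment where this runs

    glue = boto3.client("glue")
    glue.create_crawler(
        Name=crawler_name,
        Role=role_arn,
        DatabaseName=database,
        Targets={"S3Targets": [{"Path": s3_path}]},
    )
    glue.start_crawler(Name=crawler_name)
```

Because the catalog is shared, the tables the crawler produces are visible to Athena and Redshift Spectrum alike.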
So it's not magical; it's just a new way of thinking about working with data that is not structured, while utilizing all of the knowledge that's out there in the dev world in terms of SQL queryability. The third part of this is going to a transformation job in the Glue console. And this, to me, looks like very classic ETL. You have the picture over here; notice the Database Name, and the Database Table is a CSV file. This is just one from S3 that I have, and then I applied a mapping inside of it, and then I output the new file here.
And inside of here I've chosen to use PySpark. So, it's interesting how this new functionality allows you to do ETL serverlessly. So these services work together: S3 Select, Athena, Glue, and Redshift Spectrum, which works similarly to this; it's just that once you set up a Redshift cluster, you can extend Redshift queryability onto S3 source files, and again be charged by the query, rather than being charged for adding to the amount of data storage in your cluster. It's all very innovative, very impactful in terms of data scaling, and provides a wide menu of choices when you're trying to figure out which Amazon services are gonna work best for your data scaling solution.
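The Glue job itself runs as PySpark in AWS, but the mapping step it performs, renaming and retyping columns on the way from source to target, is easy to illustrate locally. Below is a stdlib-only stand-in, not Glue code; the column names and data are invented for the illustration.

```python
import csv
import io

# Each mapping entry mimics a Glue-style column mapping:
# (source column, source type, target column, target type).
mappings = [
    ("playerID", "string", "player_id", "string"),
    ("yearID", "string", "year", "int"),
]

# Invented input shaped like a CSV table in the catalog.
source_csv = """playerID,yearID,teamID
aaronha01,1954,ML1
"""

def apply_mapping(rows, mappings):
    """Rename and cast columns per the mappings; unmapped columns drop out."""
    casts = {"string": str, "int": int}
    out = []
    for row in rows:
        out.append({
            target: casts[target_type](row[source])
            for source, _, target, target_type in mappings
        })
    return out

transformed = apply_mapping(csv.DictReader(io.StringIO(source_csv)), mappings)
print(transformed)  # [{'player_id': 'aaronha01', 'year': 1954}]
```

In the real job, the same declaration is applied at Spark scale across every file behind the table, with no cluster for you to manage.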