When working with Redshift, you need to understand how the data should be loaded. In this video, you will learn about distribution style: how the data is copied or split across the partitions of the cluster. In production workloads, you may keep complete copies of a table on all the partitions or split it across them, depending on the size of the tables and how they're joined in your query patterns.
- [Voiceover] So, as you've seen, the mechanics of setting up a cluster, creating a database, and loading it with CSV data are pretty simple. There's a little bit more to working with Redshift and let me just highlight some of those points here. In order for Redshift to work properly, you need to understand how the data should be best loaded. There's a couple of considerations. One's called "Distribution style" and what that means is how the data will be copied or split across partitions. Of course, we only set up a single cluster here, but this is a frequent consideration in production workloads.
The "All" style makes complete copies of the tables across all the partitions or splits. In some cases, depending on the sizes of the tables and how they're joined in your query patterns, it would be more effective to do an even or key-based split and also to use a sort key so that the data is pre-sorted. Amazon has some great documentation and a test about this and I'll put the link to that on the side so that you can try that out if you're gonna move to production. In addition, if you're doing ongoing loading to non-empty tables, you may want to use a Redshift feature called vacuuming to reclaim empty space and look at the workload management tuning parameters.
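To make the idea of copying versus splitting concrete, here is a minimal Python sketch of how the three distribution styles place rows on the slices of a toy cluster. It is an illustration under stated assumptions, not Redshift's actual implementation: the `distribute` function, the sample `order_id` and `customer_id` columns, and the use of `crc32` as the hash are all made up for this example.

```python
from zlib import crc32


def distribute(rows, style, num_slices, key=None):
    """Place rows on a toy cluster's slices using the given distribution style.

    Assumption: crc32 stands in for whatever hash Redshift uses internally.
    """
    slices = [[] for _ in range(num_slices)]
    for i, row in enumerate(rows):
        if style == "ALL":
            # Every slice receives a complete copy of the table.
            for s in slices:
                s.append(row)
        elif style == "EVEN":
            # Round-robin: rows are spread evenly regardless of their values.
            slices[i % num_slices].append(row)
        elif style == "KEY":
            # Rows with the same key value always land on the same slice.
            h = crc32(str(row[key]).encode("utf-8"))
            slices[h % num_slices].append(row)
        else:
            raise ValueError(f"unknown distribution style: {style}")
    return slices


# Hypothetical sample data: 12 orders belonging to 3 customers.
rows = [{"order_id": i, "customer_id": i % 3} for i in range(12)]

for style in ("ALL", "EVEN", "KEY"):
    placed = distribute(rows, style, num_slices=4, key="customer_id")
    print(style, [len(s) for s in placed])
```

With "ALL" every slice holds all 12 rows, with "EVEN" each slice holds 3, and with "KEY" the counts depend on how the key values hash, which is why a skewed key can leave some slices doing most of the work.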
In general, you want to understand how to properly load based on distribution styles, sort keys, and your query patterns. Just to give you a little insight into that, let's run a query on Redshift. To do that, from our client, I'm gonna right-click on Queries and say "New Query" and then I'm gonna take this baseline query which retrieves some information, does an aggregate or count, and it has some "where" conditions, has an "and" clause, a "group by" and an "order by", so it's relatively computationally intensive.
And I'll click Run to run the query and you'll see in seven seconds we got our result. Now that might be appropriate for your business case and it may not. In order for you to work with the underlying configuration settings, you can look back at the cluster and let me show you how that works. I'm gonna refresh the cluster and I'm gonna click on Queries and you can see, here's our most recent query. If we click on the query execution number, as we did with the load job, we can see the overhead on our particular implementation of our Redshift cluster and make appropriate adjustments to the size of the cluster, the SQL statement, the load parameters, or a combination of all three. I'm gonna scroll down and you can see the query execution plan. You'd probably want to look at Amazon's documentation to see if this is optimal.
Scrolling down further, I can look at the physical resource usage for this particular query and I can use this information to make decisions about adjustments so I can get the best performance. This is a tip from the real world. Although Redshift is very relational-like, remember it's a column store under the hood. If you're gonna move to production based on what you've seen and learned, I recommend that you take a couple of hours and go through Amazon's tutorial over here. They take you through this and have you do a before and after benchmark, working with some of those parameters that I discussed in the beginning of this movie, the distribution and sort keys and the query processes, so you can see the changes in time based on this common pattern.
I've taken several real world customers through this and they've been able to apply this directly to their workloads and get better results.
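The before-and-after benchmark mentioned in the transcript can be sketched roughly as follows. This is a hedged illustration, not Amazon's tutorial code: the table names `orders_before` and `orders_after`, the column list, and the `run_query` callable (standing in for whatever client you use to execute SQL) are hypothetical. The `DISTSTYLE`, `DISTKEY`, and `SORTKEY` clauses are the Redshift table attributes the transcript refers to.

```python
import time


def create_table_sql(table, columns, diststyle="EVEN", distkey=None, sortkeys=()):
    """Build a Redshift CREATE TABLE statement with distribution and sort settings."""
    cols = ", ".join(f"{name} {ctype}" for name, ctype in columns)
    sql = f"CREATE TABLE {table} ({cols}) DISTSTYLE {diststyle}"
    if diststyle == "KEY":
        if distkey is None:
            raise ValueError("DISTSTYLE KEY requires a distkey column")
        sql += f" DISTKEY ({distkey})"
    if sortkeys:
        sql += f" SORTKEY ({', '.join(sortkeys)})"
    return sql + ";"


def time_query(run_query, sql, repeats=3):
    """Return the best wall-clock time in seconds over several runs.

    Assumption: run_query is a caller-supplied function that executes sql.
    """
    timings = []
    for _ in range(repeats):
        start = time.perf_counter()
        run_query(sql)
        timings.append(time.perf_counter() - start)
    return min(timings)


# Hypothetical table layout used only for illustration.
columns = [("order_id", "INTEGER"), ("customer_id", "INTEGER"), ("order_date", "DATE")]

before = create_table_sql("orders_before", columns)
after = create_table_sql(
    "orders_after", columns, diststyle="KEY",
    distkey="customer_id", sortkeys=["order_date"],
)
print(before)
print(after)
```

Loading the same data into both tables and passing the same baseline query to `time_query` for each would give a simple before-and-after comparison of the kind the tutorial walks through.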
Starting with top-level categories of storage, data, compute, and services, Lynn guides you through planning your ideal AWS architecture, providing service demos using the AWS Console, command-line interface, and other tools. Learn when to use which service for which business case, such as Docker or Lambda, or DynamoDB or Aurora. She shows how to script creation of services such as S3 buckets and EC2 instances, create and populate a managed data warehouse, and develop a data processing pipeline that works for you. Chapter 6 covers the AWS Internet of Things (IoT) services.
These exercises can help you build proof-of-concepts, minimum viable products, and deployable solutions to scale and support big data initiatives at your company.
- Setting up your AWS account
- Using AWS tools
- Defining your minimum viable products
- Choosing compute, storage, and data services
- Using S3, EC2, or Docker for website hosting
- Developing an AWS website
- Using a data warehouse
- Developing a data processing pipeline
- Developing an Internet of Things project with AWS