Learn how to differentiate Glacier storage from normal S3 storage, and S3 Glacier tier from standard Glacier storage.
- [Instructor] AWS Glacier is Amazon's long-term data archiving solution. It is designed to store vast quantities of data as reliably and cheaply as possible. As of this recording, AWS advertises rates of as little as four tenths of a cent per gigabyte, per month. There's literally no limit to the amount of data you can store, and a single archive file can be up to 40 terabytes in size. Let's get a few terms out of the way. In Glacier, you work with vaults. Vaults are roughly equivalent to a bucket in S3.
They're the highest level container into which you insert data, and to which you apply policies. The data files within a vault are referred to as archives, more or less equivalent to objects in an S3 bucket. Like S3, Glacier promises an average annual durability of 11 nines. When your data appears as saved in Glacier, it has already been replicated across multiple locations in Amazon's AWS data centers. In addition, data in Glacier is encrypted by default, and accessible only by the account owner.
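To make the vault and archive terminology concrete, here is a minimal sketch using boto3, the AWS SDK for Python; the vault name and file name are hypothetical.

```python
import boto3

# boto3's Glacier client defaults the account ID to "-" (the current account).
glacier = boto3.client("glacier")

# A vault is the top-level container, roughly analogous to an S3 bucket.
glacier.create_vault(vaultName="my-archive-vault")

# An archive is a single stored file, roughly analogous to an S3 object.
with open("backup-q1.tar.gz", "rb") as f:
    response = glacier.upload_archive(
        vaultName="my-archive-vault",
        archiveDescription="Q1 backup",
        body=f,
    )

# Keep this ID; it's the handle you'll need later to retrieve the archive.
print(response["archiveId"])
```

Note that Glacier doesn't give you a browsable listing of archives the way S3 lists objects; you're responsible for tracking archive IDs, or you can request a vault inventory job.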
Of course, there's a trade-off for this low-cost reliability. As the Glacier name implies, to retrieve data, you must effectively unfreeze it, and the process is not immediate. Think of Glacier like a physical library full of books. You interact with a librarian, who must locate the resource you need. The books, your data, are always there, but it takes some time for them to be made available to you. In fact, to extend the metaphor further, Glacier is like an archive of rare books. The librarian will retrieve what you want, at which point you have a limited amount of time with those books.
You're free to make copies while the books are in the reading room, and this is how Glacier works. You initiate a retrieval request on a certain archive. After a while, usually a few hours, the archive becomes available for download. After a set amount of time, typically 24 hours, that option goes away, and the archive remains in Glacier. Deleting an archive from your Glacier vault is an entirely distinct step. When you initiate a Glacier retrieval job, you have a few options. Standard retrieval is usually complete in about three to five hours. Bulk retrievals are the lowest-cost option.
The time to return is five to 12 hours. Expedited retrievals can be available in one to five minutes, but you pay extra for them. When the data is available, you can download it using the AWS CLI, the Glacier REST API, or code written with the AWS SDK for any of the many supported languages. You can configure individual vaults to notify you via SNS whenever a retrieval job is complete. What this means, of course, is that for any data you need available at a moment's notice, Glacier is not the ideal storage location.
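As a sketch of that workflow in boto3, here is a vault configured for SNS notifications and a standard-tier retrieval; the vault name, archive ID, and topic ARN are all placeholders.

```python
import boto3

glacier = boto3.client("glacier")

# Ask the vault to publish to an SNS topic when retrieval jobs complete.
glacier.set_vault_notifications(
    vaultName="my-archive-vault",
    vaultNotificationConfig={
        "SNSTopic": "arn:aws:sns:us-east-1:123456789012:glacier-jobs",
        "Events": ["ArchiveRetrievalCompleted"],
    },
)

# Initiate the retrieval job. Tier can be "Expedited", "Standard", or "Bulk".
job = glacier.initiate_job(
    vaultName="my-archive-vault",
    jobParameters={
        "Type": "archive-retrieval",
        "ArchiveId": "EXAMPLE-ARCHIVE-ID",
        "Tier": "Standard",
    },
)

# Hours later, once the job reports complete, download the staged copy.
status = glacier.describe_job(vaultName="my-archive-vault", jobId=job["jobId"])
if status["Completed"]:
    output = glacier.get_job_output(vaultName="my-archive-vault", jobId=job["jobId"])
    with open("restored.tar.gz", "wb") as f:
        f.write(output["body"].read())
```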
On the other hand, if you have a subset of data that you must retain, but it's okay if it takes a few hours to access, Glacier can provide a fantastic archiving solution. A few things you should know about Glacier. First of all, Glacier has a minimum 90-day storage policy. You can delete archives earlier than that, but you'll be subject to a small prorated fee. Next, when you're retrieving data via an EC2 instance, you may incur data transfer fees if you go across regions to get to your vault. Instances in the same region as the vault will not incur that fee. Finally, Glacier can be somewhat challenging to work with, since its web console capabilities are limited.
More on that in a moment. Like almost every other AWS service, Glacier vaults can have IAM policies attached. These policies set rules on who can access the vault and how. They're tied to IAM users or other principals. Another type of policy in Glacier is the vault access policy. This type of policy applies broad rules to all users, for example, to prohibit any deletion. Vault access policies are attached directly to vaults, similar to S3 bucket policies. Glacier allows you to set one other kind of policy, a vault lock policy.
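For example, a vault access policy that prohibits deletion for everyone might look like this sketch; the account ID, region, and vault name are placeholders.

```python
import json

import boto3

glacier = boto3.client("glacier")

# Deny glacier:DeleteArchive to every principal, regardless of their
# individual IAM permissions.
policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "deny-all-deletes",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "glacier:DeleteArchive",
            "Resource": "arn:aws:glacier:us-east-1:123456789012:vaults/my-archive-vault",
        }
    ],
}

glacier.set_vault_access_policy(
    vaultName="my-archive-vault",
    policy={"Policy": json.dumps(policy)},
)
```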
These are special policies that can themselves be locked to prohibit changes. Sometimes you have compliance rules that absolutely must be enforced. Vault locks make sure you can do this in a compliant, auditable way. For instance, you can enforce write-once-read-many (WORM) policies that make sure your archives are never modified. Or you can enforce a policy that prohibits deletion before an archive has existed for a certain amount of time. Vault lock policies start in a pending state, so you can make sure you have them right before they go into effect.
While the lock is pending, you can't update the policy in place; you can only abort the lock, which deletes the policy so you can start over. Once you complete the lock, the policy is permanent and can't be changed. It's a bit like locking your car with the keys inside. You're deliberately setting rules that no one, not even you, should be able to break.
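Here's a sketch of that two-step lock workflow in boto3, using a WORM-style rule that denies deletion until an archive is at least a year old; the vault name and ARN are placeholders.

```python
import json

import boto3

glacier = boto3.client("glacier")

# Deny deletion of any archive younger than 365 days, a classic
# compliance-driven retention rule.
lock_policy = {
    "Version": "2012-10-17",
    "Statement": [
        {
            "Sid": "deny-deletes-for-one-year",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "glacier:DeleteArchive",
            "Resource": "arn:aws:glacier:us-east-1:123456789012:vaults/my-archive-vault",
            "Condition": {
                "NumericLessThan": {"glacier:ArchiveAgeInDays": "365"}
            },
        }
    ],
}

# Step 1: attach the policy in the pending (in-progress) state.
# The returned lock ID is only valid for 24 hours.
lock = glacier.initiate_vault_lock(
    vaultName="my-archive-vault",
    policy={"Policy": json.dumps(lock_policy)},
)

# Test against the pending policy here; if it's wrong, call
# glacier.abort_vault_lock(vaultName=...) and start over.

# Step 2: once satisfied, make the lock permanent.
glacier.complete_vault_lock(
    vaultName="my-archive-vault",
    lockId=lock["lockId"],
)
```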
So how do you work with Glacier? While it's trivial to set up a new vault in the AWS console, getting data in and out requires you to either work with the AWS Command Line Interface, call the REST API, or use an AWS SDK such as the ones for Python and Ruby. If you're building a script or application to work with Glacier, this is great. There are also third-party applications that have Glacier functionality built in. But sometimes you want things a bit easier. Fortunately, there's another, much easier way to work with Glacier, and that's using S3 and S3 lifecycle rules. In many cases, you may not want to archive data directly to Glacier. Rather, you'll be starting with hot data, data that is frequently accessed. What you'd like to do is migrate it over time, as it becomes less frequently used, to cheaper long-term storage. S3 lifecycle rules allow exactly that, with the final, cheapest option being the Glacier storage tier.
When S3 objects make their way to Glacier, these objects do not appear in the Glacier console. Rather, they continue to appear as S3 objects with a storage class of Glacier. At that point they will not have a download button, but a begin-retrieval button. In this way, S3 Glacier-tier objects work just like normal Glacier archives, but with the added bonus of being managed by automated lifecycle rules. It must be noted that the S3 Glacier tier is not exactly Glacier proper, and there are two drawbacks to getting data to Glacier this way.
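That begin-retrieval step is scriptable too. Here's a sketch using boto3's S3 client; the bucket and key are hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Ask S3 to thaw a Glacier-tier object. The restored copy stays
# downloadable for the requested number of days, while the object
# itself remains in Glacier the whole time.
s3.restore_object(
    Bucket="my-log-bucket",
    Key="logs/2018/01/app.log.gz",
    RestoreRequest={
        "Days": 7,
        "GlacierJobParameters": {"Tier": "Standard"},
    },
)
```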
First, you can't set vault lock policies. And second, there's no good way to get notifications when retrievals are complete. But aside from those minor issues, S3 lifecycle rules are a really convenient way to take advantage of the cheap, reliable storage that Glacier provides. S3 lifecycle rules can migrate data to cheaper storage tiers in S3. They're based on object age in days. They cannot be based on access frequency, which is unfortunate. And they will not apply to objects less than 128 KB in size.
Just a reminder about your options with S3. You can directly upload files to the Standard, Infrequent Access, and Reduced Redundancy tiers, but you cannot directly upload to Glacier. With lifecycle rules, you can transition from Standard to IA, or Standard to Glacier. You can even transition from IA to Glacier, which makes sense when you consider the typical aging transition path we've discussed. However, that's it. You can't transition backward from Glacier to anything. You can't go back from IA to Standard, and you can't transition anything to Reduced Redundancy.
That's strictly an upload-time decision. Now that we have an idea of how S3 lifecycle rules work, let's take a look at setting some up.
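Before we switch to the console, here's what such a configuration looks like in script form: a sketch using boto3, where the bucket name, prefix, and day counts are all hypothetical.

```python
import boto3

s3 = boto3.client("s3")

# Age objects under the "logs/" prefix from Standard to IA at 30 days,
# then to the Glacier storage tier at 90 days.
s3.put_bucket_lifecycle_configuration(
    Bucket="my-log-bucket",
    LifecycleConfiguration={
        "Rules": [
            {
                "ID": "archive-old-logs",
                "Filter": {"Prefix": "logs/"},
                "Status": "Enabled",
                "Transitions": [
                    {"Days": 30, "StorageClass": "STANDARD_IA"},
                    {"Days": 90, "StorageClass": "GLACIER"},
                ],
            }
        ]
    },
)
```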
Join AWS architect Brandon Rich and learn how to configure object storage solutions and lifecycle management in Simple Storage Service (S3), a web service offered by AWS, and migrate, back up, and replicate relational data in RDS. Find out how to leverage flexible network storage with Elastic File System (EFS), and use the new AWS Glue service to move and transform data. Plus, learn how Snowball can help you transfer truckloads of data in and out of the cloud.
- What is data management?
- AWS S3 basics
- S3 bucket creation
- S3 upload and logging
- S3 event notifications
- S3 data lifecycle configuration
- Working with Amazon Elastic Block Store volumes
- Creating and mounting an EFS
- Creating an AWS RDS instance
- RDS backup and recovery
- Moving data with AWS Database Migration Service
- Moving data with Data Pipeline and Glue