Learn how to define and identify ETL (extract, transform, load) jobs and describe the architecture of an AWS Glue job.
- [Instructor] AWS Glue provides a service similar to Data Pipeline, but with some key differences. First, it's a fully managed service: you don't provision any instances to run your tasks. Second, it's based on PySpark, the Python implementation of Apache Spark. You design your data flows in Glue by connecting sources to targets, with transformations in between. The Glue wizard and GUI help you define these jobs, which generate PySpark code. If you're familiar with Python and Apache Spark, you'll be right at home. If not, Glue can get you started by proposing designs for some simple ETL jobs.
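The heart of a generated Glue job is usually an ApplyMapping step that renames source fields and casts them to target types. As a rough sketch of what that transform does (a real Glue script would call `awsglue.transforms.ApplyMapping` on a DynamicFrame; this is a plain-Python stand-in with hypothetical field names):

```python
# Simplified stand-in for Glue's ApplyMapping transform:
# rename each source field and cast it to the target type.
def apply_mapping(rows, mapping):
    """mapping: list of (source_field, target_field, cast) tuples."""
    return [{dst: cast(row[src]) for src, dst, cast in mapping} for row in rows]

orders = [{"order_id": "1001", "amt": "19.99"}]  # hypothetical source rows
mapped = apply_mapping(orders, [("order_id", "id", int), ("amt", "total", float)])
# mapped[0] == {"id": 1001, "total": 19.99}
```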
Because Glue is fully serverless, you pay for the resources consumed by your running jobs, but you never have to create or manage any EC2 instances. Another core feature of Glue is that it maintains a metadata repository of your various data schemas. These could be relational table schemas, the format of a delimited file, or more. Although it is sometimes confusing, Glue calls these metadata repositories databases. You can define a schema in one of two ways. First, you can enter it manually, typing the name of each data column and then specifying its type and data width.
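A manually entered schema ends up in the Glue Data Catalog as a table definition. Here is a sketch of that structure in the `TableInput` shape that boto3's `glue.create_table` accepts; the table, columns, and S3 path are hypothetical:

```python
# A hand-entered table schema for the Glue Data Catalog
# (hypothetical column names and S3 location).
table_input = {
    "Name": "orders",
    "StorageDescriptor": {
        "Columns": [
            {"Name": "order_id", "Type": "bigint"},
            {"Name": "customer", "Type": "string"},
            {"Name": "total", "Type": "decimal(10,2)"},
        ],
        "Location": "s3://example-bucket/orders/",  # hypothetical path
    },
}
# With AWS credentials configured, you would register it with:
# boto3.client("glue").create_table(DatabaseName="sales", TableInput=table_input)
```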
Alternatively, Glue can search your data sources and discover on its own what data schemas exist. To do this, you define what's called a crawler. Crawlers can read from S3, RDS, or JDBC data sources; for database sources, they require a login account. They can discover table schemas, but they do not discover relationships between tables. Finally, they can be scheduled to rerun so the metadata stays current over time. With crawlers keeping your metadata up to date, mapping source data to destinations becomes fairly straightforward. Keep in mind that existing jobs are not automatically aware when schemas change; they still need to be refreshed.
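Conceptually, a crawler samples your data and guesses a type for each column. This toy sketch (not the Glue classifier itself) shows the idea on delimited data with hypothetical columns:

```python
# Crude illustration of crawler-style schema inference:
# try progressively looser casts on sample values for each column.
def infer_type(values):
    for caster, type_name in ((int, "bigint"), (float, "double")):
        try:
            for v in values:
                caster(v)
            return type_name
        except ValueError:
            continue
    return "string"

def infer_schema(header, rows):
    columns = list(zip(*rows))  # pivot rows into columns
    return {name: infer_type(col) for name, col in zip(header, columns)}

schema = infer_schema(["id", "price", "sku"],
                      [("1", "9.99", "A-100"), ("2", "12.50", "B-200")])
# schema == {"id": "bigint", "price": "double", "sku": "string"}
```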
Jobs can be triggered on a schedule, such as daily or monthly; on completion of another job, which lets you chain dependent jobs; or on demand. Finally, a few caveats of Glue. Unlike many popular ETL packages, it has no third-party connectors: you're not going to be connecting to Salesforce out of the box with Glue. As I mentioned earlier, when schemas change, you'll need to update the jobs that use them. Chaining dependent jobs is possible, but job chains are not easy to visualize once built. Finally, the wizard and GUI are only suitable for very simple jobs.
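The three trigger types map to the `SCHEDULED`, `CONDITIONAL`, and `ON_DEMAND` definitions that boto3's `glue.create_trigger` accepts. A sketch with hypothetical job and trigger names:

```python
# Scheduled trigger: fires on a Glue cron expression (02:00 UTC daily here).
scheduled = {
    "Name": "nightly-load",
    "Type": "SCHEDULED",
    "Schedule": "cron(0 2 * * ? *)",
    "Actions": [{"JobName": "load_orders"}],
}

# Conditional trigger: fires when a predecessor job succeeds (job chaining).
conditional = {
    "Name": "after-load",
    "Type": "CONDITIONAL",
    "Predicate": {"Conditions": [
        {"JobName": "load_orders", "State": "SUCCEEDED",
         "LogicalOperator": "EQUALS"}]},
    "Actions": [{"JobName": "transform_orders"}],
}

# On-demand trigger: fires only when you start it yourself.
on_demand = {
    "Name": "manual-run",
    "Type": "ON_DEMAND",
    "Actions": [{"JobName": "transform_orders"}],
}
# Each dict could be passed as: boto3.client("glue").create_trigger(**scheduled)
```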
After that, you're writing Python. The display can be updated, but it's not really a mechanism for writing jobs, so Glue can't really be called a no-code solution. With that said, let's get into Glue and see what it can do.
Join AWS architect Brandon Rich and learn how to configure object storage solutions and lifecycle management in Simple Storage Service (S3), a web service offered by AWS, and migrate, back up, and replicate relational data in RDS. Find out how to leverage flexible network storage with Elastic File System (EFS), and use the new AWS Glue service to move and transform data. Plus, learn how Snowball can help you transfer truckloads of data in and out of the cloud.
- What is data management?
- AWS S3 basics
- S3 bucket creation
- S3 upload and logging
- S3 event notifications
- S3 data lifecycle configuration
- Working with Amazon Elastic Block Store volumes
- Creating and mounting an EFS
- Creating an AWS RDS instance
- RDS backup and recovery
- Moving data with AWS Database Migration Service
- Moving data with Data Pipeline and Glue