Learn how to identify the use cases and architecture behind AWS Data Pipeline.
- [Instructor] At this point we've seen a number of ways to move data around within AWS. For instance, you have automatic replication tasks, configurable on services like S3 and RDS, and you have event-based actions like S3 Events and DynamoDB Streams. But data integration is more than just replication. Often you need to build processes that extract data from one location, transform it, and load it elsewhere: extract, transform, load, or ETL for short. AWS has two services that provide this capability, Data Pipeline and AWS Glue.
Let's start with Data Pipeline. Data Pipeline is a tool for building repeatable data flows using a graphical editor. Each flow consists of multiple steps, or activities, that perform actions you define. For instance, you might extract data from some source, such as a CSV file, JSON object, or database table. You then transform the data however you need using one of Pipeline's supported languages, such as Pig or Hive. Finally, you load the resulting data into a target. That target could be, for instance, an S3 object, a database, or a data warehouse.
In Pipeline you have DataNodes and Activities. Examples of DataNodes include the S3 DataNode and the MySQL DataNode. DataNodes may be sources or sinks of data; sink is another word for target. Activities include the copy activity, SQL activity, and shell command activity. There are also activities for common big data processing languages, like the aforementioned Hive and Pig, and for AWS's own Elastic MapReduce. Each activity must be run on a resource.
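To make those terms concrete, here is a minimal sketch of how a source DataNode, a target DataNode, and a copy activity might look when expressed as pipeline objects through the Data Pipeline API — the same structure the graphical editor builds for you. All of the IDs, bucket paths, and IAM role names here are assumptions for illustration, and the "runsOn" reference points at a resource defined in the next sketch.

```python
# A rough sketch of DataNodes and an Activity in the format the Data Pipeline
# API accepts. IDs, bucket paths, and role names are hypothetical.
definition_objects = [
    # Default object: settings inherited by every other object in the pipeline.
    {
        "id": "Default",
        "name": "Default",
        "fields": [
            {"key": "scheduleType", "stringValue": "ondemand"},
            {"key": "role", "stringValue": "DataPipelineDefaultRole"},
            {"key": "resourceRole", "stringValue": "DataPipelineDefaultResourceRole"},
            {"key": "pipelineLogUri", "stringValue": "s3://my-log-bucket/pipeline-logs/"},
        ],
    },
    # Source DataNode: a folder of CSV files in S3.
    {
        "id": "SourceNode",
        "name": "SourceNode",
        "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://my-source-bucket/input/"},
        ],
    },
    # Target DataNode, the sink: another S3 location.
    {
        "id": "TargetNode",
        "name": "TargetNode",
        "fields": [
            {"key": "type", "stringValue": "S3DataNode"},
            {"key": "directoryPath", "stringValue": "s3://my-target-bucket/output/"},
        ],
    },
    # CopyActivity: reads from the source node, writes to the target node, and
    # executes on whatever resource its "runsOn" field references.
    {
        "id": "CopyData",
        "name": "CopyData",
        "fields": [
            {"key": "type", "stringValue": "CopyActivity"},
            {"key": "input", "refValue": "SourceNode"},
            {"key": "output", "refValue": "TargetNode"},
            {"key": "runsOn", "refValue": "WorkerInstance"},
        ],
    },
]
```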
Data Pipeline is a sort of semi-managed service. When you create a job, you can tell Pipeline to provision EC2 instances for your activities to execute on. You can then run each activity on its own instance, or have them share one. Perhaps there are certain parts of the job that you want to run on a larger instance size. Or perhaps you'd like to execute parts of the job on premises. That's right: as long as you install Data Pipeline's Java-based Task Runner, you can process activities with local resources. Jobs can be run on demand, or they can be scheduled so that they process data on a regular basis.
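Below is a continuation of the sketch above, under the same hypothetical names: an Ec2Resource and a Schedule object, followed by the boto3 calls that create, define, and activate the pipeline. The instance type, period, pipeline name, and region are all assumptions for illustration.

```python
import boto3

# Continuing the sketch above: a resource to run on and a schedule, plus the
# API calls that register and start the job.
resource_and_schedule = [
    # Ec2Resource: Pipeline provisions this instance for any activity whose
    # "runsOn" field references it, and terminates it when the work is done.
    {
        "id": "WorkerInstance",
        "name": "WorkerInstance",
        "fields": [
            {"key": "type", "stringValue": "Ec2Resource"},
            {"key": "instanceType", "stringValue": "m1.medium"},
            {"key": "terminateAfter", "stringValue": "1 Hour"},
        ],
    },
    # Schedule: reference this from the activity (and switch the Default
    # object's scheduleType to "cron") to run the job on a recurring basis
    # instead of on demand.
    {
        "id": "DailySchedule",
        "name": "DailySchedule",
        "fields": [
            {"key": "type", "stringValue": "Schedule"},
            {"key": "period", "stringValue": "1 Day"},
            {"key": "startAt", "stringValue": "FIRST_ACTIVATION_DATE_TIME"},
        ],
    },
]

client = boto3.client("datapipeline", region_name="us-east-1")

# Create the pipeline shell, attach the full definition, and activate it.
created = client.create_pipeline(name="csv-copy-demo", uniqueId="csv-copy-demo-001")
client.put_pipeline_definition(
    pipelineId=created["pipelineId"],
    pipelineObjects=definition_objects + resource_and_schedule,  # list from the previous sketch
)
client.activate_pipeline(pipelineId=created["pipelineId"])
```

To push an activity on premises instead, you would replace its "runsOn" reference with a "workerGroup" field and start the locally installed Task Runner with the same worker group name, so that it polls Data Pipeline for tasks assigned to that group.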
For more, let's take a look at a simple Pipeline job generated by the DynamoDB service.
Join AWS architect Brandon Rich and learn how to configure object storage solutions and lifecycle management in Simple Storage Service (S3), a web service offered by AWS, and migrate, back up, and replicate relational data in RDS. Find out how to leverage flexible network storage with Elastic File System (EFS), and use the new AWS Glue service to move and transform data. Plus, learn how Snowball can help you transfer truckloads of data in and out of the cloud.
- What is data management?
- AWS S3 basics
- S3 bucket creation
- S3 upload and logging
- S3 event notifications
- S3 data lifecycle configuration
- Working with Amazon Elastic Block Store volumes
- Creating and mounting an EFS
- Creating an AWS RDS instance
- RDS backup and recovery
- Moving data with AWS Database Migration Service
- Moving data with Data Pipeline and Glue
2. Object Storage
3. File Systems
4. Database Services
5. Getting Data to AWS
6. Moving Data in AWS