Learn how to define and run an AWS Glue job to transfer CSV data from S3 into relational tables in RDS MySQL.
- [Instructor] Now that Glue knows about our S3 metadata for the states.csv file, and it has a connection to MySQL, it's time to create a job. Click Jobs under ETL on the left and choose Add Job. The name for this job will be StatestoMySQL. As usual, we choose the GlueServiceRole that we created earlier. In this job, we're going to go with a proposed script generated by AWS Glue. Glue is going to create an S3 bucket to store data about this job.
It prefills that path, but it does not prefill the temporary directory below. All we need to do is copy the path, backspace out the root, and give it a name like tmp. Keep everything else at the defaults and choose Next. Here we're choosing data sources. We have the states metadata CSV, so we'll choose that and click Next. For the target, we don't have any metadata describing a place that we want to put data into. Instead, we want to choose Create tables in your data target. Check that radio button and we're given different options.
The data store is JDBC, a connection to the database. The connection is the one that we've already built, the MySqlDemoConnection. The database name will be MySQL RDS demo, the name of the database that we created earlier in the series. Click Next. This screen shows a mapping UI where we can tell Glue how to map the source columns on the left-hand side to the target columns on the right. This is not a particularly sophisticated screen: we can change things around, but we can't map a column to more than one place.
So leave everything as-is because it's matched based on the column names. If the target side were an existing schema, we might want to move things around and map things in a different way, but for now, it's going to create these columns new. Click Next then click Finish. Now we're at the job view of AWS Glue. You can see a visual representation on the side of what's going to happen here. The database source, the transform, which is the mapping, and then down at the bottom, our target, MySqlDemoConnection.
In the code window is the PySpark code that was generated based on our choices. You see some import statements at the top, and then Python code that uses Apache Spark. Browse through it and you can get an idea of what's happening here. Our datasource0 is the S3 metadata: as you can see, states metadata is the database and states CSV is the table name. Glue is taking advantage of the metadata that it has already crawled. It then applies our mapping, so you see column-to-column pairs in this line, calls DropNullFields, and finally writes the data frame to our MySQL connection.
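To get a feel for what those generated steps do, here is a minimal, local illustration in plain Python, with no Glue runtime involved. It mimics the behavior of ApplyMapping and DropNullFields on ordinary dicts; the column names are hypothetical stand-ins, since the exact states.csv schema isn't shown in the transcript.

```python
# Local illustration of what Glue's ApplyMapping + DropNullFields steps do,
# using plain Python dicts in place of Glue DynamicFrames.
# Column names here are hypothetical stand-ins for the states.csv schema.

# Source rows, as if crawled from states.csv in S3
rows = [
    {"name": "Indiana", "abbreviation": "IN", "capital": "Indianapolis"},
    {"name": "Ohio", "abbreviation": "OH", "capital": None},
]

# Column-to-column mapping, like the pairs in the generated ApplyMapping call.
# Because the target columns are created new, source and target names match.
mapping = {"name": "name", "abbreviation": "abbreviation", "capital": "capital"}

def apply_mapping(row, mapping):
    """Select and rename columns according to the source-to-target mapping."""
    return {target: row.get(source) for source, target in mapping.items()}

def drop_null_fields(row):
    """Remove fields whose value is null, like Glue's DropNullFields."""
    return {k: v for k, v in row.items() if v is not None}

transformed = [drop_null_fields(apply_mapping(r, mapping)) for r in rows]
print(transformed)

# In the real generated script, the final step hands the transformed frame
# to a JDBC writer pointed at the MySQL connection.
```

The real script operates on DynamicFrames and runs on Spark executors, but the row-level logic is essentially this: rename and select columns, strip nulls, then write.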
To run this job, click Run Job, then again Run Job. This may take a little while to complete. In the logs section down here, we'll see quite a bit of output. Glue is an early service, and one of its drawbacks is that the logging is incredibly verbose: output from the job and any errors appear intermingled with very extensive logging that can be hard to parse through. When it's done, the Run Job button will stop spinning and we'll know that we can go look at our MySQL database.
So you can see now that the job has completed, because the spinner has gone away from Run Job and we've not been shown any errors. However, we do have a lot of log data, as you can see by scrolling through. Let's see what happened in the MySQL target. I'll switch over to MySQL Workbench. Here I'm already logged in to the MySQL database. If I click this icon to look at the database details, I can choose Tables, expand the first row, and see that the states CSV table has been created. If I right-click and choose Send to SQL Editor > Select All Statement, we can query it and see what's in there.
All of our data from the states.csv text file that was in S3 has now been transformed into relational data in the MySQL database.
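The same verification can also be scripted rather than done through Workbench. As a minimal sketch, using Python's built-in sqlite3 as a local stand-in for the RDS MySQL instance (the real endpoint, credentials, and exact column names aren't shown in the transcript, so the table and columns here are hypothetical):

```python
import sqlite3

# sqlite3 stands in for the RDS MySQL database here; against the real
# instance you would use a MySQL client library with the RDS endpoint.
conn = sqlite3.connect(":memory:")

# Stand-in for the states_csv table that the Glue job created,
# with hypothetical columns.
conn.execute("CREATE TABLE states_csv (name TEXT, abbreviation TEXT)")
conn.executemany(
    "INSERT INTO states_csv VALUES (?, ?)",
    [("Indiana", "IN"), ("Ohio", "OH")],
)

# Equivalent of Workbench's "Select All Statement" on the new table
rows = conn.execute("SELECT * FROM states_csv").fetchall()
print(rows)
```

The point is simply that a plain SELECT against the target table is enough to confirm the job landed the CSV rows as relational data.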
Join AWS architect Brandon Rich and learn how to configure object storage solutions and lifecycle management in Simple Storage Service (S3), a web service offered by AWS, and migrate, back up, and replicate relational data in RDS. Find out how to leverage flexible network storage with Elastic File System (EFS), and use the new AWS Glue service to move and transform data. Plus, learn how Snowball can help you transfer truckloads of data in and out of the cloud.
- What is data management?
- AWS S3 basics
- S3 bucket creation
- S3 upload and logging
- S3 event notifications
- S3 data lifecycle configuration
- Working with Amazon Elastic Block Store volumes
- Creating and mounting an EFS
- Creating an AWS RDS instance
- RDS backup and recovery
- Moving data with AWS Database Migration Service
- Moving data with Data Pipeline and Glue