Learn how to use AWS Glue to create a user-defined job that uses custom PySpark code to perform a simple join between a relational table in MySQL RDS and a CSV file in S3.
- [Narrator] For this demo we're going to be joining two heterogeneous sources. First is the employees table in my MySQL database. You can see that I have it selected here, and you can see the employees that exist there. Each one has an ID and a department ID. To this we are going to join another file. You will find it in the chapter six folder of the exercise files; it's called employee_satisfaction.csv. Let's take a look. This is an example of what might be a survey given to employees: overall satisfaction with their job in different categories.
And there are a few comments. It also has an employee ID; this will be the field we use to join the two data sources using AWS Glue. Now let's head to the AWS main console to start the job. We'll go directly to Glue this time. The first thing that we need to do is make Glue aware of both sides of this join, so we need to create a new crawler. Click Add Crawler under the Crawlers menu. This one is going to be MySqlDemoCrawler.
The data store for this one is not S3 but JDBC; it's a database connection. We'll use the existing MySqlDemoConnection. The path should be the database that we're going to use; you can see this suggestion in the help text below. mysql_rds_demo was the name of our database. Click Next. We do not want to add another data store, so choose No and click Next. Once again we'll choose our AWSGlueServiceRole. Run on demand is fine; we'll start it in the next step.
Now we choose our database. We don't want to use the existing states_metadata. Remember, this is not the database that we're trying to connect to, but rather the metadata repository to which the table schemas that the crawler crawls will be added. Click Add Database. We'll name this mysql_metadata. Then click next. We'll review our choices. Choose finish, and on the next screen, click the checkbox next to MySqlDemoCrawler and choose Run Crawler.
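For reference, the same crawler can also be created and started programmatically. Below is a minimal boto3 sketch of the console steps above; the crawler, connection, role, and database names mirror the demo, and the JDBC path format (database/%) is an assumption based on the console's help text. An S3-based crawler would use an S3Targets entry instead of JdbcTargets.

```python
import boto3

glue = boto3.client("glue")  # assumes credentials and region are already configured

# Create a crawler that connects over JDBC and writes the crawled schemas
# into the mysql_metadata catalog database (names match the demo; adjust as needed).
glue.create_crawler(
    Name="MySqlDemoCrawler",
    Role="AWSGlueServiceRole",
    DatabaseName="mysql_metadata",
    Targets={
        "JdbcTargets": [
            {
                "ConnectionName": "MySqlDemoConnection",
                "Path": "mysql_rds_demo/%",  # assumed format: <database>/%
            }
        ]
    },
)

# Kick off the crawl on demand, just like clicking Run Crawler in the console.
glue.start_crawler(Name="MySqlDemoCrawler")
```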
While we are waiting for it to complete we can set up our data for the satisfaction survey. Let's leave AWS Glue via the main console menu. We'll go into S3. I've created a new bucket called employee-surveys, dash, my initials, and placed employee_satisfaction.csv inside. Now that that's there, I'll head back to AWS Glue and set up a crawler for it. Under Crawlers I'll choose Add Crawler and call it SurveyCrawler.
The data store this time is S3, and the path is s3://employee-surveys-bms/employee_satisfaction.csv. We'll choose Next and continue. Again, we do not want to add another data store, so we'll choose No and click Next. Go with the role that we've been using and click Next. Run on demand, click Next. And now we'll add another metadata database; we'll call this one survey_metadata.
No prefix to add, so choose Next. We'll review, finish, and run the crawler. It looks like the MySqlDemoCrawler is just about done. Let's hit the refresh button. Now we see the success message for the MySqlDemoCrawler, and it notes that it has created six tables. Let's go to Tables to check it out. We can see that under the database mysql_metadata there are now a number of tables; they correspond to those that are in the MySQL database: departments, employees, the new table that we created from the replication, project team members, even states.csv, which we did in the last Glue demo.
Let's look at employees. This is the table that we're ultimately going to want to join to the data that's in s3. And you can see the schema there. In a moment we're going to need the database name and the name that has been assigned to this table. Take careful note that as far as Glue is concerned, the name of this table is not just employees, but it's mysql_rds_demo_employees. And that will be important. Let's head back to crawlers and see how the s3 crawler is doing. Looks like SurveyCrawler is almost done.
We'll click refresh and wait for it to complete. Now that it's ready we can click SurveyCrawler, go over to Tables, and see what it found. Here's survey_metadata and the employee_satisfaction.csv with all the columns that we expect. Now let's create the job that will join these two. On the left-hand side go to ETL and then Jobs, and we'll add a job. We'll call this JoinDemo. The IAM role, of course, is AWSGlueServiceRole.
This time, instead of choosing a proposed script generated by AWS Glue, we'll choose the third option: a new script to be authored by you. The script that we will want to use is contained in the exercise files under chapter six, and we'll fill in a few values. But before I get to that, I'd like to point out the GitHub repo that AWS maintains with Glue examples. The URL is github.com/awslabs/aws-glue-samples. Inside you'll find a few examples of how to do things with the PySpark code that is supported by AWS Glue.
Under examples you'll see join_and_relationalize. There's also a Markdown file that goes along with it; that Markdown content is also in AWS's documentation. If you look at this code you'll see some of the code that is the basis for the demo I'm going to show you today. Click back to the Glue console. Again, for the temporary directory we can copy the S3 path that's suggested for us, take away root and type tmp. That can be anything we want it to be, but tmp makes the most sense for a temporary directory.
Click Next. For connections, we'll want to use the MySQL connection, so click Add, and then Next. At this point there's nothing else to do. Click Finish and we'll see a blank, blue screen. From here we want to go copy the exercise file from chapter six. Let's go take a look at it. At this point we are delving a little bit into PySpark code; to learn more, search the video library for Apache Spark. Let's take a look at what's going on here. At the top we have boilerplate imports that make sure that we have the Glue context and other important libraries for this job.
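As a reference, that boilerplate typically looks like the sketch below. It follows the standard Glue script pattern and may not match the exercise file line for line.

```python
import sys
from awsglue.transforms import *             # Join, SelectFields, and other Glue transforms
from awsglue.utils import getResolvedOptions  # reads job arguments passed by Glue
from pyspark.context import SparkContext
from awsglue.context import GlueContext
from awsglue.job import Job
```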
We create a Glue context, and then we set up the employees table. We do something called create_dynamic_frame. The frame is a little bit like a table, and in fact you can see from the parameters that we are going to tell it both the metadata database name and the metadata table name for the crawled employees table. You see where I've said EMPLOYEE_METADATA_DATABASE_NAME_HERE; that's where we'll paste the value. And the same for EMPLOYEE_TABLE_TABLE_NAME_HERE. From there we print the count from this table and the entire schema.
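That part of the script looks roughly like this sketch; the placeholder strings stand in for the metadata database and table names we'll paste in shortly.

```python
# Create the GlueContext from the underlying SparkContext.
glueContext = GlueContext(SparkContext.getOrCreate())

# Build a dynamic frame from the crawled employees table. The database and
# table names come from the Glue Data Catalog (placeholders to be replaced).
employees = glueContext.create_dynamic_frame.from_catalog(
    database="EMPLOYEE_METADATA_DATABASE_NAME_HERE",   # e.g. mysql_metadata
    table_name="EMPLOYEE_TABLE_TABLE_NAME_HERE",       # e.g. mysql_rds_demo_employees
)

print("Employee count:", employees.count())
employees.printSchema()
```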
Similarly, we create a dynamic frame, this time using the metadata for the survey CSV. We then call a function that limits the fields that come back from the satisfaction data; we'll only take employee_id, overall_satisfaction, and comments. Then we print the count from that dynamic frame and print the entire schema. Here we do the join: Join.apply, one dynamic frame to the other, on the keys id from the employees table and employee_id from the CSV.
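Roughly, that looks like the following sketch. The exercise file may use a different field-limiting transform; SelectFields is one way to do it, and the survey placeholders here are assumed to parallel the employee ones.

```python
# Dynamic frame for the crawled survey CSV (placeholders to be replaced).
survey = glueContext.create_dynamic_frame.from_catalog(
    database="SURVEY_METADATA_DATABASE_NAME_HERE",   # e.g. survey_metadata
    table_name="SURVEY_TABLE_NAME_HERE",             # e.g. employee_satisfaction
)

# Keep only the columns we care about from the survey data.
survey_trimmed = SelectFields.apply(
    frame=survey,
    paths=["employee_id", "overall_satisfaction", "comments"],
)

print("Survey count:", survey_trimmed.count())
survey_trimmed.printSchema()

# Inner join: employees.id == survey.employee_id.
joined = Join.apply(employees, survey_trimmed, "id", "employee_id")
```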
And that performs the join as an inner join. Then we count the joined data and print the entire schema. Finally, we call some functions to repartition the data and write it out as a CSV to our primary demo bucket, whose name we will also replace; a sketch of that step follows below. Let's copy the content of this file and head back to AWS Glue. I'll paste it into the code window and give myself a little more space. Before we do anything, we're going to want to replace all of these instances where we referred to the metadata.
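Here is a sketch of that final step, assuming a repartition down to a single partition and a CSV write to an S3 path; the bucket placeholder stands in for your own demo bucket.

```python
# Collapse to one partition so the output lands as a single CSV object.
single = joined.repartition(1)

# Write the joined data out to S3 as CSV (replace the bucket with your own).
glueContext.write_dynamic_frame.from_options(
    frame=single,
    connection_type="s3",
    connection_options={
        "path": "s3://DEMO_BUCKET_NAME_HERE/employee_satisfaction_single_file.csv"
    },
    format="csv",
)
```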
Let's do those replacements now. Open the AWS main console in a new tab. On that second tab we'll go to AWS Glue, into Tables. The first table we need is the employees table under the mysql_metadata database. Click into it and first grab the table name here at the top. Paste that in for the table name, then head back to the other tab, grab the database name, which is mysql_metadata, and replace the first value, EMPLOYEE_METADATA_DATABASE_NAME_HERE.
We'll do the same for the satisfaction data by going back to Tables, clicking employee_satisfaction, grabbing the table name, and pasting it in. Back on the second tab, we'll grab the database name, survey_metadata, and paste that in as well. Finally, we'll scroll to the bottom and replace our demo bucket name; mine is called demo-primary, and then my initials. From here I'll click Save and run the job, clicking Run job again to kick things off.
Now that the job is finished, we'll exit via the main console and go to S3 to see our output. Under the demo primary bucket you can see employee_satisfaction_single_file.csv, but note that this is actually not a file but a folder, and within it is a longer-named file that represents our output. Click Download As, then right-click the link and choose Save As. We'll name this file join_result.csv.
Opening the file you can see that we have the names that exist only in the mysql table, the comments that exist in the csv file, and the ratings. We also have employee IDs from both tables. To clean this up more would be a further dive into PySpark, but that's for another video series. We've now seen how you can use AWS Glue and PySpark code to join two heterogeneous data sources.
Join AWS architect Brandon Rich and learn how to configure object storage solutions and lifecycle management in Simple Storage Service (S3), a web service offered by AWS, and migrate, back up, and replicate relational data in RDS. Find out how to leverage flexible network storage with Elastic File System (EFS), and use the new AWS Glue service to move and transform data. Plus, learn how Snowball can help you transfer truckloads of data in and out of the cloud.
- What is data management?
- AWS S3 basics
- S3 bucket creation
- S3 upload and logging
- S3 event notifications
- S3 data lifecycle configuration
- Working with Amazon Elastic Block Store volumes
- Creating and mounting an EFS
- Creating an AWS RDS instance
- RDS backup and recovery
- Moving data with AWS Database Migration Service
- Moving data with Data Pipeline and Glue