Learn about combining queries from multiple sources.
- [Instructor] As I mentioned before, one of the interesting things that Presto gives you is the ability to combine data sources inside of a single query. So, as you saw in our last example, we found the top 20 customers with an above average account balance, and this was all coming from mysql, and that was pretty straightforward. It was a pretty basic example of some analysis you may come across. Now, what if we wanted to add another table from hive that's related to our mysql data? Well, let's take a look.
First, we need to take a look at what's inside of hive, so, show schemas from hive, and you can see we have default and information schema, so, I assume we'll be in default, so I'll say show tables from hive.default. Okay, you can see we have foo, nation, and region. Foo, if you recall earlier, we created that one, now, I believe nation and region are both tied to this, so, let's just take a look at the my SQL table again. Select star from mysql.test.customer limit 10, okay, so we have customer key, name, address, ah, nationkey.
So, we have nationkey here, which looks like it'd be related to nation, and that's about it. It doesn't look like I have the region key, so, I'll take Q, now let's just go take a look at that hive table real quick. Select star from hive.default.nation limit 10, ah, you can see we have n underscore nationkey. Now, that looks like it should correlate to the nationkey column in my customer table in mysql. So, let's exit out of this, and let's try to join those two.
So, first I'm just going to do a very simple query here which would be select, count (1) from mysql.test.customer, then we'll say as c for customer, and then we'll join to hive.default.nation as n, so, we'll give it an alias as well, on c.nationkey = n.n_nationkey.
So this will just tell me if the tables join and there are results being returned, and you can see, we got a 100 results. If you recall, there are only 100 rows in that mysql table, so that means they all match, that means we have 100% matches from these two tables when we're trying to join them, which is great. So, let's now add that to our analytical query that we started with, and if you don't recall, just hit the up arrow a few times until we get there. This is the top 20 customers with an above average account balance sorted in descending order, so it has the market segment, their account balance, the average account balance for that market segment, and then the difference.
So, with this data, if I wanted to join it up to my hive table, I simply need to add that joint condition to my query. So, I'll paste it in here, some of those parts of that initial query that we had. We start with the with statements that gives us the average for the market segments, then we specify the columns we want, and there's a couple of new ones here, I'll paste these in. N_name is the name of the nation, and everything else there is the same, then we go from our table in mysql, and we give it an alias of cust, so we know what that is, we join to the bals table, which is the one we just created up above, then we need to join to the hive table.
So, this is where I'll type it out for you. So, we do join hive.default.nation, so we're now joining over to an entirely different data platform call it nat for short, and we're joining on cust.nationkey = nat.n_nationkey, then we say where the account balance minus the average balance is greater than zero, meaning it's an above average balance, then we order by the diff. Order by diff descending fashion, and we limit that to the top 20 rows.
So, the one thing you note about this that's really interesting is that although we're combining data from mysql and hive in a single query, there's no special syntax we have to use, no complex connection info, all that has already been predefined in Presto in the catalog. So, when we just say, hive.default.nation, it knows everything it needs to in order to run that query, get that data, and give us our answer here. So, when I run this, you can see I have the similar results as above, except now I've added the nation to my result set, and you can continue on, and you can use multiple withs statements to create multiple different aliases for result sets, and combine them in a much simpler way.
Data science expert Ben Sullins helps you get up to speed with Presto, and leverage it to accomplish a wide-range of data science and analytics tasks. He uses different interfaces with Presto—such as R and Tableau—and digs into the expressive SQL language that Presto offers for your analysis. At the end of this course, you'll know the key concepts of Presto and how to use them to take full advantage of your modern big data system.
- What does Presto do?
- Running Presto
- Connecting from Tableau and R
- Connecting to Hive, MySQL, and the local system
- Retrieving data
- Combining data sources
- Basic SQL functions
- Advanced SQL functions
- Migrating from Hive