In this video, take query processing one step further and learn how expression, table, and row subqueries are processed. Learn about correlated and noncorrelated subqueries and how these can be used to perform more complex logic and as a replacement for joins.
- [Tutor] Subqueries specify a table expression, a row expression, or a scalar expression. They're all derived from a query expression. In simple terms, it's a query that returns either a table, or a row, or a single value. A subquery may contain any clause except for order by, which is only allowed when used in conjunction with a limit or an offset clause. We'll see an example. Subqueries return a set, and a set has no order. A table subquery may return any number of rows and columns, and I will use the term derived table to refer to table expression subqueries. A row expression subquery may return only one row with one or more column. And lastly, a scalar subquery may return only one row and only one column, and that's why we can treat it as if it was a scalar expression. Trivia fact, a scalar is an element of a field which is used to define a vector space. A quantity described by multiple scalars, such as having both direction and magnitude, is called a vector. Logical query processing works exactly the same way for all types of subqueries. The same as normal queries, with a few exceptions. Each of the three subquery types can be either correlated or non-correlated. Non-correlated subqueries are standalone queries. They are independent of the query that they are contained in, which is often called their parent or the outer query. Correlated subqueries do not stand alone. They include an expression that references a value from the outer or parent query. Example. This sub query has no dependency or correlation to the parent query from FooBar. It stands by itself and returns the same result for all rows of the outer query. Although average foot will be returned for each row of Foobar, most databases will execute it just once and use the result for all rows to improve performance. But once per row or once per statement, it doesn't really matter. You can think of it any way that makes you happy. This sub query is correlated. There where predicate references Foo, a value that does not exist in the scope of the subquery. The result of this subquery depends on the current outer query row value of Foo, so the database must execute it again for each and every row. Well, if databases were a bit smarter, there could have executed the sub query for each unique value of Foo, but things are not always as simple as they seem. We need to show all adoptions, with their fees, the maximum fee that was ever paid, and the discount of the current fee from the maximum in percent. Select max adoption fee from adoptions finds the maximum fee ever. Now we can take it, put it in parentheses, and use that as an expression subquery inside the outer query that selects all rows from adoptions. These are two independent queries. The outer query select all adoptions, and the sub query always returns the max fee. It also happens to be from the same adoptions table. The maximum fee of $100 is returned for all rows. The discount percent for X from Y is calculated as Y minus X multiplied by a hundred and divided by Y. Now, Y is the max, and it appears twice in the formula, so we must repeat the sub query twice more. Let's execute, and we can see that Patrick got a 42% discount. He paid only $58 instead of a hundred. Diane came in driving a Ferrari and got no discount. Now we need to calculate the maximum fee for each species instead of overall. We need to correlate each fee to their respective adoption rows. And as promised, here is a cool way to visualize how correlations work. For dogs, we need a where that filters only for dogs. For cats, the literal value needs to be changed to a cat, and the same for rabbits. Now, instead of using static literal filter values, we can use a reference from the outer query so that the value of the filter will dynamically change based on the species for the current row from the outer query. Since both the query and the subquery select from the same table, we must provide unique aliases to distinguish between the species of the outer query, A1 species, and that of the subquery, A2 species. At runtime, A1 species will be replaced with the species from the outer row, and the max from A2 will be picked among the rows only for that same species in the second instance of the table A1. Next we need to show all attributes for people who adopted at least one animal. Persons are identified by their email, and their attributes are stored in the person's table. Select star from persons reveals we have 120 persons total. Select star from adoptions reveals we have 70 adoptions. You already know how to use an inner join between persons and adoptions to do that filter. Select from persons inner join adoptions on adopter email equals email will give us the correct result. Just don't forget the distinct set quantifier, or you will get the same person multiple times in case they adopted more than one animal. Great, we have 49 adopters in total. Another technique to do that would be to add a filter. Select from persons where email in, select email from adoptions, and here is a triple bonus points mini challenge. Let's execute this query and oops, something is wrong. We get all 120 persons back. How is that possible? This is a sneaky, hard to find, bug. Take more than one minute for this one. Pause the video for a few minutes and see if you can find the bug. If you found it, pat yourself twice on the back, because few people do. And the easiest way to show it is to execute the subquery by itself. And we get an error that email is an invalid column. There is no email column in adoptions. The email in adoptions is called adopter email. So how come this query ran successfully? When evaluating column references of a sub query, the database first tries to match aliases in the scope of the sub query. If it can't find a match, it will assume that this is a correlation to a column from the outer query, and will try to match it there. What happened here is that when email was not found in adoptions, the email from persons was used. Since the end predicate compares the same email to itself, it will always be true, and all persons are returned. Be very careful with aliases in subqueries. This bug has bitten many smart and experienced developers. The correct subquery is select adopter email from adoptions, and with this one, we now get the correct result of 49 adopters. Now, for another cool predicate called exists. Exists is typically used in the where clause, and is followed by a correlated subquery. It is evaluated for each row and the rows for which the sub query we turns at least one row will evaluate to true. Nothing prevents us from using a non-correlated subquery in exists, but it doesn't make much sense, since, if we do that, all rows will either evaluate to true or false. For example, where exists select from adoptions where species equal dog will return all 120 persons. The subquery always returns at least one row, since we do have dogs that were adopted, and the predicate will evaluate to true for all rows from the outer query. But if we change dog to elephant, all rows will evaluate to false. Exists only makes sense when it's evaluated per row. So that's why we need the correlation. To find people who adopted animals using exists, we will correlate their emails. That's alias persons as P, adoptions as A, and correlate where A adopter email equals P email. Let's execute, and voila, our 49 adopters show up. One more thing. Some find it confusing that I write a select null inside the exist sub query. The truth is that it doesn't matter. You can write null, star, or your full name. The select clause doesn't return anything when it's used with exists, it's there just to make the syntax valid. The reason I use nulls is mostly historical. Back in the days when SQL server was still Sybase, this actually had a performance impact, as the expressions were evaluated by the database, even though they weren't going anywhere. It also serves as a visual indicator that the subquery is part of an exists. Old habits die hard.
This course was created by Ami Levin. We are pleased to offer this training in our library.