Manage data with Transact-SQL

  • 7/19/2017

In Skill 1.2 from Exam Ref 70-761 Querying Data with Transact-SQL, learn how to query multiple tables by using join statements based on provided tables, data, and requirements.

Skill 1.2: Query multiple tables by using joins

Often, data that you need to query is spread across multiple tables. The tables are usually related through keys, such as a foreign key in one side and a primary key in the other. Then you can use joins to query the data from the different tables and match the rows that need to be related. This section covers the different types of joins that T-SQL supports: cross, inner, and outer.

Before running the code samples in this skill, add a row to the Suppliers table by running the following code:

USE TSQLV4;

INSERT INTO Production.Suppliers
  (companyname, contactname, contacttitle, address, city, postalcode, country, phone)
  VALUES(N'Supplier XYZ', N'Jiru', N'Head of Security', N'42 Sekimai Musashino-shi',
         N'Tokyo', N'01759', N'Japan', N'(02) 4311-2609');

Cross joins

A cross join is the simplest type of join, though not the most commonly used one. This join performs what’s known as a Cartesian product of the two input tables. In other words, it performs a multiplication between the tables, yielding a row for each combination of rows from both sides. If you have m rows in table T1 and n rows in table T2, the result of a cross join between T1 and T2 is a virtual table with m × n rows. Figure 1-6 provides an illustration of a cross join.

Figure 1-6

Figure 1-6 Cross join

The left table has three rows with the key values A, B, and C. The right table has four rows with the key values B1, C1, C2, and D1. The result is a table with 12 rows containing all possible combinations of rows from the two input tables.

Consider an example from the TSQLV4 sample database. This database contains a table called dbo.Nums that has a column called n with a sequence of integers from 1 and on. Your task is to use the Nums table to generate a result with a row for each weekday (1 through 7) and shift number (1 through 3), assuming there are three shifts a day. The result can later be used as the basis for building information about activities in the different shifts in the different days. With seven days in the week, and three shifts every day, the result should have 21 rows.

Here’s a query that achieves the task by performing a cross join between two instances of the Nums table—one representing the days (aliased as D), and the other representing the shifts (aliased as S):

USE TSQLV4;

SELECT D.n AS theday, S.n AS shiftno
FROM dbo.Nums AS D
  CROSS JOIN dbo.Nums AS S
WHERE D.n <= 7
  AND S.N <= 3
ORDER BY theday, shiftno;

Here’s the output of this query:

theday      shiftno
----------- -----------
1           1
1           2
1           3
2           1
2           2
2           3
3           1
3           2
3           3
4           1
4           2
4           3
5           1
5           2
5           3
6           1
6           2
6           3
7           1
7           2
7           3

The Nums table has 100,000 rows. According to logical query processing, the first step in the processing of the query is evaluating the FROM clause. The cross join operates in the FROM clause, performing a Cartesian product between the two instances of Nums, yielding a table with 10,000,000,000 rows (not to worry, that’s only conceptually). Then the WHERE clause filters only the rows where the column D.n is less than or equal to 7, and the column S.n is less than or equal to 3. After applying the filter, the result has 21 qualifying rows. The SELECT clause then returns D.n aliasing it theday, and S.n aliasing it shiftno.

Fortunately, SQL Server doesn’t have to follow logical query processing literally as long as it can return the correct result. That’s what optimization is all about—returning the result as fast as possible. SQL Server knows that with a cross join followed by a filter it can evaluate the filters first (which is especially efficient when there are indexes to support the filters), and then match the remaining rows. This optimization technique is called predicate pushdown.

Note the importance of aliasing the tables in the join. For one, it’s convenient to refer to a table by using a shorter name. But in a self-join like ours, table aliasing is mandatory. If you don’t assign different aliases to the two instances of the table, you end up with an invalid result because there are duplicate column names even when including the table name as a prefix. By aliasing the tables differently, you can refer to columns in an unambiguous way using the form table_alias.column_name, as in D.n versus S.n.

Also, note that in addition to supporting the syntax for cross joins with the CROSS JOIN keywords, both standard SQL and T-SQL support an older syntax where you specify a comma between the table names, as in FROM T1, T2. However, for a number of reasons, it is recommended to stick to the newer syntax; it is less prone to errors and allows for more consistent code.

Inner joins

With an inner join, you can match rows from two tables based on a predicate—usually one that compares a primary key value in one side to a foreign key value in another side. Figure 1-7 illustrates an inner join.

Figure 1-7

Figure 1-7 Inner join

The letters represent primary key values in the left table and foreign key values in the right table. Assuming the join is an equijoin (using a predicate with an equality operator such as lefttable.keycol = righttable.keycol), the inner join returns only matching rows for which the predicate evaluates to true. Rows for which the predicate evaluates to false or unknown are discarded.

As an example, the following query returns suppliers from Japan and the products they supply:

SELECT
  S.companyname AS supplier, S.country,
  P.productid, P.productname, P.unitprice
FROM Production.Suppliers AS S
  INNER JOIN Production.Products AS P
    ON S.supplierid = P.supplierid
WHERE S.country = N'Japan';

Here’s the output of this query:

supplier        country  productid   productname    unitprice
--------------- -------- ----------- -------------- ----------
Supplier QOVFD  Japan    9           Product AOZBW  97.00
Supplier QOVFD  Japan    10          Product YHXGE  31.00
Supplier QWUSF  Japan    13          Product POXFU  6.00
Supplier QWUSF  Japan    14          Product PWCJB  23.25
Supplier QWUSF  Japan    15          Product KSZOI  15.50
Supplier QOVFD  Japan    74          Product BKAZJ  10.00

Observe that the join’s matching predicate is specified in the ON clause. It matches suppliers and products that share the same supplier ID. Rows from either side that don’t find a match in the other are discarded. For example, suppliers from Japan with no related products aren’t returned.

Regarding the last query, again, notice the convenience of using short table aliases when needing to refer to ambiguous column names like supplierid. Observe that the query uses table aliases to prefix even nonambiguous column names such as s.country. This practice isn’t mandatory as long as the column name is not ambiguous, but it is still considered a best practice for clarity.

A very common question is, “What’s the difference between the ON and the WHERE clauses, and does it matter if you specify your predicate in one or the other?” The answer is that for inner joins it doesn’t matter. Both clauses serve the same filtering purpose. Both filter only rows for which the predicate evaluates to true and discard rows for which the predicate evaluates to false or unknown. In terms of logical query processing, the WHERE is evaluated right after the FROM, so conceptually it is equivalent to concatenating the predicates with an AND operator, forming a conjunction of predicates. SQL Server knows this, and therefore can internally rearrange the order in which it evaluates the predicates in practice, and it does so based on cost estimates.

For these reasons, if you wanted, you could rearrange the placement of the predicates from the previous query, specifying both in the ON clause, and still retain the original meaning, as follows:

SELECT
  S.companyname AS supplier, S.country,
  P.productid, P.productname, P.unitprice
FROM Production.Suppliers AS S
  INNER JOIN Production.Products AS P
    ON S.supplierid = P.supplierid
    AND S.country = N'Japan';

For many people, though, it’s intuitive to specify the predicate that matches columns from both sides in the ON clause, and predicates that filter columns from only one side in the WHERE clause. Also think about the potential that in the future you will need to change an inner join to an outer join, where there’s a difference between the roles that the ON and WHERE clauses play, as I describe in the next section. But again, with inner joins it doesn’t matter. In the discussion of outer joins in the next section, you will see that, with those, ON and WHERE play different roles; you need to figure out, according to your needs, which is the appropriate clause for each of your predicates.

As another example for an inner join, the following query joins two instances of the HR.Employees table to match employees with their managers (a manager is also an employee, hence the self-join):

SELECT E.empid,
  E.firstname + N' ' + E.lastname AS emp,
  M.firstname + N' ' + M.lastname AS mgr
FROM HR.Employees AS E
  INNER JOIN HR.Employees AS M
    ON E.mgrid = M.empid;

This query generates the following output:

empid       emp                             mgr
----------- ------------------------------- -------------------------------
2           Don Funk                        Sara Davis
3           Judy Lew                        Don Funk
4           Yael Peled                      Judy Lew
5           Sven Mortensen                  Don Funk
6           Paul Suurs                      Sven Mortensen
7           Russell King                    Sven Mortensen
8           Maria Cameron                   Judy Lew
9           Patricia Doyle                  Sven Mortensen

Observe the join predicate: ON E.mgrid = M.empid. The employee instance is aliased as E and the manager instance as M. To find the right matches, the employee’s manager ID needs to be equal to the manager’s employee ID.

Note that only eight rows were returned even though there are nine rows in the table. The reason is that the CEO (Sara Davis, employee ID 1) has no manager, and therefore, her mgrid column is NULL. Remember that an inner join does not return rows that don’t find matches.

As with cross joins, both standard SQL and T-SQL support an older syntax for inner joins where you specify a comma between the table names, and then all predicates in the WHERE clause. But as mentioned, it is considered best practice to stick to the newer syntax with the JOIN keyword. When using the older syntax, if you forget to indicate the join predicate, you end up with an unintentional cross join. When using the newer syntax, an inner join isn’t valid syntactically without an ON clause, so if you forget to indicate the join predicate, the parser will generate an error.

Because an inner join is the most commonly used type of join, the standard decided to make it the default in case you specify just the JOIN keyword. So T1 JOIN T2 is equivalent to T1 INNER JOIN T2.

Outer joins

With outer joins, you can request to preserve all rows from one or both sides of the join, never mind if there are matching rows in the other side based on the ON predicate.

By using the keywords LEFT OUTER JOIN (or LEFT JOIN for short), you ask to preserve the left table. The join returns what an inner join normally would—that is, matches (call those inner rows). In addition, the join also returns rows from the left table that have no matches in the right table (call those outer rows), with NULLs used as placeholders in the right side. Figure 1-8 shows an illustration of a left outer join.

Figure 1-8

Figure 1-8 Left outer join

Unlike in the inner join, the left row with the key A is returned even though it has no match in the right side. It’s returned with NULLs as placeholders in the right side.

As an example, the following query returns suppliers from Japan and the products they supply, including suppliers from Japan that don’t have related products.

SELECT
  S.companyname AS supplier, S.country,
  P.productid, P.productname, P.unitprice
FROM Production.Suppliers AS S
  LEFT OUTER JOIN Production.Products AS P
    ON S.supplierid = P.supplierid
WHERE S.country = N'Japan';

This query generates the following output:

supplier        country  productid   productname    unitprice
--------------- -------- ----------- -------------- ----------
Supplier QOVFD  Japan    9           Product AOZBW  97.00
Supplier QOVFD  Japan    10          Product YHXGE  31.00
Supplier QOVFD  Japan    74          Product BKAZJ  10.00
Supplier QWUSF  Japan    13          Product POXFU  6.00
Supplier QWUSF  Japan    14          Product PWCJB  23.25
Supplier QWUSF  Japan    15          Product KSZOI  15.50
Supplier XYZ    Japan    NULL        NULL           NULL

Because the Production.Suppliers table is the preserved side of the join, Supplier XYZ is returned even though it has no matching products. As you recall, an inner join did not return this supplier.

It is very important to understand that, with outer joins, the ON and WHERE clauses play very different roles, and therefore, they aren’t interchangeable. The WHERE clause still plays a simple filtering role—namely, it keeps true cases and discards false and unknown cases. In our query, the WHERE clause filters only suppliers from Japan, so suppliers that aren’t from Japan simply don’t show up in the output.

However, the ON clause doesn’t play a simple filtering role; rather, it’s a more sophisticated matching role. In other words, a row in the preserved side will be returned whether the ON predicate finds a match for it or not. So the ON predicate only determines which rows from the nonpreserved side get matched to rows from the preserved side—not whether to return the rows from the preserved side. In our query, the ON clause matches rows from both sides by comparing their supplier ID values. Because it’s a matching predicate (as opposed to a filter), the join won’t discard suppliers; instead, it only determines which products get matched to each supplier. But even if a supplier has no matches based on the ON predicate, the supplier is still returned. In other words, ON is not final with respect to the preserved side of the join. WHERE is final. So when in doubt, whether to specify the predicate in the ON or WHERE clauses, ask yourself: Is the predicate used to filter or match? Is it supposed to be final or nonfinal?

With this in mind, guess what happens if you specify both the predicate that compares the supplier IDs from both sides, and the one comparing the supplier country to Japan in the ON clause? Try it.

SELECT
  S.companyname AS supplier, S.country,
  P.productid, P.productname, P.unitprice
FROM Production.Suppliers AS S
  LEFT OUTER JOIN Production.Products AS P
    ON S.supplierid = P.supplierid
   AND S.country = N'Japan';

Observe what’s different in the result (shown here in abbreviated form) and see if you can explain in your own words what the query returns now:

supplier        country  productid  productname    unitprice
--------------- -------- ---------- -------------- ----------
Supplier SWRXU  UK       NULL       NULL           NULL
Supplier VHQZD  USA      NULL       NULL           NULL
Supplier STUAZ  USA      NULL       NULL           NULL
Supplier QOVFD  Japan    9          Product AOZBW  97.00
Supplier QOVFD  Japan    10         Product YHXGE  31.00
Supplier QOVFD  Japan    74         Product BKAZJ  10.00
Supplier EQPNC  Spain    NULL       NULL           NULL
Supplier QWUSF  Japan    13         Product POXFU  6.00
Supplier QWUSF  Japan    14         Product PWCJB  23.25
Supplier QWUSF  Japan    15         Product KSZOI  15.50
...

(34 row(s) affected)

Now that both predicates appear in the ON clause, both serve a matching purpose. What this means is that all suppliers are returned—even those that aren’t from Japan. But in order to match a product to a supplier, the supplier IDs in both sides need to match, and the supplier country needs to be Japan.

Just like you can use a left outer join to preserve the left side, you can use a right outer join to preserve the right side. Use the keywords RIGHT OUTER JOIN (or RIGHT JOIN in short). Figure 1-9 shows an illustration of a right outer join.

Figure 1-9

Figure 1-9 Right outer join

T-SQL also supports a full outer join (FULL OUTER JOIN, or FULL JOIN in short) that preserves both sides. Figure 1-10 shows an illustration of this type of join.

Figure 1-10

Figure 1-10 Full outer join

A full outer join returns the matched rows, which are normally returned from an inner join; plus rows from the left that don’t have matches in the right, with NULLs used as placeholders in the right side; plus rows from the right that don’t have matches in the left, with NULLs used as placeholders in the left side. It’s not common to need a full outer join because most relationships between tables allow only one of the sides to have rows that don’t have matches in the other, in which case, a one-sided outer join is needed.

Queries with composite joins and NULLs in join columns

Some joins can be a bit tricky to handle, for instance when the join columns can have NULLs, or when you have multiple join columns—what’s known as a composite join. This section focuses on such cases.

Earlier in the inner joins section is a query that matched employees and their managers. Remember that the inner join eliminated the CEO’s row because the mgrid is NULL in that row, and therefore the join found no matching manager. If you want to include the CEO’s row, you need to use an outer join to preserve the side representing the employees (E) as follows:

SELECT E.empid,
  E.firstname + N' ' + E.lastname AS emp,
  M.firstname + N' ' + M.lastname AS mgr
FROM HR.Employees AS E
  LEFT OUTER JOIN HR.Employees AS M
    ON E.mgrid = M.empid;

Here’s the output of this query, this time including the CEO’s row:

empid       emp                             mgr
----------- ------------------------------- -------------------------------
1           Sara Davis                      NULL
2           Don Funk                        Sara Davis
3           Judy Lew                        Don Funk
4           Yael Peled                      Judy Lew
5           Sven Mortensen                  Don Funk
6           Paul Suurs                      Sven Mortensen
7           Russell King                    Sven Mortensen
8           Maria Cameron                   Judy Lew
9           Patricia Doyle                  Sven Mortensen

As a reminder, the order of the output is not guaranteed unless you add an ORDER BY clause to the query. This means that theoretically you can have the results returned in a different order than mine.

When you need to join tables that are related based on multiple columns, the join is called a composite join and the ON clause typically consists of a conjunction of predicates (predicates separated by AND operators) that match the corresponding columns from the two sides. Sometimes you need more complex predicates, especially when NULLs are involved. I’ll demonstrate this by using a pair of tables. One table is called EmpLocations and it holds employee locations and the number of employees in each location. Another table is called CustLocations and it holds customer locations and the number of customers in each location. Run the following code to create these tables and populate them with sample data:

DROP TABLE IF EXISTS dbo.EmpLocations;

SELECT country, region, city, COUNT(*) AS numemps
INTO dbo.EmpLocations
FROM HR.Employees
GROUP BY country, region, city;

ALTER TABLE dbo.EmpLocations ADD CONSTRAINT UNQ_EmpLocations
  UNIQUE CLUSTERED(country, region, city);

DROP TABLE IF EXISTS dbo.CustLocations;

SELECT country, region, city, COUNT(*) AS numcusts
INTO dbo.CustLocations
FROM Sales.Customers
GROUP BY country, region, city;

ALTER TABLE dbo.CustLocations ADD CONSTRAINT UNQ_CustLocations
  UNIQUE CLUSTERED(country, region, city);

There’s a key defined in both tables on the location attributes: country, region, and city. Instead of using a primary key constraint I used a unique constraint to enforce the key because the region attribute allows NULLs, and between the two types of constraints, only the latter allows NULLs. I also specified the CLUSTERED keyword in the unique constraint definitions to have SQL Server create a clustered index type to enforce the constraint’s uniqueness property. This index will be beneficial in supporting joins between the tables based on the location attributes as well filters based on those attributes.

Query the EmpLocations table to see its contents:

SELECT country, region, city, numemps
FROM dbo.EmpLocations;

This query generates the following output:

country         region          city            numemps
--------------- --------------- --------------- -----------
UK              NULL            London          4
USA             WA              Kirkland        1
USA             WA              Redmond         1
USA             WA              Seattle         2
USA             WA              Tacoma          1

Query the CustLocations table:

SELECT country, region, city, numcusts
FROM dbo.CustLocations;

This query generates the following output, shown here in abbreviated form:

country         region          city            numcusts
--------------- --------------- --------------- -----------
Argentina       NULL            Buenos Aires    3
Austria         NULL            Graz            1
Austria         NULL            Salzburg        1
Belgium         NULL            Bruxelles       1
Belgium         NULL            Charleroi       1
Brazil          RJ              Rio de Janeiro  3
Brazil          SP              Campinas        1
Brazil          SP              Resende         1
Brazil          SP              Sao Paulo       4
Canada          BC              Tsawassen       1
...

(69 row(s) affected)

Suppose that you needed to join the two tables returning only matched locations, with both the employee and customer counts returned along with the location attributes. Your first attempt might be to write a composite join with an ON clause that has a conjunction of simple equality predicates as follows:

SELECT EL.country, EL.region, EL.city, EL.numemps, CL.numcusts
FROM dbo.EmpLocations AS EL
  INNER JOIN dbo.CustLocations AS CL
    ON EL.country = CL.country
    AND EL.region = CL.region
    AND EL.city = CL.city;

This query generates the following output:

country         region          city            numemps     numcusts
--------------- --------------- --------------- ----------- -----------
USA             WA              Kirkland        1           1
USA             WA              Seattle         2           1

The problem is that the region column supports NULLs representing cases where the region is irrelevant (missing but inapplicable) and when you compare NULLs with an equality-based predicate the result is the logical value unknown, in which case the row is discarded. For instance, the location UK, NULL, London appears in both tables, and therefore you expect to see it in the result of the join, but you don’t. A common way for people to resolve this problem is to use the ISNULL or COALESCE functions to substitute a NULL in both sides with a value that can’t normally appear in the data, and this way when both sides are NULL you get a true back from the comparison. Here’s an example for implementing this solution using the ISNULL function:

SELECT EL.country, EL.region, EL.city, EL.numemps, CL.numcusts
FROM dbo.EmpLocations AS EL
  INNER JOIN dbo.CustLocations AS CL
    ON EL.country = CL.country
    AND ISNULL(EL.region, N'<N/A>') = ISNULL(CL.region, N'<N/A>')
    AND EL.city = CL.city;

This time the query generates the correct result:

country         region          city            numemps     numcusts
--------------- --------------- --------------- ----------- -----------
UK              NULL            London          4           6
USA             WA              Kirkland        1           1
USA             WA              Seattle         2           1

The problem with this approach is that once you apply manipulation to a column, SQL Server cannot trust that the result values preserve the same ordering behavior as the original values. This can negatively affect the ability of SQL Server to rely on index ordering when it optimizes the query. Our query gets the plan shown in Figure 1-11.

Figure 1-11

Figure 1-11 Plan for query with ISNULL

The plan scans the clustered index on EmpLocations, and for each row (employee location) performs a seek in the clustered index on CustLocations. However, notice that the seek relies on only the country attribute in the seek predicate. It cannot rely on the region and city attributes because of the manipulation that you applied to the region attribute. The predicates involving the region and city attributes appear as residual predicates (under the Predicate property). This means that for each employee location row, the Clustered Index Seek operator that is applied to the CustLocations index performs a range scan of the entire customer location’s country that is equal to the current employee location’s country. The residual predicates that are based on region and city then determine whether to keep or discard each row. That’s a lot of unnecessary effort.

The optimizer picked the nested loops strategy in the plan shown in Figure 1-11 because the sample tables that we used are so tiny. With bigger, more realistic, table sizes, the optimizer typically chooses a merge join algorithm when the data is preordered by the join columns in both sides. This algorithm processes both sides of the join based on join column order, and in a way, zips matching rows together. The data can either be pulled preordered from an index, or explicitly sorted. As mentioned, applying manipulation to join columns breaks the ordering property of the data, and therefore even if it’s preordered in an index, the optimizer cannot trust this order. To illustrate how this can affect the merge algorithm, force it in our query by adding the MERGE join hint as follows.

SELECT EL.country, EL.region, EL.city, EL.numemps, CL.numcusts
FROM dbo.EmpLocations AS EL
  INNER MERGE JOIN dbo.CustLocations AS CL
    ON EL.country = CL.country
    AND ISNULL(EL.region, N'<N/A>') = ISNULL(CL.region, N'<N/A>')
    AND EL.city = CL.city;

The plan for this query is shown in Figure 1-12.

Figure 1-12

Figure 1-12 Plan for query with ISNULL and MERGE algorithm

Observe that the clustered indexes on both tables are scanned in an Orderd: False fashion, meaning that the scan is not requested to return the data in index order. Then the join columns sort both sides explicitly before being merged.

You can handle NULLs in a manner that gives you the desired logical meaning and that at the same time is considered order preserving by the optimizer using the predicate: (EL.region = CL.region OR (EL.region IS NULL AND CL.region IS NULL)). Here’s the complete solution query:

SELECT EL.country, EL.region, EL.city, EL.numemps, CL.numcusts
FROM dbo.EmpLocations AS EL
  INNER JOIN dbo.CustLocations AS CL
    ON EL.country = CL.country
    AND (EL.region = CL.region OR (EL.region IS NULL AND CL.region IS NULL))
    AND EL.city = CL.city;

The plan for this query is shown in Figure 1-13.

Figure 1-13

Figure 1-13 Plan with order preservation

Notice that this time all predicates show up as seek predicates.

Similarly, with the new predicate, the optimizer can rely on index order when using the merge join algorithm. To demonstrate this, again, force this algorithm by adding the MERGE join hint as follows:

SELECT EL.country, EL.region, EL.city, EL.numemps, CL.numcusts
FROM dbo.EmpLocations AS EL
  INNER MERGE JOIN dbo.CustLocations AS CL
    ON EL.country = CL.country
    AND (EL.region = CL.region OR (EL.region IS NULL AND CL.region IS NULL))
    AND EL.city = CL.city;

The plan for this query is shown in Figure 1-14.

Figure 1-14

Figure 1-14 Plan with order preservation and MERGE algorithm

Observe that the plan scans both clustered indexes in order, and that there’s no explicit sorting taking place prior to the merge join.

Recall that when set operators combine query results they compare corresponding attributes using distinctness and not equality, producing true when comparing two NULLs. However, one drawback that set operators have is that they compare complete rows. Unlike joins, which allow comparing a subset of the attributes and return additional ones in the result, set operators must compare all attributes from the two input queries. But in T-SQL, you can combine joins and set operators to benefit from the advantages of both tools. Namely, rely on the distinctness-based comparison of set operators and the ability of joins to return additional attributes beyond what you compare. In our querying task, the solution looks like this:

SELECT EL.country, EL.region, EL.city, EL.numemps, CL.numcusts
FROM dbo.EmpLocations AS EL
  INNER JOIN dbo.CustLocations AS CL
    ON EXISTS (SELECT EL.country, EL.region, EL.city
               INTERSECT 
               SELECT CL.country, CL.region, CL.city);

For each row that is evaluated by the join, the set operator performs an intersection of the employee location attributes and customer location attributes using FROM-less SELECT statements, each producing one row. If the locations intersect, the result is one row, in which case the EXISTS predicate returns true, and the evaluated row is considered a match. If the locations don’t intersect, the result is an empty set, in which case the EXISTS predicate returns false, and the evaluated row is not considered a match. Remarkably, Microsoft added logic to the optimizer to consider this form order-preserving. The plan for this query is the same as the one shown earlier in Figure 1-13.

Use the following code to force the merge algorithm in the query:

SELECT EL.country, EL.region, EL.city, EL.numemps, CL.numcusts
FROM dbo.EmpLocations AS EL
  INNER MERGE JOIN dbo.CustLocations AS CL
    ON EXISTS (SELECT EL.country, EL.region, EL.city
               INTERSECT 
               SELECT CL.country, CL.region, CL.city);

Also here the ordering property of the data is preserved and you get the plan shown earlier in Figure 1-14, where the clustered indexes of both sides are scanned in order and there’s no need for explicit sorting prior to performing the merge join.

When you’re done, run the following code for cleanup:

DROP TABLE IF EXISTS dbo.CustLocations;
DROP TABLE IF EXISTS dbo.EmpLocations;

Multi-join queries

It’s important to remember that a join in T-SQL takes place conceptually between two tables at a time. A multi-join query evaluates the joins conceptually from left to right. So the result of one join is used as the left input to the next join. If you don’t understand this, you can end up with logical bugs, especially when outer joins are involved. (With inner and cross-joins, the order cannot affect the meaning.)

As an example, suppose that you wanted to return all suppliers from Japan, and matching products where relevant. For this, you need an outer join between Production.Suppliers and Production.Products, preserving Suppliers. But you also want to include product category information, so you add an inner join to Production.Categories, as follows:

SELECT
  S.companyname AS supplier, S.country,
  P.productid, P.productname, P.unitprice,
  C.categoryname
FROM Production.Suppliers AS S
  LEFT OUTER JOIN Production.Products AS P
    ON S.supplierid = P.supplierid
  INNER JOIN Production.Categories AS C
    ON C.categoryid = P.categoryid
WHERE S.country = N'Japan';

This query generates the following output:

supplier        country productid  productname    unitprice  categoryname
--------------- ------- ---------- -------------- ---------- -------------
Supplier QOVFD  Japan   9          Product AOZBW  97.00      Meat/Poultry
Supplier QOVFD  Japan   10         Product YHXGE  31.00      Seafood
Supplier QWUSF  Japan   13         Product POXFU  6.00       Seafood
Supplier QWUSF  Japan   14         Product PWCJB  23.25      Produce
Supplier QWUSF  Japan   15         Product KSZOI  15.50      Condiments
Supplier QOVFD  Japan   74         Product BKAZJ  10.00      Produce

Supplier XYZ from Japan was discarded. Can you explain why?

Conceptually, the first join included outer rows (suppliers with no products) but produced NULLs as placeholders in the product attributes in those rows. Then the join to Production.Categories compared the NULLs in the categoryid column in the outer rows to categoryid values in Production.Categories, and discarded those rows. In short, the inner join that followed the outer join nullified the outer part of the join. In fact, if you look at the query plan for this query, you will find that the optimizer didn’t even bother to process the join between Production.Suppliers and Production.Products as an outer join. It detected the contradiction between the outer join and the subsequent inner join, and converted the first join to an inner join too.

There are a number of ways to address this problem. One is to use a LEFT OUTER in both joins, like so:

SELECT
  S.companyname AS supplier, S.country,
  P.productid, P.productname, P.unitprice,
  C.categoryname
FROM Production.Suppliers AS S
  LEFT OUTER JOIN Production.Products AS P
    ON S.supplierid = P.supplierid
  LEFT OUTER JOIN Production.Categories AS C
    ON C.categoryid = P.categoryid
WHERE S.country = N'Japan';

Another option is to use an interesting capability in the language—separate some of the joins to their own independent logical phase. What you’re after is a left outer join between Production.Suppliers and the result of the inner join between Production.Products and Production.Categories. You can phrase your query exactly like this:

SELECT
  S.companyname AS supplier, S.country,
  P.productid, P.productname, P.unitprice,
  C.categoryname
FROM Production.Suppliers AS S
  LEFT OUTER JOIN 
    (Production.Products AS P
       INNER JOIN Production.Categories AS C
         ON C.categoryid = P.categoryid)
    ON S.supplierid = P.supplierid
WHERE S.country = N'Japan';

With both fixes, the query generates the correct result, including suppliers who currently don’t have related products:

supplier        country  productid  productname    unitprice  categoryname
--------------- -------- ---------- -------------- ---------- -------------
Supplier QOVFD  Japan    9          Product AOZBW  97.00      Meat/Poultry
Supplier QOVFD  Japan    10         Product YHXGE  31.00      Seafood
Supplier QOVFD  Japan    74         Product BKAZJ  10.00      Produce
Supplier QWUSF  Japan    13         Product POXFU  6.00       Seafood
Supplier QWUSF  Japan    14         Product PWCJB  23.25      Produce
Supplier QWUSF  Japan    15         Product KSZOI  15.50      Condiments
Supplier XYZ    Japan    NULL       NULL           NULL       NULL

Note that the important change that made the difference is the arrangement of the ON clauses with respect to the joined tables. The ON clause ordering is what defines the logical join ordering. Each ON clause must appear right below the two units that it joins. By specifying the ON clause that matches attributes from Production.Products and Production.Categories first, you set this inner join to be logically evaluated first. Then the second ON clause handles the left outer join by matching an attribute from Production.Suppliers with an attribute from the result of the inner join. Curiously, T-SQL doesn’t really require the parentheses that I added to the query; remove those and rerun the query, and you will see that it runs successfully. However, it’s recommended to use those for clarity.

When you’re done, run the following code to delete the supplier row that you added at the beginning of this skill:

DELETE FROM Production.Suppliers WHERE supplierid > 29;