What are JOIN Operators
A join operator is a type of an algorithm that is used by the SQL Server Optimizer chooses in order to implement logical joins between two sets of data. Selection of algorithm different scenarios like the requested query, available indexes, statistics and number of estimated rows in each data set. Today, we read about available JOIN operator types in SQL Server (Nested Loops, Hash and Merge Joins), we also read about their complexity.
Firstly, we create two table and after that we applied join operator on both tables.
Student_Information Table
Branch_Info Table
Nested Loops
Nested loops is the simplest operator that works on bunch of data. The nested loops join, also called nested iteration, uses one join input as the outer input table (shown as the top input in the graphical execution plan) and one as the inner (bottom) input table. The outer loop consumes the outer input table row by row. The inner loop, executed for each outer row, searches for matching rows in the inner input table.
Example
- SELECT * FROM dbo.Branch_InfoAS BRA
- INNERJOIN
- dbo.Student_Information AS STU
- ON
- BRA.Branch_Id=STU.Branch_Id
Resulting execution plan for above query:
In above execution plan the operator on the top right is called the outer input (Student_Information) and the one just below it is called the inner input (Branch_Info). The outer input maintained a “Index Scan” label and inner input maintained an “
Index Seek”.
Index Scan
Since a scan touches every row in the table whether or not it qualifies, the cost is proportional to the total number of rows in the table. Thus, a scan is an efficient strategy if the table is small or if most of the rows qualify for the predicate.
In above index scan we can see that outer input only executed once and generated 12 rows.
Index Seek
Since a seek only touches rows that qualify and pages that contain these qualifying rows, the cost is proportional to the number of qualifying rows and pages rather than to the total number of rows in the table.
In above index seek we can see that number of execution is 12 that verify the inner input executed for each row from outer input.
Conclusion is that clustered index scan we see as the outer input is executed once to get all the relevant records, and the Non-Clustered index seek we see below is executed for each record from the outer input. We can see that execution plan contain two nested loops one for outer and inner input. Second nested loop is maintained for RID Lookup.
What is RID Lookup RID Lookup is a bookmark lookup on a heap using a supplied row identifier (RID). The Argument column contains the bookmark label used to look up the row in the table and the name of the table in which the row is looked up. A RID Lookup is a lookup into a heap table using a Row ID. The Row ID is included in a non-clustered index in order to find the rest of a table's data in the heap table, since a heap table is a table without a clustered index and is sorted unordered a Row ID is required for the correlation.
Complexity of Nested Loop
Suppose “
N” is number of rows from outer input and “
M” is total number of rows in inner Input, then complexity will be
O(Nlog(M)).
When to use
Used usually when one table is significantly small. If one join input is small (fewer than 10 rows) and the other join input is fairly large and indexed on its join columns, an index nested loops join is the fastest join operation because they require the least I/O and the fewest comparisons.
MERGE Join
The Merge Join provides an output that is generated by joining two sorted data sets using a FULL, LEFT, or INNER join. Merge Join requires that both inputs be sorted and that the joined columns have matching meta-data. It means we can’t join a column that has a numeric data type with a column that has character data type. We both column are string type then length of the column in the second input must be less than or equal to the length of the column in the first input with which it is merged. The merge join operation may be either a regular or a many-to-many operation. A many-to-many merge join uses a temporary table to store rows. If there are duplicate values from each input, one of the inputs will have to rewind to the start of the duplicates as each duplicate from the other input is processed.
Example: - SELECT * FROM dbo.Branch_Info AS BRA
- INNERJOIN
- dbo.Student_InformationAS STU
- ON
- BRA.Branch_Id=STU.Branch_Id
- OPTION (MERGEJOIN)
Execution Plan We can see that execution plan for Merge Join is completely different from the execution plan of Nested Loop. The following operations are performed in above execution plan.
- Nested loop is performed, in which Branch_Info work as outer input and “RID Lookup” work as inner input for Nested loop.
- Sort operation is performed for “Student_Information” table.
- Merge JOIN is performed on the data that is obtained from the step1 and step2.
Advantages of Merge Join Over Nested Loop:
- Both input operators in Merge Join are executed only once.
- The Merge Join simultaneously reads a row from each input and compares them using the join key. If there’s a match found then they are returned else the row with the smaller value can be discarded because, since both inputs are sorted, the discarded row will not match any other row on the other set of data.
Complexity of Merge Join
The maximum cost of a Merge Join is the sum of both inputs, so complexity of merge join can be defined as O(N+M).
When to use
If the two join inputs are not small but are sorted on their join column (for example, if they were obtained by scanning sorted indexes), a merge join is the fastest join operation. Merge join is excellent for very large tables.
HASH Match
The Hash Match physical operator builds a hash table by computing a hash value for each row from its build input. It’s the one operator chosen when the scenario doesn’t favor in any of the other join types or tables are not properly sorted, and/or there are no indexes.
Example:
- SELECT * FROM dbo.Branch_Info AS BRA
- INNERJOIN
- dbo.Student_Information AS STU
- ON
- BRA.Branch_Id=STU.Branch_Id
- OPTION (HASHJOIN)
Execution Plan
Hash join use the first (top) input to build the hash table and the second (bottom) input to probe the hash table.
Hash join use two important terms “
Hashing Function” and “
Hash table”.
Hashing Function
Hashing function is a function that takes one or more values and converts them to a single symbolic value usually in numeric. If we provide the same value as input to has function we will always get the same symbolic value as output. Hash function is one way function that means we can convert back symbolic value to original value.
Hash Table
Hash table is a data structure that divides all rows into equal-sized buckets, where each “
bucket” is represented by a hash value.
For the distinct or aggregate operators, use the input to build the hash table (removing duplicates and computing any aggregate expressions). When the hash table is built, scan the table and output all entries.
How HASH Match Work SQL SERVER first choose the smaller table as built input and create a hash table in memory. If there is not enough memory for the hash table, SQL Server will use physical disk space in TEMPDB. After the hash table is created, now SQL Server will get the data from probe input and compare it to hash table using the hash function and return the matched rows.
Complexity of HASH Match
Complexity for HASH Match is defined as:
O(N*hc+M*hm+J)
Where:
N: Smaller data set
M: Larger data set
Hc: Complexity of the hash table
Hm: Complexity of the hash match function
J: Complexity addition for the dynamic calculation and creation of the hash function.
When to use
Hash joins can efficiently process large, unsorted, non indexed inputs.
Conclusion
If we use an efficient Join Operator according the data tables, it can increase the performance query else it can decrease the performance. So first examine the size data in tables, after that select an appropriate Join operator.