Sometimes you need to fetch random records from a database. The simplest
approach is to use NEWID() in the ORDER BY clause. There is nothing wrong with
that in itself, but once the table you are querying grows large, it starts to
cause performance problems. Why? I'll try to answer that question and offer an
alternative solution.
To start with, I have the AdventureWorks database with around 260,000 records in a table,
and a query that picks a random record from it.
SELECT TOP 1 [BusinessEntityID]
FROM HumanResources.EmployeePayHistory
ORDER BY NEWID()
The problem with this query is that, if you look at the execution plan, you'll
find it performs a sort operation before it picks the top record.
Around 91% of the cost is due to that sort. This is what happens when you use
NEWID() to pick random records: the engine first sorts all the rows by a freshly
generated NEWID() value, and sorting on GUID values is close to the worst case
for a sort.
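If you want to verify this yourself, here is a minimal sketch (assuming the standard
AdventureWorks HumanResources.EmployeePayHistory table) that turns on the I/O and
timing statistics before running the NEWID() query:

-- Report logical reads and CPU/elapsed time for the NEWID() query
SET STATISTICS IO ON;
SET STATISTICS TIME ON;

SELECT TOP 1 [BusinessEntityID]
FROM HumanResources.EmployeePayHistory
ORDER BY NEWID();

SET STATISTICS IO OFF;
SET STATISTICS TIME OFF;

The Messages tab in SQL Server Management Studio will then show the reads and
timings, including any worktable activity caused by the sort.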
Now, to get rid of the sort operation that NEWID() forces, we can use BINARY_CHECKSUM.
Instead of sorting on a NEWID() column generated in memory, the checksum is used as a
per-row filter, so no sort has to be performed at all.
SELECT TOP 1 [BusinessEntityID]
FROM HumanResources.EmployeePayHistory
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS int)) % 100) < 10
The BINARY_CHECKSUM function generates a checksum value based on the values of
the columns that you specify. If two rows are different, they will typically
generate different checksum numbers. BINARY_CHECKSUM is normally used to verify
whether any of the columns in a row have changed; for our purposes, however, it
produces a number that looks random for each row. Multiplying by RAND() and
taking the result modulo 100 means the filter (... % 100) < 10 keeps roughly 10
percent of the rows, and TOP 1 then returns a single row from that sample.
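If you need a smaller or larger sample, you can adjust the threshold of the modulo
comparison. As a minimal sketch (the threshold of 1 is just an illustrative value,
not from the original query), this keeps roughly 1 percent of the rows before
picking one:

-- Keep roughly 1 percent of the rows, then pick one of them
SELECT TOP 1 [BusinessEntityID]
FROM HumanResources.EmployeePayHistory
WHERE (ABS(CAST((BINARY_CHECKSUM(*) * RAND()) AS int)) % 100) < 1;

Keep in mind that a tighter filter can occasionally match no rows at all, so the
query may come back empty; a retry sketch for that case is shown further below.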
Another form of the BINARY_CHECKSUM query is:
SELECT TOP 1 [BusinessEntityID]
FROM HumanResources.EmployeePayHistory
WHERE (ABS(CAST((BINARY_CHECKSUM([BusinessEntityID], NEWID())) AS int)) % 100) < 10
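One caveat with the filter-based approach: since it samples rows first and then
picks from the sample, it can occasionally return no row at all. Here is a minimal
retry sketch (the @id variable is my own naming, not part of the original queries):

-- Retry until the sampled filter actually returns a row
DECLARE @id int = NULL;
WHILE @id IS NULL
BEGIN
    SELECT TOP 1 @id = [BusinessEntityID]
    FROM HumanResources.EmployeePayHistory
    WHERE (ABS(CAST((BINARY_CHECKSUM([BusinessEntityID], NEWID())) AS int)) % 100) < 10;
END;

SELECT @id AS RandomBusinessEntityID;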
Both forms of BINARY_CHECKSUM produce the same query execution plan with
similar cost.
Now let's look at the figures published by the Microsoft team comparing the
performance of the two queries against a 13-million-row table:
| 13 million rows | CPU | DataReads | WorkTableReads | TotalReads | Duration |
| NEWID query | 253 | 347,420 | 13,810,132 | 14,157,552 | 422,891 |
| BINARY_CHECKSUM query | 21 | 347,420 | 0 | 347,420 | 49,203 |
In our case here, I've also done a comparison of the query plans of the old
NEWID() query and the new BINARY_CHECKSUM query.
Just one more tip on improving performance: if you only need to pick random
records from small tables in SQL, you can still use NEWID() with no issues.
Let me know your feedback. Please leave a comment if you'd like to.