Why is this faster and is it safe to use? (WHERE first letter is in the alphabet)
Long story short, we're updating small tables of people with values from a very large table of people. In a recent test, this update takes ca 5 minutes to run.
We stumbled upon what seems like the silliest optimization possible, that seemingly works perfectly! The same query now runs in less than 2 minutes and produces the same results, perfectly.
Here is the query. The last line is added as "the optimization". Why the intense decrease in query time? Are we missing something? Could this lead to problems in the future?
UPDATE smallTbl
SET smallTbl.importantValue = largeTbl.importantValue
FROM smallTableOfPeople smallTbl
JOIN largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(TRIM(smallTbl.last_name),TRIM(largeTbl.last_name)) = 4
AND DIFFERENCE(TRIM(smallTbl.first_name),TRIM(largeTbl.first_name)) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
AND LEFT(TRIM(largeTbl.last_name), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å')
Technical notes: We are aware that the list of letters to test might need a few more letters. We are also aware of the obvious margin for error when using "DIFFERENCE".
Query plan (regular): https://www.brentozar.com/pastetheplan/?id=rypV84y7V
Query plan (with "optimization"): https://www.brentozar.com/pastetheplan/?id=r1aC2my7E
sql-server optimization sql-server-2017
New contributor
|
show 9 more comments
Long story short, we're updating small tables of people with values from a very large table of people. In a recent test, this update takes ca 5 minutes to run.
We stumbled upon what seems like the silliest optimization possible, that seemingly works perfectly! The same query now runs in less than 2 minutes and produces the same results, perfectly.
Here is the query. The last line is added as "the optimization". Why the intense decrease in query time? Are we missing something? Could this lead to problems in the future?
UPDATE smallTbl
SET smallTbl.importantValue = largeTbl.importantValue
FROM smallTableOfPeople smallTbl
JOIN largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(TRIM(smallTbl.last_name),TRIM(largeTbl.last_name)) = 4
AND DIFFERENCE(TRIM(smallTbl.first_name),TRIM(largeTbl.first_name)) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
AND LEFT(TRIM(largeTbl.last_name), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å')
Technical notes: We are aware that the list of letters to test might need a few more letters. We are also aware of the obvious margin for error when using "DIFFERENCE".
Query plan (regular): https://www.brentozar.com/pastetheplan/?id=rypV84y7V
Query plan (with "optimization"): https://www.brentozar.com/pastetheplan/?id=r1aC2my7E
sql-server optimization sql-server-2017
New contributor
1
Can you post the execution plans to pastetheplan.com please?
– George.Palacios
16 hours ago
Query plan added
– JohnF
15 hours ago
4
Small reply to your technical note:AND LEFT(TRIM(largeTbl.last_name), 1) BETWEEN 'a' AND 'z' COLLATE LATIN1_GENERAL_CI_AI
should do what you want there without requiring you to list all characters and having code that's difficult to read
– Erik von Asmuth
12 hours ago
1
Also, how isTRIM
being used on SQL Server 2016 when it was introduced in 2017? Is the T-SQL wrong, or is this question tagged incorrectly (and it should be SQL Server 2017)?
– Solomon Rutzky
10 hours ago
1
@Joshua - no you can't. You could createdbo.Trim
but that isn't interchangeable withTRIM
– Martin Smith
9 hours ago
|
show 9 more comments
Long story short, we're updating small tables of people with values from a very large table of people. In a recent test, this update takes ca 5 minutes to run.
We stumbled upon what seems like the silliest optimization possible, that seemingly works perfectly! The same query now runs in less than 2 minutes and produces the same results, perfectly.
Here is the query. The last line is added as "the optimization". Why the intense decrease in query time? Are we missing something? Could this lead to problems in the future?
UPDATE smallTbl
SET smallTbl.importantValue = largeTbl.importantValue
FROM smallTableOfPeople smallTbl
JOIN largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(TRIM(smallTbl.last_name),TRIM(largeTbl.last_name)) = 4
AND DIFFERENCE(TRIM(smallTbl.first_name),TRIM(largeTbl.first_name)) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
AND LEFT(TRIM(largeTbl.last_name), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å')
Technical notes: We are aware that the list of letters to test might need a few more letters. We are also aware of the obvious margin for error when using "DIFFERENCE".
Query plan (regular): https://www.brentozar.com/pastetheplan/?id=rypV84y7V
Query plan (with "optimization"): https://www.brentozar.com/pastetheplan/?id=r1aC2my7E
sql-server optimization sql-server-2017
New contributor
Long story short, we're updating small tables of people with values from a very large table of people. In a recent test, this update takes ca 5 minutes to run.
We stumbled upon what seems like the silliest optimization possible, that seemingly works perfectly! The same query now runs in less than 2 minutes and produces the same results, perfectly.
Here is the query. The last line is added as "the optimization". Why the intense decrease in query time? Are we missing something? Could this lead to problems in the future?
UPDATE smallTbl
SET smallTbl.importantValue = largeTbl.importantValue
FROM smallTableOfPeople smallTbl
JOIN largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(TRIM(smallTbl.last_name),TRIM(largeTbl.last_name)) = 4
AND DIFFERENCE(TRIM(smallTbl.first_name),TRIM(largeTbl.first_name)) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
AND LEFT(TRIM(largeTbl.last_name), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å')
Technical notes: We are aware that the list of letters to test might need a few more letters. We are also aware of the obvious margin for error when using "DIFFERENCE".
Query plan (regular): https://www.brentozar.com/pastetheplan/?id=rypV84y7V
Query plan (with "optimization"): https://www.brentozar.com/pastetheplan/?id=r1aC2my7E
sql-server optimization sql-server-2017
sql-server optimization sql-server-2017
New contributor
New contributor
edited 9 hours ago
Martin Smith
62.2k10167249
62.2k10167249
New contributor
asked 16 hours ago
JohnFJohnF
414
414
New contributor
New contributor
1
Can you post the execution plans to pastetheplan.com please?
– George.Palacios
16 hours ago
Query plan added
– JohnF
15 hours ago
4
Small reply to your technical note:AND LEFT(TRIM(largeTbl.last_name), 1) BETWEEN 'a' AND 'z' COLLATE LATIN1_GENERAL_CI_AI
should do what you want there without requiring you to list all characters and having code that's difficult to read
– Erik von Asmuth
12 hours ago
1
Also, how isTRIM
being used on SQL Server 2016 when it was introduced in 2017? Is the T-SQL wrong, or is this question tagged incorrectly (and it should be SQL Server 2017)?
– Solomon Rutzky
10 hours ago
1
@Joshua - no you can't. You could createdbo.Trim
but that isn't interchangeable withTRIM
– Martin Smith
9 hours ago
|
show 9 more comments
1
Can you post the execution plans to pastetheplan.com please?
– George.Palacios
16 hours ago
Query plan added
– JohnF
15 hours ago
4
Small reply to your technical note:AND LEFT(TRIM(largeTbl.last_name), 1) BETWEEN 'a' AND 'z' COLLATE LATIN1_GENERAL_CI_AI
should do what you want there without requiring you to list all characters and having code that's difficult to read
– Erik von Asmuth
12 hours ago
1
Also, how isTRIM
being used on SQL Server 2016 when it was introduced in 2017? Is the T-SQL wrong, or is this question tagged incorrectly (and it should be SQL Server 2017)?
– Solomon Rutzky
10 hours ago
1
@Joshua - no you can't. You could createdbo.Trim
but that isn't interchangeable withTRIM
– Martin Smith
9 hours ago
1
1
Can you post the execution plans to pastetheplan.com please?
– George.Palacios
16 hours ago
Can you post the execution plans to pastetheplan.com please?
– George.Palacios
16 hours ago
Query plan added
– JohnF
15 hours ago
Query plan added
– JohnF
15 hours ago
4
4
Small reply to your technical note:
AND LEFT(TRIM(largeTbl.last_name), 1) BETWEEN 'a' AND 'z' COLLATE LATIN1_GENERAL_CI_AI
should do what you want there without requiring you to list all characters and having code that's difficult to read– Erik von Asmuth
12 hours ago
Small reply to your technical note:
AND LEFT(TRIM(largeTbl.last_name), 1) BETWEEN 'a' AND 'z' COLLATE LATIN1_GENERAL_CI_AI
should do what you want there without requiring you to list all characters and having code that's difficult to read– Erik von Asmuth
12 hours ago
1
1
Also, how is
TRIM
being used on SQL Server 2016 when it was introduced in 2017? Is the T-SQL wrong, or is this question tagged incorrectly (and it should be SQL Server 2017)?– Solomon Rutzky
10 hours ago
Also, how is
TRIM
being used on SQL Server 2016 when it was introduced in 2017? Is the T-SQL wrong, or is this question tagged incorrectly (and it should be SQL Server 2017)?– Solomon Rutzky
10 hours ago
1
1
@Joshua - no you can't. You could create
dbo.Trim
but that isn't interchangeable with TRIM
– Martin Smith
9 hours ago
@Joshua - no you can't. You could create
dbo.Trim
but that isn't interchangeable with TRIM
– Martin Smith
9 hours ago
|
show 9 more comments
1 Answer
1
active
oldest
votes
It depends on the data in your tables, your indexes, .... Hard to say without being able to compare the execution plans / the io + time statistics.
The difference I would expect is the extra filtering happening before the JOIN between the two tables.
In my example, I changed the updates to selects to reuse my tables.
The execution plan with "the optimization"
Execution Plan
You clearly see a filter operation happening, in my test data no records where filtered out and as a result no improvements where made.
The execution plan, without "the optimization"
Execution Plan
The filter is gone, which means that we will have to rely on the join to filter out unneeded records.
Other reason(s)
Another reason / consequence of changing the query could be, that a new execution plan was created when changing the query, which happens to be faster.
An example of this is the engine choosing a different Join operator, but that is just guessing at this point.
EDIT:
Clarifying after getting the two query plans:
The query is reading 550M Rows from the big table, and filtering them out.
Meaning that the predicate is the one doing most of the filtering, not the seek predicate. Resulting in the data being read, but way less being returned.
Making sql server use a different index (query plan) / adding an index could resolve this.
So why doesn't the optimize query have this same issue?
Because a different query plan is used, with a scan instead of a seek.
Without doing any seeks, but only returning 4M rows to work with.
Next difference
Disregarding the update difference ( nothing is being updated on the optimized query)
a hash match is used on the optimized query:
Instead of a nested loop join on the non-optimized:
A nested loop is best when one table is small and the other one big. Since they are both close to the same size I would argue that the hash match is the better choice in this case.
Overview
The optimized query
The Optimized query's plan has parallellism, uses a hash match join, and needs to do less residual IO filtering. It also uses a bitmap to eliminate key values that cannot produce any join rows . (Also nothing is being updated)
The non-optimized query
The non-Optimized query's plan has no parallellism, uses a nested loop join, and needs to do residual IO filtering on 550M records. (Also the update is happening)
What could you do to improve the non-optimized query?
Changing the index to have first_name & last_name in the key column
list:
CREATE INDEX IX_largeTableOfPeople_birth_date_first_name_last_name
on dbo.largeTableOfPeople(birth_date,first_name,last_name)
include(id)
But due to the use of functions and this table being big this might not be the optimal solution.
- Updating statistics, using recompile to try and get the better plan.
- Adding OPTION
(HASH JOIN, MERGE JOIN)
to the query - ...
Test data + Queries used
CREATE TABLE #smallTableOfPeople(importantValue int, birthDate datetime2, first_name varchar(50),last_name varchar(50));
CREATE TABLE #largeTableOfPeople(importantValue int, birth_date datetime2, first_name varchar(50),last_name varchar(50));
set nocount on;
DECLARE @i int = 1
WHILE @i <= 1000
BEGIN
insert into #smallTableOfPeople (importantValue,birthDate,first_name,last_name)
VALUES(NULL, dateadd(mi,@i,'2018-01-18 11:05:29.067'),'Frodo','Baggins');
set @i += 1;
END
set nocount on;
DECLARE @j int = 1
WHILE @j <= 20000
BEGIN
insert into #largeTableOfPeople (importantValue,birth_Date,first_name,last_name)
VALUES(@j, dateadd(mi,@j,'2018-01-18 11:05:29.067'),'Frodo','Baggins');
set @j += 1;
END
SET STATISTICS IO, TIME ON;
SELECT smallTbl.importantValue , largeTbl.importantValue
FROM #smallTableOfPeople smallTbl
JOIN #largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.last_name)),RTRIM(LTRIM(largeTbl.last_name))) = 4
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.first_name)),RTRIM(LTRIM(largeTbl.first_name))) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
AND LEFT(RTRIM(LTRIM(largeTbl.last_name)), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å');
SELECT smallTbl.importantValue , largeTbl.importantValue
FROM #smallTableOfPeople smallTbl
JOIN #largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.last_name)),RTRIM(LTRIM(largeTbl.last_name))) = 4
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.first_name)),RTRIM(LTRIM(largeTbl.first_name))) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
--AND LEFT(RTRIM(LTRIM(largeTbl.last_name)), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å')
drop table #largeTableOfPeople;
drop table #smallTableOfPeople;
add a comment |
Your Answer
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "182"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
JohnF is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227468%2fwhy-is-this-faster-and-is-it-safe-to-use-where-first-letter-is-in-the-alphabet%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
It depends on the data in your tables, your indexes, .... Hard to say without being able to compare the execution plans / the io + time statistics.
The difference I would expect is the extra filtering happening before the JOIN between the two tables.
In my example, I changed the updates to selects to reuse my tables.
The execution plan with "the optimization"
Execution Plan
You clearly see a filter operation happening, in my test data no records where filtered out and as a result no improvements where made.
The execution plan, without "the optimization"
Execution Plan
The filter is gone, which means that we will have to rely on the join to filter out unneeded records.
Other reason(s)
Another reason / consequence of changing the query could be, that a new execution plan was created when changing the query, which happens to be faster.
An example of this is the engine choosing a different Join operator, but that is just guessing at this point.
EDIT:
Clarifying after getting the two query plans:
The query is reading 550M Rows from the big table, and filtering them out.
Meaning that the predicate is the one doing most of the filtering, not the seek predicate. Resulting in the data being read, but way less being returned.
Making sql server use a different index (query plan) / adding an index could resolve this.
So why doesn't the optimize query have this same issue?
Because a different query plan is used, with a scan instead of a seek.
Without doing any seeks, but only returning 4M rows to work with.
Next difference
Disregarding the update difference ( nothing is being updated on the optimized query)
a hash match is used on the optimized query:
Instead of a nested loop join on the non-optimized:
A nested loop is best when one table is small and the other one big. Since they are both close to the same size I would argue that the hash match is the better choice in this case.
Overview
The optimized query
The Optimized query's plan has parallellism, uses a hash match join, and needs to do less residual IO filtering. It also uses a bitmap to eliminate key values that cannot produce any join rows . (Also nothing is being updated)
The non-optimized query
The non-Optimized query's plan has no parallellism, uses a nested loop join, and needs to do residual IO filtering on 550M records. (Also the update is happening)
What could you do to improve the non-optimized query?
Changing the index to have first_name & last_name in the key column
list:
CREATE INDEX IX_largeTableOfPeople_birth_date_first_name_last_name
on dbo.largeTableOfPeople(birth_date,first_name,last_name)
include(id)
But due to the use of functions and this table being big this might not be the optimal solution.
- Updating statistics, using recompile to try and get the better plan.
- Adding OPTION
(HASH JOIN, MERGE JOIN)
to the query - ...
Test data + Queries used
CREATE TABLE #smallTableOfPeople(importantValue int, birthDate datetime2, first_name varchar(50),last_name varchar(50));
CREATE TABLE #largeTableOfPeople(importantValue int, birth_date datetime2, first_name varchar(50),last_name varchar(50));
set nocount on;
DECLARE @i int = 1
WHILE @i <= 1000
BEGIN
insert into #smallTableOfPeople (importantValue,birthDate,first_name,last_name)
VALUES(NULL, dateadd(mi,@i,'2018-01-18 11:05:29.067'),'Frodo','Baggins');
set @i += 1;
END
set nocount on;
DECLARE @j int = 1
WHILE @j <= 20000
BEGIN
insert into #largeTableOfPeople (importantValue,birth_Date,first_name,last_name)
VALUES(@j, dateadd(mi,@j,'2018-01-18 11:05:29.067'),'Frodo','Baggins');
set @j += 1;
END
SET STATISTICS IO, TIME ON;
SELECT smallTbl.importantValue , largeTbl.importantValue
FROM #smallTableOfPeople smallTbl
JOIN #largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.last_name)),RTRIM(LTRIM(largeTbl.last_name))) = 4
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.first_name)),RTRIM(LTRIM(largeTbl.first_name))) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
AND LEFT(RTRIM(LTRIM(largeTbl.last_name)), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å');
SELECT smallTbl.importantValue , largeTbl.importantValue
FROM #smallTableOfPeople smallTbl
JOIN #largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.last_name)),RTRIM(LTRIM(largeTbl.last_name))) = 4
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.first_name)),RTRIM(LTRIM(largeTbl.first_name))) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
--AND LEFT(RTRIM(LTRIM(largeTbl.last_name)), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å')
drop table #largeTableOfPeople;
drop table #smallTableOfPeople;
add a comment |
It depends on the data in your tables, your indexes, .... Hard to say without being able to compare the execution plans / the io + time statistics.
The difference I would expect is the extra filtering happening before the JOIN between the two tables.
In my example, I changed the updates to selects to reuse my tables.
The execution plan with "the optimization"
Execution Plan
You clearly see a filter operation happening, in my test data no records where filtered out and as a result no improvements where made.
The execution plan, without "the optimization"
Execution Plan
The filter is gone, which means that we will have to rely on the join to filter out unneeded records.
Other reason(s)
Another reason / consequence of changing the query could be, that a new execution plan was created when changing the query, which happens to be faster.
An example of this is the engine choosing a different Join operator, but that is just guessing at this point.
EDIT:
Clarifying after getting the two query plans:
The query is reading 550M Rows from the big table, and filtering them out.
Meaning that the predicate is the one doing most of the filtering, not the seek predicate. Resulting in the data being read, but way less being returned.
Making sql server use a different index (query plan) / adding an index could resolve this.
So why doesn't the optimize query have this same issue?
Because a different query plan is used, with a scan instead of a seek.
Without doing any seeks, but only returning 4M rows to work with.
Next difference
Disregarding the update difference ( nothing is being updated on the optimized query)
a hash match is used on the optimized query:
Instead of a nested loop join on the non-optimized:
A nested loop is best when one table is small and the other one big. Since they are both close to the same size I would argue that the hash match is the better choice in this case.
Overview
The optimized query
The Optimized query's plan has parallellism, uses a hash match join, and needs to do less residual IO filtering. It also uses a bitmap to eliminate key values that cannot produce any join rows . (Also nothing is being updated)
The non-optimized query
The non-Optimized query's plan has no parallellism, uses a nested loop join, and needs to do residual IO filtering on 550M records. (Also the update is happening)
What could you do to improve the non-optimized query?
Changing the index to have first_name & last_name in the key column
list:
CREATE INDEX IX_largeTableOfPeople_birth_date_first_name_last_name
on dbo.largeTableOfPeople(birth_date,first_name,last_name)
include(id)
But due to the use of functions and this table being big this might not be the optimal solution.
- Updating statistics, using recompile to try and get the better plan.
- Adding OPTION
(HASH JOIN, MERGE JOIN)
to the query - ...
Test data + Queries used
CREATE TABLE #smallTableOfPeople(importantValue int, birthDate datetime2, first_name varchar(50),last_name varchar(50));
CREATE TABLE #largeTableOfPeople(importantValue int, birth_date datetime2, first_name varchar(50),last_name varchar(50));
set nocount on;
DECLARE @i int = 1
WHILE @i <= 1000
BEGIN
insert into #smallTableOfPeople (importantValue,birthDate,first_name,last_name)
VALUES(NULL, dateadd(mi,@i,'2018-01-18 11:05:29.067'),'Frodo','Baggins');
set @i += 1;
END
set nocount on;
DECLARE @j int = 1
WHILE @j <= 20000
BEGIN
insert into #largeTableOfPeople (importantValue,birth_Date,first_name,last_name)
VALUES(@j, dateadd(mi,@j,'2018-01-18 11:05:29.067'),'Frodo','Baggins');
set @j += 1;
END
SET STATISTICS IO, TIME ON;
SELECT smallTbl.importantValue , largeTbl.importantValue
FROM #smallTableOfPeople smallTbl
JOIN #largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.last_name)),RTRIM(LTRIM(largeTbl.last_name))) = 4
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.first_name)),RTRIM(LTRIM(largeTbl.first_name))) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
AND LEFT(RTRIM(LTRIM(largeTbl.last_name)), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å');
SELECT smallTbl.importantValue , largeTbl.importantValue
FROM #smallTableOfPeople smallTbl
JOIN #largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.last_name)),RTRIM(LTRIM(largeTbl.last_name))) = 4
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.first_name)),RTRIM(LTRIM(largeTbl.first_name))) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
--AND LEFT(RTRIM(LTRIM(largeTbl.last_name)), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å')
drop table #largeTableOfPeople;
drop table #smallTableOfPeople;
add a comment |
It depends on the data in your tables, your indexes, .... Hard to say without being able to compare the execution plans / the io + time statistics.
The difference I would expect is the extra filtering happening before the JOIN between the two tables.
In my example, I changed the updates to selects to reuse my tables.
The execution plan with "the optimization"
Execution Plan
You clearly see a filter operation happening, in my test data no records where filtered out and as a result no improvements where made.
The execution plan, without "the optimization"
Execution Plan
The filter is gone, which means that we will have to rely on the join to filter out unneeded records.
Other reason(s)
Another reason / consequence of changing the query could be, that a new execution plan was created when changing the query, which happens to be faster.
An example of this is the engine choosing a different Join operator, but that is just guessing at this point.
EDIT:
Clarifying after getting the two query plans:
The query is reading 550M Rows from the big table, and filtering them out.
Meaning that the predicate is the one doing most of the filtering, not the seek predicate. Resulting in the data being read, but way less being returned.
Making sql server use a different index (query plan) / adding an index could resolve this.
So why doesn't the optimize query have this same issue?
Because a different query plan is used, with a scan instead of a seek.
Without doing any seeks, but only returning 4M rows to work with.
Next difference
Disregarding the update difference ( nothing is being updated on the optimized query)
a hash match is used on the optimized query:
Instead of a nested loop join on the non-optimized:
A nested loop is best when one table is small and the other one big. Since they are both close to the same size I would argue that the hash match is the better choice in this case.
Overview
The optimized query
The Optimized query's plan has parallellism, uses a hash match join, and needs to do less residual IO filtering. It also uses a bitmap to eliminate key values that cannot produce any join rows . (Also nothing is being updated)
The non-optimized query
The non-Optimized query's plan has no parallellism, uses a nested loop join, and needs to do residual IO filtering on 550M records. (Also the update is happening)
What could you do to improve the non-optimized query?
Changing the index to have first_name & last_name in the key column
list:
CREATE INDEX IX_largeTableOfPeople_birth_date_first_name_last_name
on dbo.largeTableOfPeople(birth_date,first_name,last_name)
include(id)
But due to the use of functions and this table being big this might not be the optimal solution.
- Updating statistics, using recompile to try and get the better plan.
- Adding OPTION
(HASH JOIN, MERGE JOIN)
to the query - ...
Test data + Queries used
CREATE TABLE #smallTableOfPeople(importantValue int, birthDate datetime2, first_name varchar(50),last_name varchar(50));
CREATE TABLE #largeTableOfPeople(importantValue int, birth_date datetime2, first_name varchar(50),last_name varchar(50));
set nocount on;
DECLARE @i int = 1
WHILE @i <= 1000
BEGIN
insert into #smallTableOfPeople (importantValue,birthDate,first_name,last_name)
VALUES(NULL, dateadd(mi,@i,'2018-01-18 11:05:29.067'),'Frodo','Baggins');
set @i += 1;
END
set nocount on;
DECLARE @j int = 1
WHILE @j <= 20000
BEGIN
insert into #largeTableOfPeople (importantValue,birth_Date,first_name,last_name)
VALUES(@j, dateadd(mi,@j,'2018-01-18 11:05:29.067'),'Frodo','Baggins');
set @j += 1;
END
SET STATISTICS IO, TIME ON;
SELECT smallTbl.importantValue , largeTbl.importantValue
FROM #smallTableOfPeople smallTbl
JOIN #largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.last_name)),RTRIM(LTRIM(largeTbl.last_name))) = 4
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.first_name)),RTRIM(LTRIM(largeTbl.first_name))) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
AND LEFT(RTRIM(LTRIM(largeTbl.last_name)), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å');
SELECT smallTbl.importantValue , largeTbl.importantValue
FROM #smallTableOfPeople smallTbl
JOIN #largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.last_name)),RTRIM(LTRIM(largeTbl.last_name))) = 4
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.first_name)),RTRIM(LTRIM(largeTbl.first_name))) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
--AND LEFT(RTRIM(LTRIM(largeTbl.last_name)), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å')
drop table #largeTableOfPeople;
drop table #smallTableOfPeople;
It depends on the data in your tables, your indexes, .... Hard to say without being able to compare the execution plans / the io + time statistics.
The difference I would expect is the extra filtering happening before the JOIN between the two tables.
In my example, I changed the updates to selects to reuse my tables.
The execution plan with "the optimization"
Execution Plan
You clearly see a filter operation happening, in my test data no records where filtered out and as a result no improvements where made.
The execution plan, without "the optimization"
Execution Plan
The filter is gone, which means that we will have to rely on the join to filter out unneeded records.
Other reason(s)
Another reason / consequence of changing the query could be, that a new execution plan was created when changing the query, which happens to be faster.
An example of this is the engine choosing a different Join operator, but that is just guessing at this point.
EDIT:
Clarifying after getting the two query plans:
The query is reading 550M Rows from the big table, and filtering them out.
Meaning that the predicate is the one doing most of the filtering, not the seek predicate. Resulting in the data being read, but way less being returned.
Making sql server use a different index (query plan) / adding an index could resolve this.
So why doesn't the optimize query have this same issue?
Because a different query plan is used, with a scan instead of a seek.
Without doing any seeks, but only returning 4M rows to work with.
Next difference
Disregarding the update difference ( nothing is being updated on the optimized query)
a hash match is used on the optimized query:
Instead of a nested loop join on the non-optimized:
A nested loop is best when one table is small and the other one big. Since they are both close to the same size I would argue that the hash match is the better choice in this case.
Overview
The optimized query
The Optimized query's plan has parallellism, uses a hash match join, and needs to do less residual IO filtering. It also uses a bitmap to eliminate key values that cannot produce any join rows . (Also nothing is being updated)
The non-optimized query
The non-Optimized query's plan has no parallellism, uses a nested loop join, and needs to do residual IO filtering on 550M records. (Also the update is happening)
What could you do to improve the non-optimized query?
Changing the index to have first_name & last_name in the key column
list:
CREATE INDEX IX_largeTableOfPeople_birth_date_first_name_last_name
on dbo.largeTableOfPeople(birth_date,first_name,last_name)
include(id)
But due to the use of functions and this table being big this might not be the optimal solution.
- Updating statistics, using recompile to try and get the better plan.
- Adding OPTION
(HASH JOIN, MERGE JOIN)
to the query - ...
Test data + Queries used
CREATE TABLE #smallTableOfPeople(importantValue int, birthDate datetime2, first_name varchar(50),last_name varchar(50));
CREATE TABLE #largeTableOfPeople(importantValue int, birth_date datetime2, first_name varchar(50),last_name varchar(50));
set nocount on;
DECLARE @i int = 1
WHILE @i <= 1000
BEGIN
insert into #smallTableOfPeople (importantValue,birthDate,first_name,last_name)
VALUES(NULL, dateadd(mi,@i,'2018-01-18 11:05:29.067'),'Frodo','Baggins');
set @i += 1;
END
set nocount on;
DECLARE @j int = 1
WHILE @j <= 20000
BEGIN
insert into #largeTableOfPeople (importantValue,birth_Date,first_name,last_name)
VALUES(@j, dateadd(mi,@j,'2018-01-18 11:05:29.067'),'Frodo','Baggins');
set @j += 1;
END
SET STATISTICS IO, TIME ON;
SELECT smallTbl.importantValue , largeTbl.importantValue
FROM #smallTableOfPeople smallTbl
JOIN #largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.last_name)),RTRIM(LTRIM(largeTbl.last_name))) = 4
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.first_name)),RTRIM(LTRIM(largeTbl.first_name))) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
AND LEFT(RTRIM(LTRIM(largeTbl.last_name)), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å');
SELECT smallTbl.importantValue , largeTbl.importantValue
FROM #smallTableOfPeople smallTbl
JOIN #largeTableOfPeople largeTbl
ON largeTbl.birth_date = smallTbl.birthDate
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.last_name)),RTRIM(LTRIM(largeTbl.last_name))) = 4
AND DIFFERENCE(RTRIM(LTRIM(smallTbl.first_name)),RTRIM(LTRIM(largeTbl.first_name))) = 4
WHERE smallTbl.importantValue IS NULL
-- The following line is "the optimization"
--AND LEFT(RTRIM(LTRIM(largeTbl.last_name)), 1) IN ('a','à','á','b','c','d','e','è','é','f','g','h','i','j','k','l','m','n','o','ô','ö','p','q','r','s','t','u','ü','v','w','x','y','z','æ','ä','ø','å')
drop table #largeTableOfPeople;
drop table #smallTableOfPeople;
edited 13 hours ago
answered 16 hours ago
Randi VertongenRandi Vertongen
1,847316
1,847316
add a comment |
add a comment |
JohnF is a new contributor. Be nice, and check out our Code of Conduct.
JohnF is a new contributor. Be nice, and check out our Code of Conduct.
JohnF is a new contributor. Be nice, and check out our Code of Conduct.
JohnF is a new contributor. Be nice, and check out our Code of Conduct.
Thanks for contributing an answer to Database Administrators Stack Exchange!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f227468%2fwhy-is-this-faster-and-is-it-safe-to-use-where-first-letter-is-in-the-alphabet%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
1
Can you post the execution plans to pastetheplan.com please?
– George.Palacios
16 hours ago
Query plan added
– JohnF
15 hours ago
4
Small reply to your technical note:
AND LEFT(TRIM(largeTbl.last_name), 1) BETWEEN 'a' AND 'z' COLLATE LATIN1_GENERAL_CI_AI
should do what you want there without requiring you to list all characters and having code that's difficult to read– Erik von Asmuth
12 hours ago
1
Also, how is
TRIM
being used on SQL Server 2016 when it was introduced in 2017? Is the T-SQL wrong, or is this question tagged incorrectly (and it should be SQL Server 2017)?– Solomon Rutzky
10 hours ago
1
@Joshua - no you can't. You could create
dbo.Trim
but that isn't interchangeable withTRIM
– Martin Smith
9 hours ago