What is the recommended way to join junction tables for efficient ordering/pagination?
Summary: I have a simple database schema, but even with just a few tens of thousands of records, performance on basic queries is already becoming a problem.
Database: PostgreSQL 9.6
Simplified schema:
CREATE TABLE article (
id bigint PRIMARY KEY,
title text NOT NULL,
score int NOT NULL
);
CREATE TABLE tag (
id bigint PRIMARY KEY,
name text NOT NULL
);
CREATE TABLE article_tag (
article_id bigint NOT NULL REFERENCES article (id),
tag_id bigint NOT NULL REFERENCES tag (id),
PRIMARY KEY (article_id, tag_id)
);
CREATE INDEX ON article (score);
Production data info:
All tables are read/write. Low write volume, only a new record every couple minutes or so.
Approximate record counts:
- ~66K articles
- ~63K tags
- ~147K article_tags
Average of 5 tags per article.
Question: I want to create a view article_tags that includes an array of tags for every article record, can be ordered by article.score, and can be paginated with or without additional filtering.
In my first attempt I was surprised to see that the query took ~350 ms to execute and wasn't using the indexes. In subsequent attempts I was able to get it down to ~5 ms but I don't understand what is going on. I would expect all these queries to take the same amount of time. What crucial concept am I missing here?
Attempts (SQL Fiddles):
- multi-table joins (~350 ms), (~5 ms if ordered by article.id!) -- seemed like the most natural solution
- subquery join (~300 ms) -- also seemed like a natural solution
- limited subquery join (~5 ms) -- super awkward, can't be used for view
- lateral join (~5 ms) -- is this really what I should be using? seems like a misuse of lateral
- ...something else?
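For reference, the fiddles themselves aren't reproduced here; the first attempt presumably looked roughly like the following. This is a hedged reconstruction based on the description ("multi-table joins", ordered by score), not the actual fiddle contents:

```sql
-- Attempt 1 (sketch): aggregate tag names per article, order by score.
-- The GROUP BY over all articles likely forces Postgres to aggregate
-- the whole table before ORDER BY / LIMIT can apply, which would
-- explain the ~350 ms timing reported above.
SELECT a.id, a.title, a.score, array_agg(t.name) AS tags
FROM   article a
LEFT   JOIN article_tag a_t ON a_t.article_id = a.id
LEFT   JOIN tag t           ON t.id = a_t.tag_id
GROUP  BY a.id
ORDER  BY a.score DESC
LIMIT  10;
```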
postgresql performance join postgresql-performance paging
In your fiddle they all take around the same time (~4ms). Same result if I execute them locally (all around ~12ms). Did you really get such significantly different results with the same test data you provided?
– sticky bit
Apr 29 '18 at 4:06
@stickybit No the test data is there just to run the queries and show the execution plan. The benchmark numbers are from my actual data set which consists of about ~66K articles, ~63K tags and ~147K article_tags.
– Lucifer Sam
Apr 29 '18 at 4:33
The plan might change with different data, so the sample data provided might not show the right thing.
– sticky bit
Apr 29 '18 at 4:38
You provided good information, but still forgot your version of Postgres. Also: is the table read-only? Or how much concurrent write activity? And how up-to-date do results have to be? How many distinct tags and how many tags per article? Are the numbers in your comment typical? If so, edit the question to provide this essential information there. Is this for retrieving all rows from table article (paginated), or a few selected rows - filtered how exactly?
– Erwin Brandstetter
Apr 29 '18 at 12:31
@stickybit Good point. I just compared and the plans are similar for each query but with a larger data set a couple of the seq scans are replaced with index scans.
– Lucifer Sam
Apr 29 '18 at 23:54
asked Apr 28 '18 at 22:16 by Lucifer Sam; edited Apr 30 '18 at 17:44
1 Answer
Pagination
For pagination, LIMIT (and OFFSET) are simple, but typically inefficient tools for bigger tables. Your tests with LIMIT 10 only show the tip of the iceberg. Performance is going to degrade with a growing OFFSET, no matter which query you choose.
If you have little or no concurrent write access, the superior solution is a MATERIALIZED VIEW with an added row number, plus an index on that. All your queries then select rows by row number.
Under concurrent write load, such an MV is outdated quickly (but a compromise like refreshing the MV CONCURRENTLY every N minutes may be acceptable). LIMIT / OFFSET is not going to work properly at all, since "the next page" is a moving target there, and LIMIT / OFFSET cannot cope with that. The best technique depends on undisclosed information.
Related:
- Database engine with counting b-trees for efficient paging support
- Optimize query with OFFSET on large table
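A common alternative to a growing OFFSET, discussed in questions like those linked above, is keyset ("seek") pagination: instead of skipping N rows, each request remembers the last row of the previous page and seeks past it. A sketch; the literal values stand in for the last row actually returned:

```sql
-- Needs an index matching the sort order:
CREATE INDEX ON article (score DESC, id DESC);

-- First page:
SELECT id, title, score
FROM   article
ORDER  BY score DESC, id DESC
LIMIT  10;

-- Next page: row-value comparison seeks past the last (score, id) seen,
-- so the cost stays flat no matter how deep you page.
SELECT id, title, score
FROM   article
WHERE  (score, id) < (41, 7337)  -- last row of the previous page (example values)
ORDER  BY score DESC, id DESC
LIMIT  10;
```

The trade-off: you can only step page by page, not jump to an arbitrary page number.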
Index
Your indexes generally look good. But your comment indicates that table tag has many rows. Typically, there is very little write load on a table like tag, which is perfect for index-only support. So add a multicolumn ("covering") index:
CREATE INDEX ON tag (id, name);
Related:
- Can Postgres use an index-only scan for this query with joined tables?
Just the top N rows
If you don't actually need more pages (which isn't strictly "paging"), then any query style is good that reduces qualifying rows from article before retrieving details from the related tables (expensively). Your "limited subquery" (3.) and "lateral join" (4.) solutions are good. But you can do better:
Use an ARRAY constructor for the LATERAL variant:
SELECT a.id, a.title, a.score, tags.names
FROM   article a
LEFT   JOIN LATERAL (
   SELECT ARRAY (
      SELECT t.name
      FROM   article_tag a_t
      JOIN   tag t ON t.id = a_t.tag_id
      WHERE  a_t.article_id = a.id
      -- ORDER BY t.id  -- optionally sort array elements
      )
   ) AS tags(names) ON true
ORDER  BY a.score DESC
LIMIT  10;
The LATERAL subquery assembles tags for a single article_id at a time, so GROUP BY article_id is redundant, as is the join condition ON tags.article_id = article.id, and a basic ARRAY constructor is cheaper than array_agg(tag.name) for the remaining simple case.
- Why is array_agg() slower than the non-aggregate ARRAY() constructor?
Or use a lowly correlated subquery, typically faster yet:
SELECT a.id, a.title, a.score
     , ARRAY (
        SELECT t.name
        FROM   article_tag a_t
        JOIN   tag t ON t.id = a_t.tag_id
        WHERE  a_t.article_id = a.id
        -- ORDER BY t.id  -- optionally sort array elements
        ) AS names
FROM   article a
ORDER  BY a.score DESC
LIMIT  10;
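Since the question asked for a view, the correlated-subquery form can be wrapped directly. A sketch using the article_tags name from the question; note that ORDER BY, LIMIT and any filters stay outside the view, which is exactly why this form stays fast - the subquery only runs for the rows each query actually returns:

```sql
CREATE VIEW article_tags AS
SELECT a.id, a.title, a.score
     , ARRAY (
        SELECT t.name
        FROM   article_tag a_t
        JOIN   tag t ON t.id = a_t.tag_id
        WHERE  a_t.article_id = a.id
        ) AS names
FROM   article a;

-- Ordering, paging and filtering happen per query:
SELECT * FROM article_tags
ORDER  BY score DESC
LIMIT  10 OFFSET 20;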
db<>fiddle here
SQL Fiddle
Thank you so much for this exceptional answer! The "ah ha" moment for me is realizing that LIMIT and OFFSET are NOT filtering the result set in the same way that WHERE does, even if the column that I am ordering by has an index. Everything makes a whole lot more sense now, including the alternative query methods you provided.
– Lucifer Sam
Apr 30 '18 at 17:57
Also, perhaps this is suited for another question, but why is there such a discrepancy between LIMIT / OFFSET execution times when ordering by id and score in my first attempt? Even if I update all the scores to equal the ids, paginating by the id column is faster by at least a factor of two, despite the fact that both columns are indexed integers with identical values. Is there something special about primary key columns that makes LIMIT / OFFSET paging more efficient when ordering by them?
– Lucifer Sam
Apr 30 '18 at 18:02
@LuciferSam: If you GROUP BY, ORDER BY and JOIN using the same column (id), Postgres can typically use a simpler, cheaper query plan. Even more so for a small LIMIT. The LATERAL variants should not exhibit much of the same difference. Related question: dba.stackexchange.com/q/18300/3684. EXPLAIN (with various options) exposes plan and performance details for each query. (And yes, this is stuff for another question. Comments are not the place.)
– Erwin Brandstetter
May 1 '18 at 0:24
Pagination
For pagination, LIMIT
(and OFFSET
) are simple, but typically inefficient tools for bigger tables. Your tests with LIMIT 10
only show the tip of the iceberg. Performance is going to degrade with a growing OFFSET
, no matter which query you choose.
If you have no or little concurrent write access, the superior solution is a MATERIALIZED VIEW
with an added row number, plus index on that. And all your queries select rows by row numbers.
Under concurrent write load, such a MV is outdated quickly (But a compromise like refreshing the MV CONCURRENTLY
every N minutes may be acceptable).LIMIT
/ OFFSET
is not going to work properly at all since "the next page" is a moving target there, and LIMIT
/ OFFSET
cannot cope with that. The best technique depends on undisclosed information.
Related:
- Database engine with counting b-trees for efficient paging support
- Optimize query with OFFSET on large table
Index
Your indexes generally look good. But your comment indicates that table tag
has many rows. Typically, there is very little write load on a table like tag
, which is perfect for index-only support. So add a multicolumn ("covering") index:
CREATE INDEX ON tag(id, name);
Related:
- Can Postgres use an index-only scan for this query with joined tables?
Just the top N rows
If you don't actually need more pages (which isn't strictly "paging"), then any query style is good that reduces qualifying rows from article
before retrieving details from the related tables (expensively). Your "limited subquery" (3.) and "lateral join" (4.) solutions are good. But you can do better:
Use an ARRAY
constructor for the LATERAL
variant:
SELECT a.id, a.title, a.score, tags.names
FROM article a
LEFT JOIN LATERAL (
SELECT ARRAY (
SELECT t.name
FROM article_tag a_t
JOIN tag t ON t.id = a_t.tag_id
WHERE a_t.article_id = a.id
-- ORDER BY t.id -- optionally sort array elements
)
) AS tags(names) ON true
ORDER BY a.score DESC
LIMIT 10;
The LATERAL
subquery assembles tags for a single article_id
at a time, so GROUP BY article_id
is redundant, as well as the join condition ON tags.article_id = article.id
, and a basic ARRAY
constructor is cheaper than array_agg(tag.name)
for the remaining simple case.
- Why is array_agg() slower than the non-aggregate ARRAY() constructor?
Or use a lowly correlated subquery, typically even faster, yet:
SELECT a.id, a.title, a.score
, ARRAY (
SELECT t.name
FROM article_tag a_t
JOIN tag t ON t.id = a_t.tag_id
WHERE a_t.article_id = a.id
-- ORDER BY t.id -- optionally sort array elements
) AS names
FROM article a
ORDER BY a.score DESC
LIMIT 10;
db<>fiddle here
SQL Fiddle
1
Thank you so much for this exceptional answer! The "ah ha" moment for me is realizing thatLIMIT
andOFFSET
are NOT filtering the result set in the same way thatWHERE
does even if the column that I am ordering by has an index. Everything makes a whole lot more sense now including the alternative query methods you provided.
– Lucifer Sam
Apr 30 '18 at 17:57
Also, perhaps this is suited for another question but why is there such a discrepancy betweenLIMIT
/OFFSET
execution times when ordering byid
andscore
in my first attempt? Even if I update all the scores to equal the ids, paginating by theid
column is faster by at least a factor of two despite the fact that both columns are indexed integers with identical values. Is there something special about primary key columns that makesLIMIT
/OFFSET
paging more efficient when ordering by them?
– Lucifer Sam
Apr 30 '18 at 18:02
@LuciferSam: If youGROUP BY
,ORDER BY
andJOIN
using the same column (id
), Postgres can typically use a simpler, cheaper query plan. Even more so for a smallLIMIT
. TheLATERAL
variants should not exhibit much of the same difference. Related question: dba.stackexchange.com/q/18300/3684EXPLAIN
(with various options) exposes plan and performance details for each query. (And yes, this is stuff for another question. Comments are not the place.)
– Erwin Brandstetter
May 1 '18 at 0:24
add a comment |
Pagination
For pagination, LIMIT
(and OFFSET
) are simple, but typically inefficient tools for bigger tables. Your tests with LIMIT 10
only show the tip of the iceberg. Performance is going to degrade with a growing OFFSET
, no matter which query you choose.
If you have no or little concurrent write access, the superior solution is a MATERIALIZED VIEW
with an added row number, plus index on that. And all your queries select rows by row numbers.
Under concurrent write load, such a MV is outdated quickly (But a compromise like refreshing the MV CONCURRENTLY
every N minutes may be acceptable).LIMIT
/ OFFSET
is not going to work properly at all since "the next page" is a moving target there, and LIMIT
/ OFFSET
cannot cope with that. The best technique depends on undisclosed information.
Related:
- Database engine with counting b-trees for efficient paging support
- Optimize query with OFFSET on large table
Index
Your indexes generally look good. But your comment indicates that table tag
has many rows. Typically, there is very little write load on a table like tag
, which is perfect for index-only support. So add a multicolumn ("covering") index:
CREATE INDEX ON tag(id, name);
Related:
- Can Postgres use an index-only scan for this query with joined tables?
Just the top N rows
If you don't actually need more pages (which isn't strictly "paging"), then any query style is good that reduces qualifying rows from article
before retrieving details from the related tables (expensively). Your "limited subquery" (3.) and "lateral join" (4.) solutions are good. But you can do better:
Use an ARRAY
constructor for the LATERAL
variant:
SELECT a.id, a.title, a.score, tags.names
FROM article a
LEFT JOIN LATERAL (
SELECT ARRAY (
SELECT t.name
FROM article_tag a_t
JOIN tag t ON t.id = a_t.tag_id
WHERE a_t.article_id = a.id
-- ORDER BY t.id -- optionally sort array elements
)
) AS tags(names) ON true
ORDER BY a.score DESC
LIMIT 10;
The LATERAL
subquery assembles tags for a single article_id
at a time, so GROUP BY article_id
is redundant, as well as the join condition ON tags.article_id = article.id
, and a basic ARRAY
constructor is cheaper than array_agg(tag.name)
for the remaining simple case.
- Why is array_agg() slower than the non-aggregate ARRAY() constructor?
Or use a lowly correlated subquery, typically even faster, yet:
SELECT a.id, a.title, a.score
, ARRAY (
SELECT t.name
FROM article_tag a_t
JOIN tag t ON t.id = a_t.tag_id
WHERE a_t.article_id = a.id
-- ORDER BY t.id -- optionally sort array elements
) AS names
FROM article a
ORDER BY a.score DESC
LIMIT 10;
db<>fiddle here
SQL Fiddle
1
Thank you so much for this exceptional answer! The "ah ha" moment for me is realizing thatLIMIT
andOFFSET
are NOT filtering the result set in the same way thatWHERE
does even if the column that I am ordering by has an index. Everything makes a whole lot more sense now including the alternative query methods you provided.
– Lucifer Sam
Apr 30 '18 at 17:57
Also, perhaps this is suited for another question but why is there such a discrepancy betweenLIMIT
/OFFSET
execution times when ordering byid
andscore
in my first attempt? Even if I update all the scores to equal the ids, paginating by theid
column is faster by at least a factor of two despite the fact that both columns are indexed integers with identical values. Is there something special about primary key columns that makesLIMIT
/OFFSET
paging more efficient when ordering by them?
– Lucifer Sam
Apr 30 '18 at 18:02
@LuciferSam: If youGROUP BY
,ORDER BY
andJOIN
using the same column (id
), Postgres can typically use a simpler, cheaper query plan. Even more so for a smallLIMIT
. TheLATERAL
variants should not exhibit much of the same difference. Related question: dba.stackexchange.com/q/18300/3684EXPLAIN
(with various options) exposes plan and performance details for each query. (And yes, this is stuff for another question. Comments are not the place.)
– Erwin Brandstetter
May 1 '18 at 0:24
add a comment |
Pagination
For pagination, LIMIT
(and OFFSET
) are simple, but typically inefficient tools for bigger tables. Your tests with LIMIT 10
only show the tip of the iceberg. Performance is going to degrade with a growing OFFSET
, no matter which query you choose.
If you have no or little concurrent write access, the superior solution is a MATERIALIZED VIEW
with an added row number, plus index on that. And all your queries select rows by row numbers.
Under concurrent write load, such a MV is outdated quickly (But a compromise like refreshing the MV CONCURRENTLY
every N minutes may be acceptable).LIMIT
/ OFFSET
is not going to work properly at all since "the next page" is a moving target there, and LIMIT
/ OFFSET
cannot cope with that. The best technique depends on undisclosed information.
Related:
- Database engine with counting b-trees for efficient paging support
- Optimize query with OFFSET on large table
Index
Your indexes generally look good. But your comment indicates that table tag
has many rows. Typically, there is very little write load on a table like tag
, which is perfect for index-only support. So add a multicolumn ("covering") index:
CREATE INDEX ON tag(id, name);
Related:
- Can Postgres use an index-only scan for this query with joined tables?
Just the top N rows
If you don't actually need more pages (which isn't strictly "paging"), then any query style is good that reduces qualifying rows from article
before retrieving details from the related tables (expensively). Your "limited subquery" (3.) and "lateral join" (4.) solutions are good. But you can do better:
Use an ARRAY
constructor for the LATERAL
variant:
SELECT a.id, a.title, a.score, tags.names
FROM article a
LEFT JOIN LATERAL (
SELECT ARRAY (
SELECT t.name
FROM article_tag a_t
JOIN tag t ON t.id = a_t.tag_id
WHERE a_t.article_id = a.id
-- ORDER BY t.id -- optionally sort array elements
)
) AS tags(names) ON true
ORDER BY a.score DESC
LIMIT 10;
The LATERAL
subquery assembles tags for a single article_id
at a time, so GROUP BY article_id
is redundant, as well as the join condition ON tags.article_id = article.id
, and a basic ARRAY
constructor is cheaper than array_agg(tag.name)
for the remaining simple case.
- Why is array_agg() slower than the non-aggregate ARRAY() constructor?
Or use a lowly correlated subquery, typically even faster, yet:
SELECT a.id, a.title, a.score
, ARRAY (
SELECT t.name
FROM article_tag a_t
JOIN tag t ON t.id = a_t.tag_id
WHERE a_t.article_id = a.id
-- ORDER BY t.id -- optionally sort array elements
) AS names
FROM article a
ORDER BY a.score DESC
LIMIT 10;
db<>fiddle here
SQL Fiddle
Pagination
For pagination, LIMIT
(and OFFSET
) are simple, but typically inefficient tools for bigger tables. Your tests with LIMIT 10
only show the tip of the iceberg. Performance is going to degrade with a growing OFFSET
, no matter which query you choose.
If you have no or little concurrent write access, the superior solution is a MATERIALIZED VIEW
with an added row number, plus index on that. And all your queries select rows by row numbers.
Under concurrent write load, such a MV is outdated quickly (But a compromise like refreshing the MV CONCURRENTLY
every N minutes may be acceptable).LIMIT
/ OFFSET
is not going to work properly at all since "the next page" is a moving target there, and LIMIT
/ OFFSET
cannot cope with that. The best technique depends on undisclosed information.
Related:
- Database engine with counting b-trees for efficient paging support
- Optimize query with OFFSET on large table
Index
Your indexes generally look good. But your comment indicates that table tag
has many rows. Typically, there is very little write load on a table like tag
, which is perfect for index-only support. So add a multicolumn ("covering") index:
CREATE INDEX ON tag(id, name);
Related:
- Can Postgres use an index-only scan for this query with joined tables?
Just the top N rows
If you don't actually need more pages (which isn't strictly "paging"), then any query style is good that reduces qualifying rows from article
before retrieving details from the related tables (expensively). Your "limited subquery" (3.) and "lateral join" (4.) solutions are good. But you can do better:
Use an ARRAY
constructor for the LATERAL
variant:
SELECT a.id, a.title, a.score, tags.names
FROM article a
LEFT JOIN LATERAL (
SELECT ARRAY (
SELECT t.name
FROM article_tag a_t
JOIN tag t ON t.id = a_t.tag_id
WHERE a_t.article_id = a.id
-- ORDER BY t.id -- optionally sort array elements
)
) AS tags(names) ON true
ORDER BY a.score DESC
LIMIT 10;
The `LATERAL` subquery assembles tags for a single `article_id` at a time, so `GROUP BY article_id` is redundant, as is the join condition `ON tags.article_id = article.id`, and a basic `ARRAY` constructor is cheaper than `array_agg(tag.name)` for the remaining simple case.
- Why is array_agg() slower than the non-aggregate ARRAY() constructor?
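For contrast, the aggregate variant the paragraph above refers to would look roughly like this (shown only for comparison, not as a recommendation):

```sql
SELECT a.id, a.title, a.score, tags.names
FROM   article a
LEFT   JOIN LATERAL (
   SELECT array_agg(t.name) AS names  -- aggregate: more overhead than ARRAY(...)
   FROM   article_tag a_t
   JOIN   tag t ON t.id = a_t.tag_id
   WHERE  a_t.article_id = a.id
   ) tags ON true
ORDER  BY a.score DESC
LIMIT  10;
```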
Or use a lowly correlated subquery, typically even faster, yet:
SELECT a.id, a.title, a.score
, ARRAY (
SELECT t.name
FROM article_tag a_t
JOIN tag t ON t.id = a_t.tag_id
WHERE a_t.article_id = a.id
-- ORDER BY t.id -- optionally sort array elements
) AS names
FROM article a
ORDER BY a.score DESC
LIMIT 10;
db<>fiddle here
SQL Fiddle
answered Apr 29 '18 at 13:30 – Erwin Brandstetter
Thank you so much for this exceptional answer! The "ah ha" moment for me is realizing that `LIMIT` and `OFFSET` are NOT filtering the result set in the same way that `WHERE` does, even if the column that I am ordering by has an index. Everything makes a whole lot more sense now, including the alternative query methods you provided.
– Lucifer Sam
Apr 30 '18 at 17:57
Also, perhaps this is suited for another question, but why is there such a discrepancy between `LIMIT` / `OFFSET` execution times when ordering by `id` and `score` in my first attempt? Even if I update all the scores to equal the ids, paginating by the `id` column is faster by at least a factor of two, despite the fact that both columns are indexed integers with identical values. Is there something special about primary key columns that makes `LIMIT` / `OFFSET` paging more efficient when ordering by them?
– Lucifer Sam
Apr 30 '18 at 18:02
@LuciferSam: If you `GROUP BY`, `ORDER BY` and `JOIN` using the same column (`id`), Postgres can typically use a simpler, cheaper query plan. Even more so for a small `LIMIT`. The `LATERAL` variants should not exhibit much of the same difference. Related question: dba.stackexchange.com/q/18300/3684. `EXPLAIN` (with various options) exposes plan and performance details for each query. (And yes, this is stuff for another question. Comments are not the place.)
– Erwin Brandstetter
May 1 '18 at 0:24
In your fiddle they all take around the same time (~4ms). Same result if I execute them locally (all around ~12ms). Did you really get such significantly different results with the same test data you provided?
– sticky bit
Apr 29 '18 at 4:06
@stickybit No the test data is there just to run the queries and show the execution plan. The benchmark numbers are from my actual data set which consists of about ~66K articles, ~63K tags and ~147K article_tags.
– Lucifer Sam
Apr 29 '18 at 4:33
The plan might change with different data, so the sample data provided might not show the right thing.
– sticky bit
Apr 29 '18 at 4:38
You provided good information, but still forgot your version of Postgres. Also: is the table read-only? Or how much concurrent write activity? And how up-to-date do results have to be? How many distinct tags and how many tags per article? Are the numbers in your comment typical? If so, edit the question to provide this essential information there. Is this for retrieving all rows from table `article` (paginated), or a few selected rows, filtered how exactly?
– Erwin Brandstetter
Apr 29 '18 at 12:31
@stickybit Good point. I just compared and the plans are similar for each query but with a larger data set a couple of the seq scans are replaced with index scans.
– Lucifer Sam
Apr 29 '18 at 23:54