Optimizing “random” choice in PostgreSQL

I have the following table (named article in my first draft; the query and plans below use articles):

CREATE TABLE articles (
    id bigserial PRIMARY KEY,
    text TEXT NOT NULL,
    category_id bigint REFERENCES categories (id),
    status TEXT NOT NULL,
    locale TEXT NOT NULL,
    source TEXT NOT NULL
);


from which I want to select, say, 10 random records from as many distinct categories as possible (hence the ORDER BY rn ASC clause in the CTE). I'm using the following CTE to achieve this:



WITH "partitioned_articles" AS
(
SELECT
*,
ROW_NUMBER() OVER (PARTITION BY "articles"."category_id"
ORDER BY RANDOM() ASC) AS "rn"
FROM
"articles"
WHERE
"articles"."status" = 'PUBLISHED' AND
LOWER("articles"."locale") = LOWER('en_US') AND
"articles"."source" = 'WIKIPEDIA'
ORDER BY
rn ASC
)
SELECT
*
FROM
"partitioned_articles"
LIMIT
10


EXPLAIN ANALYZE originally gave me the following output:



Limit  (cost=45587.17..45587.37 rows=10 width=473) (actual time=795.552..795.557 rows=10 loops=1)
  CTE partitioned_articles
    ->  Sort  (cost=45586.04..45587.17 rows=452 width=306) (actual time=795.545..795.547 rows=10 loops=1)
          Sort Key: (row_number() OVER (?))
          Sort Method: external merge  Disk: 21984kB
          ->  WindowAgg  (cost=45555.93..45566.10 rows=452 width=306) (actual time=574.822..673.524 rows=89190 loops=1)
                ->  Sort  (cost=45555.93..45557.06 rows=452 width=306) (actual time=574.808..623.847 rows=89190 loops=1)
                      Sort Key: articles.category_id, (random())
                      Sort Method: external merge  Disk: 20368kB
                      ->  Seq Scan on articles  (cost=0.00..45536.00 rows=452 width=306) (actual time=0.190..364.287 rows=89190 loops=1)
                            Filter: ((status = 'PUBLISHED'::text) AND (source = 'WIKIPEDIA'::text) AND (lower(locale) = 'en_us'::text))
                            Rows Removed by Filter: 310810
  ->  CTE Scan on partitioned_articles  (cost=0.00..9.04 rows=452 width=473) (actual time=795.550..795.554 rows=10 loops=1)
Planning time: 0.462 ms
Execution time: 805.519 ms


I tried adding



CREATE INDEX articles__locale ON articles (locale);
CREATE INDEX articles__source ON articles (source);
CREATE INDEX articles__status ON articles (status);


and ended up with the following execution plan:



Limit  (cost=43393.70..43393.90 rows=10 width=473) (actual time=686.982..686.991 rows=10 loops=1)
  CTE partitioned_articles
    ->  Sort  (cost=43392.57..43393.70 rows=452 width=306) (actual time=686.976..686.978 rows=10 loops=1)
          Sort Key: (row_number() OVER (?))
          Sort Method: external merge  Disk: 21976kB
          ->  WindowAgg  (cost=43362.47..43372.64 rows=452 width=306) (actual time=462.080..562.038 rows=89190 loops=1)
                ->  Sort  (cost=43362.47..43363.60 rows=452 width=306) (actual time=462.064..511.409 rows=89190 loops=1)
                      Sort Key: articles.category_id, (random())
                      Sort Method: external merge  Disk: 20368kB
                      ->  Bitmap Heap Scan on articles  (cost=3102.54..43342.54 rows=452 width=306) (actual time=35.149..216.145 rows=89190 loops=1)
                            Recheck Cond: (status = 'PUBLISHED'::text)
                            Filter: ((source = 'WIKIPEDIA'::text) AND (lower(locale) = 'en_us'::text))
                            Rows Removed by Filter: 44388
                            Heap Blocks: exact=12897
                            ->  Bitmap Index Scan on articles__status  (cost=0.00..3102.42 rows=135200 width=0) (actual time=30.421..30.421 rows=133578 loops=1)
                                  Index Cond: (status = 'PUBLISHED'::text)
  ->  CTE Scan on partitioned_articles  (cost=0.00..9.04 rows=452 width=473) (actual time=686.981..686.988 rows=10 loops=1)
Planning time: 0.621 ms
Execution time: 696.945 ms


which, from what I understand, is only marginally better. Is there anything else I can do to further optimize this query? I'm using PostgreSQL 9.4 (planning to migrate to 9.5 as soon as possible, but unfortunately that is still not an option) and the articles table currently has about 400,000 records.



PS - These names have been made up for the example; I'm not really scraping anything from Wikipedia or from anywhere else ;-)
postgresql postgresql-9.4 execution-plan cte random
edited Mar 7 '16 at 15:30 by Vérace

asked Mar 7 '16 at 12:25 by Tony E. Stark
  • What does "fictional" mean?

    – ypercubeᵀᴹ
    Mar 7 '16 at 12:36

  • I think using 9.5's tablesample could speed this up. Especially together with the tsm_system_rows extension: postgresql.org/docs/current/static/tsm-system-rows.html (see the sketch after these comments)

    – a_horse_with_no_name
    Mar 7 '16 at 12:36

  • "Fictional" means that the names are fictional (made up for this example).

    – Tony E. Stark
    Mar 7 '16 at 12:46

  • @dezso I've edited the question to include the output of EXPLAIN ANALYZE.

    – Tony E. Stark
    Mar 7 '16 at 12:52

  • See here and here and here

    – a_horse_with_no_name
    Mar 7 '16 at 12:56
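
For reference, the TABLESAMPLE approach mentioned above would look roughly like this on PostgreSQL 9.5+ (a sketch only: it does not apply to the asker's 9.4, the sample is taken before the WHERE filters are applied so fewer than 10 matching rows may come back, and it makes no attempt to spread rows across categories):

CREATE EXTENSION IF NOT EXISTS tsm_system_rows;

-- grab roughly 100 random rows from the heap first, then filter and trim to 10
SELECT *
FROM articles TABLESAMPLE SYSTEM_ROWS(100)
WHERE status = 'PUBLISHED'
  AND LOWER(locale) = LOWER('en_US')
  AND source = 'WIKIPEDIA'
LIMIT 10;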
1 Answer
Adding indexes isn't going to help much: ordering by a random value gives the engine no choice but to scan the whole table and sort on disk[1], whether the ordering is done directly in an ORDER BY clause or via a window function.



[1] though for small data sets it might not actually touch disk



Without sampling functions being available, there is very little you can do in a simple view. It might be possible for tables with integer keys to hack something efficient using a recursive CTE, as in MS SQL Server or similar, but CTEs are optimisation fences in Postgres, so I doubt this would help here (without actually trying it, my gut reaction is that it would perform far worse).



If you can use a stored procedure, or business logic in another layer of your application, it would be more efficient to read the highest and lowest IDs, then loop picking random values until you have 10 that exist, and select out those rows. This would not be suitable if your data has many large gaps in the IDs (i.e. if the sequence is more gap than used numbers).
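
A minimal sketch of that loop as a PL/pgSQL function (the function name pick_random_articles and the 1000-attempt safety cap are made up for illustration, and note it ignores the per-category spreading that the original query does):

CREATE OR REPLACE FUNCTION pick_random_articles(n int)
RETURNS SETOF articles
LANGUAGE plpgsql AS
$$
DECLARE
    lo bigint;
    hi bigint;
    found_ids bigint[] := '{}';
    candidate bigint;
    attempts int := 0;
BEGIN
    -- read the ID range once
    SELECT min(id), max(id) INTO lo, hi FROM articles;

    WHILE coalesce(array_length(found_ids, 1), 0) < n AND attempts < 1000 LOOP
        attempts := attempts + 1;
        candidate := lo + floor(random() * (hi - lo + 1))::bigint;

        -- accept the candidate only if it exists, matches the filters,
        -- and has not been picked already
        PERFORM 1
        FROM articles
        WHERE id = candidate
          AND status = 'PUBLISHED'
          AND LOWER(locale) = LOWER('en_US')
          AND source = 'WIKIPEDIA'
          AND candidate <> ALL (found_ids);

        IF FOUND THEN
            found_ids := found_ids || candidate;
        END IF;
    END LOOP;

    RETURN QUERY SELECT * FROM articles WHERE id = ANY (found_ids);
END;
$$;

-- usage: SELECT * FROM pick_random_articles(10);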



A final option that may reduce the amount of IO is to select just the key (id) in the CTE that orders by random, then join back to the base table to get the rest of each row. This way you will still scan the index and potentially write it to disk for sorting, but you won't scan and write the whole table.
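
Applied to the question's query, that rewrite might look something like this (a sketch: it keeps the per-category ROW_NUMBER logic, but only id and rn pass through the sort, and the wide rows are fetched afterwards by the join):

WITH ranked AS (
    SELECT
        id,
        ROW_NUMBER() OVER (PARTITION BY category_id
                           ORDER BY RANDOM()) AS rn
    FROM articles
    WHERE status = 'PUBLISHED'
      AND LOWER(locale) = LOWER('en_US')
      AND source = 'WIKIPEDIA'
)
SELECT a.*
FROM ranked r
JOIN articles a ON a.id = r.id
ORDER BY r.rn   -- rn = 1 rows first, i.e. one per category before repeats
LIMIT 10;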

answered Mar 7 '16 at 13:47, edited Mar 7 '16 at 17:25 by David Spillett

  • Good catch, I had my SQL Server head on. Postgres' use of the word cluster relating to indexes is quite different. I've edited the response to remove the irrelevant detail.

    – David Spillett
    Mar 7 '16 at 17:28