PostgreSQL - Working with array of thousands of elements
I'm looking to select rows based on whether a column is contained in a large list of values that I pass as an integer array.
Here's the query I currently use:
SELECT item_id, other_stuff, ...
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id, other_stuff, ...
    FROM mytable
    WHERE
        item_id = ANY ($1) -- Integer array
        AND end_date > $2
    ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12
The table is structured as follows:
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------
item_id | integer | | not null |
allowed | boolean | | not null |
start_date | timestamp without time zone | | not null |
end_date | timestamp without time zone | | not null |
...
Indexes:
"idx_dtr_query" btree (item_id, start_date, allowed, end_date)
...
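For reference, the idx_dtr_query index corresponds to roughly the following statement (btree is the default index type):
CREATE INDEX idx_dtr_query ON mytable (item_id, start_date, allowed, end_date);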
I came up with this index after trying different ones and running EXPLAIN
on the query. This one was the most efficient for both querying and sorting. Here is the EXPLAIN ANALYZE output of the query:
Subquery Scan on x (cost=0.56..368945.41 rows=302230 width=73) (actual time=0.021..276.476 rows=168395 loops=1)
Filter: (x.r <= 12)
Rows Removed by Filter: 90275
-> WindowAgg (cost=0.56..357611.80 rows=906689 width=73) (actual time=0.019..248.267 rows=258670 loops=1)
-> Index Scan using idx_dtr_query on mytable (cost=0.56..339478.02 rows=906689 width=73) (actual time=0.013..130.362 rows=258670 loops=1)
Index Cond: ((item_id = ANY ('{/* 15,000 integers */}'::integer[])) AND (end_date > '2018-03-30 12:08:00'::timestamp without time zone))
Planning time: 30.349 ms
Execution time: 284.619 ms
The issue is that the int array can contain up to 15,000 elements or so, and the query gets quite slow in that case (about 800 ms on my laptop, a recent Dell XPS).
I thought passing the int array as a parameter could be slow, and since the list of ids can be stored beforehand in the database, I tried that instead. I stored them in an array in another table and used item_id = ANY (SELECT UNNEST(item_ids) FROM ...), which was slower than my current approach. I also tried storing them row by row and using item_id IN (SELECT item_id FROM ...), which was even slower, even with only the rows relevant to my test case in the table.
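Roughly, those two variants looked like this (the side tables and their names are only illustrative, not my actual schema):
-- Variant 1: ids stored as an integer[] in another table
-- CREATE TABLE my_id_lists (list_id int PRIMARY KEY, item_ids integer[] NOT NULL);
SELECT item_id, allowed, start_date, end_date
FROM mytable
WHERE item_id = ANY (SELECT UNNEST(item_ids) FROM my_id_lists WHERE list_id = $1)
  AND end_date > $2;

-- Variant 2: ids stored one per row
-- CREATE TABLE my_id_rows (list_id int NOT NULL, item_id int NOT NULL);
SELECT item_id, allowed, start_date, end_date
FROM mytable
WHERE item_id IN (SELECT item_id FROM my_id_rows WHERE list_id = $1)
  AND end_date > $2;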
Is there a better way of doing this?
Update: following Evan's comments, here is another approach I tried. Each item is part of several groups, so instead of passing the group's item ids, I tried adding the group ids to mytable:
Column | Type | Collation | Nullable | Default
---------------+-----------------------------+-----------+----------+---------
item_id | integer | | not null |
allowed | boolean | | not null |
start_date | timestamp without time zone | | not null |
end_date | timestamp without time zone | | not null |
group_ids | integer[] | | not null |
...
Indexes:
"idx_dtr_query" btree (item_id, start_date, allowed, end_date)
"idx_dtr_group_ids" gin (group_ids)
...
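The new column and its GIN index were added with something along these lines (the exact DDL isn't shown here; the DEFAULT is only there so existing rows can satisfy the NOT NULL constraint):
ALTER TABLE mytable ADD COLUMN group_ids integer[] NOT NULL DEFAULT '{}';
CREATE INDEX idx_dtr_group_ids ON mytable USING gin (group_ids);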
New query ($1 is the targeted group id):
SELECT item_id, other_stuff, ...
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id, other_stuff, ...
    FROM mytable
    WHERE
        $1 = ANY (group_ids)
        AND end_date > $2
    ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12
EXPLAIN ANALYZE output:
Subquery Scan on x (cost=123356.60..137112.58 rows=131009 width=74) (actual time=811.337..1087.880 rows=172023 loops=1)
Filter: (x.r <= 12)
Rows Removed by Filter: 219726
-> WindowAgg (cost=123356.60..132199.73 rows=393028 width=74) (actual time=811.330..1040.121 rows=391749 loops=1)
-> Sort (cost=123356.60..124339.17 rows=393028 width=74) (actual time=811.311..868.127 rows=391749 loops=1)
Sort Key: item_id, start_date, allowed
Sort Method: external sort Disk: 29176kB
-> Seq Scan on mytable (cost=0.00..69370.90 rows=393028 width=74) (actual time=0.105..464.126 rows=391749 loops=1)
Filter: ((end_date > '2018-04-06 12:00:00'::timestamp without time zone) AND (2928 = ANY (group_ids)))
Rows Removed by Filter: 1482567
Planning time: 0.756 ms
Execution time: 1098.348 ms
There might be room for improvement with indexes, but I'm having a hard time understanding how Postgres uses them, so I'm not sure what to change.
Tags: postgresql, postgresql-performance
How many rows are in "mytable"? How many different "item_id" values are there?
– Nick
Mar 31 '18 at 3:00
Also, shouldn't you have a uniqueness constraint (probably a not-yet-defined unique index) on item_id in mytable? ... Edited: oh, I see "PARTITION BY item_id", so this question transforms to "What is the natural, real key for your data? What should form a unique index there?"
– Nick
Mar 31 '18 at 3:03
About 12 million rows in mytable, with about 500k different item_id. There is no real natural unique key for this table; it's data that's generated automatically for repeating events. I guess item_id + start_date + name (field not shown here) could constitute some kind of key.
– Jukurrpa
Apr 2 '18 at 19:22
Can you post the execution plan you're getting?
– Colin 't Hart
Apr 4 '18 at 8:53
Sure, added the explain analyze to the question.
– Jukurrpa
Apr 4 '18 at 18:11
1 Answer
Is there a better way of doing this?
Yes, use a temp table. There is nothing wrong with creating an indexed temp table when your query is that insane.
BEGIN;
CREATE TEMP TABLE myitems ( item_id int PRIMARY KEY );
INSERT INTO myitems(item_id) VALUES (1), (2); -- and on and on
CREATE INDEX ON myitems(item_id);
COMMIT;
ANALYZE myitems;

SELECT item_id, other_stuff, ...
FROM (
    SELECT
        -- Partitioned row number as we only want N rows per id
        ROW_NUMBER() OVER (PARTITION BY item_id ORDER BY start_date) AS r,
        item_id, other_stuff, ...
    FROM mytable
    INNER JOIN myitems USING (item_id)
    WHERE end_date > $2
    ORDER BY item_id ASC, start_date ASC, allowed ASC
) x
WHERE x.r <= 12;
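One way to populate the temp table without spelling out thousands of VALUES, assuming the ids still arrive as the integer[] parameter from the question, is roughly:
-- Bulk-load the id array in a single statement; DISTINCT guards the PRIMARY KEY
INSERT INTO myitems (item_id)
SELECT DISTINCT UNNEST($1::integer[]);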
But even better than that...
"500k different item_id" ... "int array can contain up to 15,000 elements"
You're selecting 3% of your database individually. I have to wonder if you're not better off creating groups/tags etc. in the schema itself. I have never personally had to send 15,000 different IDs into a query.
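As a sketch of that idea, group membership could live in its own table and the query could filter on a single group id instead of thousands of item ids (table and column names here are purely illustrative):
CREATE TABLE item_groups (
    group_id int NOT NULL,
    item_id  int NOT NULL,
    PRIMARY KEY (group_id, item_id)
);

SELECT m.item_id, m.allowed, m.start_date, m.end_date
FROM mytable m
JOIN item_groups g USING (item_id)
WHERE g.group_id = $1
  AND m.end_date > $2;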
Just tried using the temporary table and it's slower, at least in the 15,000 ids case. As for creating groups in the schema itself, do you mean a table with the ids I pass as an argument? I tried something like this but the performance was similar or worse than my current approach. I'll update the question with more details.
– Jukurrpa
Apr 5 '18 at 18:33
No, I mean: if you have 15,000 ids, normally you're storing something in the ID, like whether or not the item is a kitchen product, and rather than storing the group_id that corresponds to "kitchen product", you're trying to find all kitchen products by their ids (which is bad for every reason). What is it that those 15,000 ids represent? Why isn't it stored on the row itself?
– Evan Carroll
Apr 5 '18 at 18:36
Each item belongs to multiple groups (usually 15-20 of them), so I tried storing them as an int array in mytable but couldn't figure out how to index this properly. I updated the question with all the details.
– Jukurrpa
Apr 6 '18 at 19:27