Can a spatial index help a “range - order by - limit” query?












I am asking this question specifically for Postgres, as it has good support for R-tree/spatial indexes.



We have the following table with a tree structure (Nested Set model) of words and their frequencies:



lexikon
-------
_id integer PRIMARY KEY
word text
frequency integer
lset integer UNIQUE KEY
rset integer UNIQUE KEY


And the query:



SELECT word
FROM lexikon
WHERE lset BETWEEN @Low AND @High
ORDER BY frequency DESC
LIMIT @N


I suppose a covering index on (lset, frequency, word) would be useful, but I feel it may not perform well if there are too many lset values in the (@Low, @High) range.



A simple index on (frequency DESC) may also be sufficient sometimes, when a scan of that index yields the @N rows that match the range condition early.
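
For reference, a minimal sketch of the two candidate indexes described above (the index names are illustrative; on PostgreSQL 9.2+ the first one can serve an index-only scan):

CREATE INDEX lexikon_covering  ON lexikon (lset, frequency, word);  -- candidate "covering" index for the range scan
CREATE INDEX lexikon_freq_desc ON lexikon (frequency DESC);         -- candidate frequency-first index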



But it seems that performance depends a lot on the parameter values.



Is there a way to make it perform fast, regardless of whether the range (@Low, @High) is wide or narrow and regardless of whether the top frequency words are luckily in the (narrow) selected range?



Would an R-tree/spatial index help?



Adding indexes, rewriting the query, re-designing the table: there is no limitation.










  • Covering indexes are introduced with 9.2 (now beta), btw. PostgreSQL people speak of Index-only scans. See this related answer: dba.stackexchange.com/a/7541/3684 and the PostgreSQL Wiki page.

    – Erwin Brandstetter
    Jun 1 '12 at 17:10













  • Two questions: (1) What kind of usage pattern do you expect for the table? Are there mostly reads or are there frequent updates (especially of the nested set variables)? (2) Is there any connection between the nested set integer variables lset and rset and the text variable word?

    – j.p.
    Aug 7 '12 at 7:40











  • @jug: Mostly reads. No connection between the lset,rset and word.

    – ypercubeᵀᴹ
    Aug 7 '12 at 7:42






  • If you had many updates, the nested set model would be a bad choice with respect to performance (if you have access to the book "The Art of SQL", take a look at the chapter about hierarchical models). But anyway, your main problem is similar to finding the maximum/the highest values (of an independent variable) on an interval, for which it is hard to design an indexing method. To my knowledge, the closest match to the index you need is the knngist module, but you would have to modify it to fit your needs. A spatial index is unlikely to be helpful.

    – j.p.
    Aug 7 '12 at 14:41
















4 Answers

You may be able to achieve better performance by searching first in rows with higher frequencies. This can be achieved by 'granulating' the frequencies and then stepping through them procedurally, for example as follows:



--testbed and lexikon dummy data:



begin;
set role dba;
create role stack;
grant stack to dba;
create schema authorization stack;
set role stack;
--
create table lexikon( _id serial,
word text,
frequency integer,
lset integer,
width_granule integer);
--
insert into lexikon(word, frequency, lset)
select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
--
update lexikon set width_granule=ln(frequency)::integer;
--
create index on lexikon(width_granule, lset);
create index on lexikon(lset);
-- the second index is not used with the function but is added to make the timings 'fair'


granule analysis (mostly for information and tuning):



create table granule as 
select width_granule, count(*) as freq,
min(frequency) as granule_start, max(frequency) as granule_end
from lexikon group by width_granule;
--
select * from granule order by 1;
/*
 width_granule |   freq | granule_start | granule_end
---------------+--------+---------------+-------------
             0 | 500000 |             1 |           1
             1 | 300000 |             2 |           4
             2 | 123077 |             5 |          12
             3 |  47512 |            13 |          33
             4 |  18422 |            34 |          90
             5 |   6908 |            91 |         244
             6 |   2580 |           245 |         665
             7 |    949 |           666 |        1808
             8 |    349 |          1811 |        4901
             9 |    129 |          4926 |       13333
            10 |     47 |         13513 |       35714
            11 |     17 |         37037 |       90909
            12 |      7 |        100000 |      250000
            13 |      2 |        333333 |      500000
            14 |      1 |       1000000 |     1000000
*/
alter table granule drop column freq;
--


function for scanning high frequencies first:



create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
  returns setof lexikon language plpgsql set search_path to 'stack' as $$
declare
  m integer;
  n integer := 0;
  r record;
begin
  for r in (select width_granule from granule order by width_granule desc) loop
    return query( select *
                  from lexikon
                  where width_granule=r.width_granule
                  and lset>=p_lset_low and lset<=p_lset_high );
    get diagnostics m = row_count;
    n = n+m;
    exit when n>=p_limit;
  end loop;
end;$$;


results (timings should probably be taken with a pinch of salt but each query is run twice to counter any caching)



first using the function we've written:



timing on
--
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 80.452 ms
*/
select * from f(20000, 30000, 5) order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 0.510 ms
*/


and then with a simple index scan:



select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 218.897 ms
*/
select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
/*
_id | word | frequency | lset | width_granule
-----+-----------+-----------+-------+---------------
141 | word23237 | 7092 | 23237 | 9
246 | word25112 | 4065 | 25112 | 8
275 | word23825 | 3636 | 23825 | 8
409 | word28660 | 2444 | 28660 | 8
418 | word29923 | 2392 | 29923 | 8
Time: 51.250 ms
*/
timing off
--
rollback;


Depending on your real-world data, you will probably want to vary the number of granules and the function used for putting rows into them. The actual distribution of frequencies is key here, as are the expected values for the limit clause and the size of the lset ranges sought.
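
For example, a minimal sketch of a coarser granulation, purely illustrative and not part of the original test (log base 4 instead of the natural log used above):

-- coarser granules: roughly one granule per factor-of-4 step in frequency
update lexikon set width_granule = (ln(frequency)/ln(4))::integer;
-- then rebuild the granule lookup table the same way as above
drop table if exists granule;
create table granule as
select width_granule, min(frequency) as granule_start, max(frequency) as granule_end
from lexikon group by width_granule;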






  • Why is there a gap, starting from width_granule=8, between granule_start and granule_end of the previous level?

    – vyegorov
    Feb 7 '13 at 8:14











  • @vyegorov because there aren't any values 1809 and 1810? This is randomly generated data so YMMV :)

    – Jack Douglas
    Feb 7 '13 at 11:14











  • Hm, it seems it has nothing to do with randomness, but rather with the way frequency is generated: there is a big gap between 1e6/2 and 1e6/3, and the higher the row number gets, the smaller the gap becomes. Anyway, thank you for this awesome approach!

    – vyegorov
    Feb 7 '13 at 12:04











  • @vyegorov sorry, yes, you are right. Be sure to take a look at Erwin's improvements if you haven't already!

    – Jack Douglas
    Feb 7 '13 at 12:24



















Setup



I am building on @Jack's setup to make it easier for people to follow and compare. Tested with PostgreSQL 9.1.4.



CREATE TABLE lexikon (
lex_id serial PRIMARY KEY
, word text
, frequency int NOT NULL -- we'd need to do more if NULL was allowed
, lset int
);

INSERT INTO lexikon(word, frequency, lset)
SELECT 'w' || g -- shorter with just 'w'
, (1000000 / row_number() OVER (ORDER BY random()))::int
, g
FROM generate_series(1,1000000) g;


From here on I take a different route:



ANALYZE lexikon;


Auxiliary table



This solution does not add columns to the original table; it just needs a tiny helper table. I placed it in the schema public; use any schema of your choice.



CREATE TABLE public.lex_freq AS
WITH x AS (
SELECT DISTINCT ON (f.row_min)
f.row_min, c.row_ct, c.frequency
FROM (
SELECT frequency, sum(count(*)) OVER (ORDER BY frequency DESC) AS row_ct
FROM lexikon
GROUP BY 1
) c
JOIN ( -- list of steps in recursive search
VALUES (400),(1600),(6400),(25000),(100000),(200000),(400000),(600000),(800000)
) f(row_min) ON c.row_ct >= f.row_min -- match next greater number
ORDER BY f.row_min, c.row_ct, c.frequency DESC
)
, y AS (
SELECT DISTINCT ON (frequency)
row_min, row_ct, frequency AS freq_min
, lag(frequency) OVER (ORDER BY row_min) AS freq_max
FROM x
ORDER BY frequency, row_min
-- if one frequency spans multiple ranges, pick the lowest row_min
)
SELECT row_min, row_ct, freq_min
, CASE freq_min <= freq_max
WHEN TRUE THEN 'frequency >= ' || freq_min || ' AND frequency < ' || freq_max
WHEN FALSE THEN 'frequency = ' || freq_min
ELSE 'frequency >= ' || freq_min
END AS cond
FROM y
ORDER BY row_min;


Table looks like this:



row_min | row_ct  | freq_min | cond
--------+---------+----------+---------------------------------------
    400 |     400 |     2500 | frequency >= 2500
   1600 |    1600 |      625 | frequency >= 625 AND frequency < 2500
   6400 |    6410 |      156 | frequency >= 156 AND frequency < 625
  25000 |   25000 |       40 | frequency >= 40 AND frequency < 156
 100000 |  100000 |       10 | frequency >= 10 AND frequency < 40
 200000 |  200000 |        5 | frequency >= 5 AND frequency < 10
 400000 |  500000 |        2 | frequency >= 2 AND frequency < 5
 600000 | 1000000 |        1 | frequency = 1


As the column cond is going to be used in dynamic SQL further down, you have to make this table secure. Always schema-qualify the table if you cannot be sure of an appropriate current search_path, and revoke write privileges from public (and any other untrusted role):



REVOKE ALL ON public.lex_freq FROM public;
GRANT SELECT ON public.lex_freq TO public;


The table lex_freq serves three purposes:




  • Create the needed partial indexes automatically.

  • Provide steps for the iterative function.

  • Provide meta information for tuning.


Indexes



This DO statement creates all needed indexes:



DO
$$
DECLARE
   _cond text;
BEGIN
   FOR _cond IN
      SELECT cond FROM public.lex_freq
   LOOP
      IF _cond LIKE 'frequency =%' THEN
         EXECUTE 'CREATE INDEX ON lexikon(lset) WHERE ' || _cond;
      ELSE
         EXECUTE 'CREATE INDEX ON lexikon(lset, frequency DESC) WHERE ' || _cond;
      END IF;
   END LOOP;
END
$$;


All of these partial indexes together span the table once. They are about the same size as one basic index on the whole table:



SELECT pg_size_pretty(pg_relation_size('lexikon'));       -- 50 MB
SELECT pg_size_pretty(pg_total_relation_size('lexikon')); -- 71 MB


Only 21 MB of indexes for a 50 MB table so far.



I create most of the partial indexes on (lset, frequency DESC). The second column only helps in special cases. But as both involved columns are of type integer, due to the specifics of data alignment in combination with MAXALIGN in PostgreSQL, the second column does not make the index any bigger. It's a small win for hardly any cost.



There is no point in doing that for partial indexes that only span a single frequency. Those are just on (lset). Created indexes look like this:



CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2500;
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 625 AND frequency < 2500;
-- ...
CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2 AND frequency < 5;
CREATE INDEX ON lexikon(lset) WHERE frequency = 1;


Function



The function is somewhat similar in style to @Jack's solution:



CREATE OR REPLACE FUNCTION f_search(_lset_min int, _lset_max int, _limit int)
  RETURNS SETOF lexikon AS
$func$
DECLARE
   _n    int;
   _rest int := _limit;  -- init with _limit param
   _cond text;
BEGIN
   FOR _cond IN
      SELECT l.cond FROM public.lex_freq l ORDER BY l.row_min
   LOOP
      -- RAISE NOTICE '_cond: %, _limit: %', _cond, _rest; -- for debugging
      RETURN QUERY EXECUTE '
         SELECT *
         FROM   public.lexikon
         WHERE  ' || _cond || '
         AND    lset >= $1
         AND    lset <= $2
         ORDER  BY frequency DESC
         LIMIT  $3'
      USING _lset_min, _lset_max, _rest;

      GET DIAGNOSTICS _n = ROW_COUNT;
      _rest := _rest - _n;
      EXIT WHEN _rest < 1;
   END LOOP;
END
$func$ LANGUAGE plpgsql STABLE;


Key differences:




  • dynamic SQL with RETURN QUERY EXECUTE.

    As we loop through the steps, a different query plan may be beneficial. The query plan for static SQL is generated once and then reused - which can save some overhead. But in this case the query is simple and the values are very different. Dynamic SQL will be a big win.


  • Dynamic LIMIT for every query step.

    This helps in multiple ways: first, rows are only fetched as needed, and in combination with dynamic SQL this may also generate different query plans to begin with. Second, there is no need for an additional LIMIT in the function call to trim the surplus.



Benchmark



Setup



I picked four examples and ran three different tests with each. I took the best of five to compare with warm cache:





  1. The raw SQL query of the form:



    SELECT * 
    FROM lexikon
    WHERE lset >= 20000
    AND lset <= 30000
    ORDER BY frequency DESC
    LIMIT 5;



  2. The same after creating this index



    CREATE INDEX ON lexikon(lset);


    Needs about the same space as all my partial indexes together:



    SELECT pg_size_pretty(pg_total_relation_size('lexikon')) -- 93 MB



  3. The function



    SELECT * FROM f_search(20000, 30000, 5);



Results



SELECT * FROM f_search(20000, 30000, 5);


1: Total runtime: 315.458 ms

2: Total runtime: 36.458 ms

3: Total runtime: 0.330 ms



SELECT * FROM f_search(60000, 65000, 100);


1: Total runtime: 294.819 ms

2: Total runtime: 18.915 ms

3: Total runtime: 1.414 ms



SELECT * FROM f_search(10000, 70000, 100);


1: Total runtime: 426.831 ms

2: Total runtime: 217.874 ms

3: Total runtime: 1.611 ms



SELECT * FROM f_search(1, 1000000, 5);


1: Total runtime: 2458.205 ms

2: Total runtime: 2458.205 ms -- for large ranges of lset, seq scan is faster than index.

3: Total runtime: 0.266 ms



Conclusion



As expected, the benefit from the function grows with bigger ranges of lset and smaller LIMIT.



With very small ranges of lset, the raw query in combination with the index is actually faster. You'll want to test and maybe branch: raw query for small ranges of lset, else function call. You could even just build that into the function for a "best of both worlds" - that's what I would do.
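
A minimal sketch of that branching idea, assuming the objects defined above; the function name f_search_branch and the 10000 threshold are illustrative (you would replace the threshold with a measured break-even point), and the small-range branch relies on the plain index on lexikon(lset) from test 2:

CREATE OR REPLACE FUNCTION f_search_branch(_lset_min int, _lset_max int, _limit int)
  RETURNS SETOF lexikon AS
$func$
BEGIN
   IF _lset_max - _lset_min < 10000 THEN   -- small range: the raw query is fast enough
      RETURN QUERY
      SELECT *
      FROM   public.lexikon
      WHERE  lset BETWEEN _lset_min AND _lset_max
      ORDER  BY frequency DESC
      LIMIT  _limit;
   ELSE                                    -- wide range: step through lex_freq
      RETURN QUERY
      SELECT * FROM f_search(_lset_min, _lset_max, _limit);
   END IF;
END
$func$ LANGUAGE plpgsql STABLE;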



Depending on your data distribution and typical queries, more steps in lex_freq may help performance. Test to find the sweet spot. With the tools presented here, it should be easy to test.






    I do not see any reason to include the word column in the index. So this index



    CREATE INDEX lexikon_lset_frequency ON lexikon (lset, frequency DESC);


    will make your query perform fast.



    UPD



    Currently there is no way to make a covering index in PostgreSQL. There was a discussion about this feature on the PostgreSQL mailing list: http://archives.postgresql.org/pgsql-performance/2012-06/msg00114.php
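
    For context only, not part of the original answer: newer PostgreSQL versions close this gap. Index-only scans arrived in 9.2, and PostgreSQL 11 added INCLUDE columns, so a covering variant of the index above might look like this on 11 or later (the index name is illustrative):

    CREATE INDEX lexikon_lset_frequency_covering
        ON lexikon (lset, frequency DESC) INCLUDE (word);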






    • It was included to make the index "covering".

      – ypercubeᵀᴹ
      Aug 7 '12 at 11:04













    • But by not searching for that term in the query decision tree, are you sure that the covering index is helping here?

      – jcolebrand
      Aug 9 '12 at 16:26











    • Okay, I see now. Currently there are no ways to make a covering index in PostgreSQL. There were a discussion about this feature in the mailing list archives.postgresql.org/pgsql-performance/2012-06/msg00114.php.

      – grayhemp
      Aug 10 '12 at 8:20













    • About "Covering indexes" in PostgreSQL see also Erwin Brandstetter's comment to the question.

      – j.p.
      Aug 10 '12 at 10:17

































    Using a GiST index




    Is there a way to make it perform fast, regardless of whether the range (@Low, @High) is wide or narrow and regardless of whether the top frequency words are luckily in the (narrow) selected range?




    It depends on what you mean by fast: you obviously have to visit every row in the range because your query is ORDER BY freq DESC. Short of that, the query planner already covers this, if I understand the question correctly.



    Here we create a table with 10k rows of (5::int,random()::double precision)



    CREATE EXTENSION IF NOT EXISTS btree_gist; -- btree_gist (not btree_gin) provides GiST operator classes for plain int/float columns
    CREATE TABLE t AS
    SELECT 5::int AS foo, random() AS bar
    FROM generate_series(1,1e4) AS gs(x);


    We index it,



    CREATE INDEX ON t USING gist (foo, bar);
    ANALYZE t;


    We query it,



    EXPLAIN ANALYZE
    SELECT *
    FROM t
    WHERE foo BETWEEN 1 AND 6
    ORDER BY bar DESC
    FETCH FIRST ROW ONLY;


    We get a Seq Scan on t. This is simply because our selectivity estimates let pg conclude heap access is faster than scanning an index and rechecking. So we make it more juicy by inserting 1,000,000 more rows of (42::int,random()::double precision) that don't fit our "range".



    INSERT INTO t(foo,bar)
    SELECT 42::int, x
    FROM generate_series(1,1e6) AS gs(x);

    VACUUM ANALYZE t;


    And then we requery,



    EXPLAIN ANALYZE
    SELECT *
    FROM t
    WHERE foo BETWEEN 1 AND 6
    ORDER BY bar DESC
    FETCH FIRST ROW ONLY;


    You can see here that we complete in 4.6 ms with an Index Only Scan:



                                                                    QUERY PLAN
    ---------------------------------------------------------------------------------------------------------------------------------------------
     Limit  (cost=617.64..617.64 rows=1 width=12) (actual time=4.652..4.652 rows=1 loops=1)
       ->  Sort  (cost=617.64..642.97 rows=10134 width=12) (actual time=4.651..4.651 rows=1 loops=1)
             Sort Key: bar DESC
             Sort Method: top-N heapsort  Memory: 25kB
             ->  Index Only Scan using t_foo_bar_idx on t  (cost=0.29..566.97 rows=10134 width=12) (actual time=0.123..3.623 rows=10000 loops=1)
                   Index Cond: ((foo >= 1) AND (foo <= 6))
                   Heap Fetches: 0
     Planning time: 0.144 ms
     Execution time: 4.678 ms
    (9 rows)


    Expanding the range to include the whole table produces another seq scan, logically enough, and growing it by another billion rows would produce another Index Scan.



    So in summary,




    • It'll perform fast for this quantity of data.

    • Fast is relative to the alternative: if the range isn't selective enough, a sequential scan may be as fast as you can get (see the sketch below).
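
    One way to sanity-check that last point, as a hedged sketch against the same table t built above (not part of the original answer), is to disable sequential scans temporarily and compare what EXPLAIN ANALYZE reports for the index path with the planner's default choice:

    -- Force the planner away from the sequential scan for a non-selective range,
    -- then compare against the default plan. Remember to RESET the setting afterwards.
    SET enable_seqscan = off;
    EXPLAIN ANALYZE
    SELECT *
    FROM t
    WHERE foo BETWEEN 1 AND 100   -- covers both foo=5 and foo=42, i.e. nearly the whole table
    ORDER BY bar DESC
    FETCH FIRST ROW ONLY;
    RESET enable_seqscan;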






    share|improve this answer























      Your Answer








      StackExchange.ready(function() {
      var channelOptions = {
      tags: "".split(" "),
      id: "182"
      };
      initTagRenderer("".split(" "), "".split(" "), channelOptions);

      StackExchange.using("externalEditor", function() {
      // Have to fire editor after snippets, if snippets enabled
      if (StackExchange.settings.snippets.snippetsEnabled) {
      StackExchange.using("snippets", function() {
      createEditor();
      });
      }
      else {
      createEditor();
      }
      });

      function createEditor() {
      StackExchange.prepareEditor({
      heartbeatType: 'answer',
      autoActivateHeartbeat: false,
      convertImagesToLinks: false,
      noModals: true,
      showLowRepImageUploadWarning: true,
      reputationToPostImages: null,
      bindNavPrevention: true,
      postfix: "",
      imageUploader: {
      brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
      contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
      allowUrls: true
      },
      onDemand: true,
      discardSelector: ".discard-answer"
      ,immediatelyShowMarkdownHelp:true
      });


      }
      });














      draft saved

      draft discarded


















      StackExchange.ready(
      function () {
      StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f18300%2fcan-spatial-index-help-a-range-order-by-limit-query%23new-answer', 'question_page');
      }
      );

      Post as a guest















      Required, but never shown

























      4 Answers
      4






      active

      oldest

      votes








      4 Answers
      4






      active

      oldest

      votes









      active

      oldest

      votes






      active

      oldest

      votes









      27





      +50










      You may be able to achieve better performance by searching first in rows with higher frequencies. This can be achieved by 'granulating' the frequencies and then stepping through them procedurally, for example as follows:



      --testbed and lexikon dummy data:



      begin;
      set role dba;
      create role stack;
      grant stack to dba;
      create schema authorization stack;
      set role stack;
      --
      create table lexikon( _id serial,
      word text,
      frequency integer,
      lset integer,
      width_granule integer);
      --
      insert into lexikon(word, frequency, lset)
      select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
      from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
      --
      update lexikon set width_granule=ln(frequency)::integer;
      --
      create index on lexikon(width_granule, lset);
      create index on lexikon(lset);
      -- the second index is not used with the function but is added to make the timings 'fair'


      granule analysis (mostly for information and tuning):



      create table granule as 
      select width_granule, count(*) as freq,
      min(frequency) as granule_start, max(frequency) as granule_end
      from lexikon group by width_granule;
      --
      select * from granule order by 1;
      /*
      width_granule | freq | granule_start | granule_end
      ---------------+--------+---------------+-------------
      0 | 500000 | 1 | 1
      1 | 300000 | 2 | 4
      2 | 123077 | 5 | 12
      3 | 47512 | 13 | 33
      4 | 18422 | 34 | 90
      5 | 6908 | 91 | 244
      6 | 2580 | 245 | 665
      7 | 949 | 666 | 1808
      8 | 349 | 1811 | 4901
      9 | 129 | 4926 | 13333
      10 | 47 | 13513 | 35714
      11 | 17 | 37037 | 90909
      12 | 7 | 100000 | 250000
      13 | 2 | 333333 | 500000
      14 | 1 | 1000000 | 1000000
      */
      alter table granule drop column freq;
      --


      function for scanning high frequencies first:



      create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
      returns setof lexikon language plpgsql set search_path to 'stack' as $$
      declare
      m integer;
      n integer := 0;
      r record;
      begin
      for r in (select width_granule from granule order by width_granule desc) loop
      return query( select *
      from lexikon
      where width_granule=r.width_granule
      and lset>=p_lset_low and lset<=p_lset_high );
      get diagnostics m = row_count;
      n = n+m;
      exit when n>=p_limit;
      end loop;
      end;$$;


      results (timings should probably be taken with a pinch of salt but each query is run twice to counter any caching)



      first using the function we've written:



      timing on
      --
      select * from f(20000, 30000, 5) order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 80.452 ms
      */
      select * from f(20000, 30000, 5) order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 0.510 ms
      */


      and then with a simple index scan:



      select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 218.897 ms
      */
      select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 51.250 ms
      */
      timing off
      --
      rollback;


      Depending on your real-world data, you will probably want to vary the number of granules and the function used for putting rows into them. The actual distribution of frequencies is key here, as is the expected values for the limit clause and size of lset ranges sought.






      share|improve this answer
























      • Why there's a gap starting from width_granule=8 between granulae_start and granulae_end of the previous level?

        – vyegorov
        Feb 7 '13 at 8:14











      • @vyegorov because there aren't any values 1809 and 1810? This is randomly generated data so YMMV :)

        – Jack Douglas
        Feb 7 '13 at 11:14











      • Hm, seems it has nothing to do with randomness, but rather with the way frequency is generated: a big gap between 1e6/2 and 1e6/3, the higher row number becomes, the smaller gap is. Anyway, Thank you for this awesome approach!!

        – vyegorov
        Feb 7 '13 at 12:04











      • @vyegorov sorry, yes, you are right. Be sure to take a look at Erwins improvements if you haven't already!

        – Jack Douglas
        Feb 7 '13 at 12:24
















      27





      +50










      You may be able to achieve better performance by searching first in rows with higher frequencies. This can be achieved by 'granulating' the frequencies and then stepping through them procedurally, for example as follows:



      --testbed and lexikon dummy data:



      begin;
      set role dba;
      create role stack;
      grant stack to dba;
      create schema authorization stack;
      set role stack;
      --
      create table lexikon( _id serial,
      word text,
      frequency integer,
      lset integer,
      width_granule integer);
      --
      insert into lexikon(word, frequency, lset)
      select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
      from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
      --
      update lexikon set width_granule=ln(frequency)::integer;
      --
      create index on lexikon(width_granule, lset);
      create index on lexikon(lset);
      -- the second index is not used with the function but is added to make the timings 'fair'


      granule analysis (mostly for information and tuning):



      create table granule as 
      select width_granule, count(*) as freq,
      min(frequency) as granule_start, max(frequency) as granule_end
      from lexikon group by width_granule;
      --
      select * from granule order by 1;
      /*
      width_granule | freq | granule_start | granule_end
      ---------------+--------+---------------+-------------
      0 | 500000 | 1 | 1
      1 | 300000 | 2 | 4
      2 | 123077 | 5 | 12
      3 | 47512 | 13 | 33
      4 | 18422 | 34 | 90
      5 | 6908 | 91 | 244
      6 | 2580 | 245 | 665
      7 | 949 | 666 | 1808
      8 | 349 | 1811 | 4901
      9 | 129 | 4926 | 13333
      10 | 47 | 13513 | 35714
      11 | 17 | 37037 | 90909
      12 | 7 | 100000 | 250000
      13 | 2 | 333333 | 500000
      14 | 1 | 1000000 | 1000000
      */
      alter table granule drop column freq;
      --


      function for scanning high frequencies first:



      create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
      returns setof lexikon language plpgsql set search_path to 'stack' as $$
      declare
      m integer;
      n integer := 0;
      r record;
      begin
      for r in (select width_granule from granule order by width_granule desc) loop
      return query( select *
      from lexikon
      where width_granule=r.width_granule
      and lset>=p_lset_low and lset<=p_lset_high );
      get diagnostics m = row_count;
      n = n+m;
      exit when n>=p_limit;
      end loop;
      end;$$;


      results (timings should probably be taken with a pinch of salt but each query is run twice to counter any caching)



      first using the function we've written:



      timing on
      --
      select * from f(20000, 30000, 5) order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 80.452 ms
      */
      select * from f(20000, 30000, 5) order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 0.510 ms
      */


      and then with a simple index scan:



      select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 218.897 ms
      */
      select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 51.250 ms
      */
      timing off
      --
      rollback;


      Depending on your real-world data, you will probably want to vary the number of granules and the function used for putting rows into them. The actual distribution of frequencies is key here, as is the expected values for the limit clause and size of lset ranges sought.






      share|improve this answer
























      • Why there's a gap starting from width_granule=8 between granulae_start and granulae_end of the previous level?

        – vyegorov
        Feb 7 '13 at 8:14











      • @vyegorov because there aren't any values 1809 and 1810? This is randomly generated data so YMMV :)

        – Jack Douglas
        Feb 7 '13 at 11:14











      • Hm, seems it has nothing to do with randomness, but rather with the way frequency is generated: a big gap between 1e6/2 and 1e6/3, the higher row number becomes, the smaller gap is. Anyway, Thank you for this awesome approach!!

        – vyegorov
        Feb 7 '13 at 12:04











      • @vyegorov sorry, yes, you are right. Be sure to take a look at Erwins improvements if you haven't already!

        – Jack Douglas
        Feb 7 '13 at 12:24














      27





      +50







      27





      +50



      27




      +50






      You may be able to achieve better performance by searching first in rows with higher frequencies. This can be achieved by 'granulating' the frequencies and then stepping through them procedurally, for example as follows:



      --testbed and lexikon dummy data:



      begin;
      set role dba;
      create role stack;
      grant stack to dba;
      create schema authorization stack;
      set role stack;
      --
      create table lexikon( _id serial,
      word text,
      frequency integer,
      lset integer,
      width_granule integer);
      --
      insert into lexikon(word, frequency, lset)
      select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
      from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
      --
      update lexikon set width_granule=ln(frequency)::integer;
      --
      create index on lexikon(width_granule, lset);
      create index on lexikon(lset);
      -- the second index is not used with the function but is added to make the timings 'fair'


      granule analysis (mostly for information and tuning):



      create table granule as 
      select width_granule, count(*) as freq,
      min(frequency) as granule_start, max(frequency) as granule_end
      from lexikon group by width_granule;
      --
      select * from granule order by 1;
      /*
      width_granule | freq | granule_start | granule_end
      ---------------+--------+---------------+-------------
      0 | 500000 | 1 | 1
      1 | 300000 | 2 | 4
      2 | 123077 | 5 | 12
      3 | 47512 | 13 | 33
      4 | 18422 | 34 | 90
      5 | 6908 | 91 | 244
      6 | 2580 | 245 | 665
      7 | 949 | 666 | 1808
      8 | 349 | 1811 | 4901
      9 | 129 | 4926 | 13333
      10 | 47 | 13513 | 35714
      11 | 17 | 37037 | 90909
      12 | 7 | 100000 | 250000
      13 | 2 | 333333 | 500000
      14 | 1 | 1000000 | 1000000
      */
      alter table granule drop column freq;
      --


      function for scanning high frequencies first:



      create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
      returns setof lexikon language plpgsql set search_path to 'stack' as $$
      declare
      m integer;
      n integer := 0;
      r record;
      begin
      for r in (select width_granule from granule order by width_granule desc) loop
      return query( select *
      from lexikon
      where width_granule=r.width_granule
      and lset>=p_lset_low and lset<=p_lset_high );
      get diagnostics m = row_count;
      n = n+m;
      exit when n>=p_limit;
      end loop;
      end;$$;


      results (timings should probably be taken with a pinch of salt but each query is run twice to counter any caching)



      first using the function we've written:



      timing on
      --
      select * from f(20000, 30000, 5) order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 80.452 ms
      */
      select * from f(20000, 30000, 5) order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 0.510 ms
      */


      and then with a simple index scan:



      select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 218.897 ms
      */
      select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 51.250 ms
      */
      timing off
      --
      rollback;


      Depending on your real-world data, you will probably want to vary the number of granules and the function used for putting rows into them. The actual distribution of frequencies is key here, as is the expected values for the limit clause and size of lset ranges sought.






      share|improve this answer














      You may be able to achieve better performance by searching first in rows with higher frequencies. This can be achieved by 'granulating' the frequencies and then stepping through them procedurally, for example as follows:



      --testbed and lexikon dummy data:



      begin;
      set role dba;
      create role stack;
      grant stack to dba;
      create schema authorization stack;
      set role stack;
      --
      create table lexikon( _id serial,
      word text,
      frequency integer,
      lset integer,
      width_granule integer);
      --
      insert into lexikon(word, frequency, lset)
      select word, (1000000/row_number() over(order by random()))::integer as frequency, lset
      from (select 'word'||generate_series(1,1000000) word, generate_series(1,1000000) lset) z;
      --
      update lexikon set width_granule=ln(frequency)::integer;
      --
      create index on lexikon(width_granule, lset);
      create index on lexikon(lset);
      -- the second index is not used with the function but is added to make the timings 'fair'


      granule analysis (mostly for information and tuning):



      create table granule as 
      select width_granule, count(*) as freq,
      min(frequency) as granule_start, max(frequency) as granule_end
      from lexikon group by width_granule;
      --
      select * from granule order by 1;
      /*
      width_granule | freq | granule_start | granule_end
      ---------------+--------+---------------+-------------
      0 | 500000 | 1 | 1
      1 | 300000 | 2 | 4
      2 | 123077 | 5 | 12
      3 | 47512 | 13 | 33
      4 | 18422 | 34 | 90
      5 | 6908 | 91 | 244
      6 | 2580 | 245 | 665
      7 | 949 | 666 | 1808
      8 | 349 | 1811 | 4901
      9 | 129 | 4926 | 13333
      10 | 47 | 13513 | 35714
      11 | 17 | 37037 | 90909
      12 | 7 | 100000 | 250000
      13 | 2 | 333333 | 500000
      14 | 1 | 1000000 | 1000000
      */
      alter table granule drop column freq;
      --


      function for scanning high frequencies first:



      create function f(p_lset_low in integer, p_lset_high in integer, p_limit in integer)
      returns setof lexikon language plpgsql set search_path to 'stack' as $$
      declare
      m integer;
      n integer := 0;
      r record;
      begin
      for r in (select width_granule from granule order by width_granule desc) loop
      return query( select *
      from lexikon
      where width_granule=r.width_granule
      and lset>=p_lset_low and lset<=p_lset_high );
      get diagnostics m = row_count;
      n = n+m;
      exit when n>=p_limit;
      end loop;
      end;$$;


      results (timings should probably be taken with a pinch of salt but each query is run twice to counter any caching)



      first using the function we've written:



      timing on
      --
      select * from f(20000, 30000, 5) order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 80.452 ms
      */
      select * from f(20000, 30000, 5) order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 0.510 ms
      */


      and then with a simple index scan:



      select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 218.897 ms
      */
      select * from lexikon where lset between 20000 and 30000 order by frequency desc limit 5;
      /*
      _id | word | frequency | lset | width_granule
      -----+-----------+-----------+-------+---------------
      141 | word23237 | 7092 | 23237 | 9
      246 | word25112 | 4065 | 25112 | 8
      275 | word23825 | 3636 | 23825 | 8
      409 | word28660 | 2444 | 28660 | 8
      418 | word29923 | 2392 | 29923 | 8
      Time: 51.250 ms
      */
      timing off
      --
      rollback;


      Depending on your real-world data, you will probably want to vary the number of granules and the function used for putting rows into them. The actual distribution of frequencies is key here, as is the expected values for the limit clause and size of lset ranges sought.







      share|improve this answer












      share|improve this answer



      share|improve this answer










      answered Aug 10 '12 at 14:21









      Jack DouglasJack Douglas

      27.6k1075148




      27.6k1075148













      • Why there's a gap starting from width_granule=8 between granulae_start and granulae_end of the previous level?

        – vyegorov
        Feb 7 '13 at 8:14











      • @vyegorov because there aren't any values 1809 and 1810? This is randomly generated data so YMMV :)

        – Jack Douglas
        Feb 7 '13 at 11:14











      • Hm, seems it has nothing to do with randomness, but rather with the way frequency is generated: a big gap between 1e6/2 and 1e6/3, the higher row number becomes, the smaller gap is. Anyway, Thank you for this awesome approach!!

        – vyegorov
        Feb 7 '13 at 12:04











      • @vyegorov sorry, yes, you are right. Be sure to take a look at Erwins improvements if you haven't already!

        – Jack Douglas
        Feb 7 '13 at 12:24



















      Setup



      I am building on @Jack's setup to make it easier for people to follow and compare. Tested with PostgreSQL 9.1.4.



CREATE TABLE lexikon (
   lex_id    serial PRIMARY KEY
 , word      text
 , frequency int NOT NULL  -- we'd need to do more if NULL was allowed
 , lset      int
);

INSERT INTO lexikon (word, frequency, lset)
SELECT 'w' || g  -- shorter with just 'w'
     , (1000000 / row_number() OVER (ORDER BY random()))::int
     , g
FROM   generate_series(1, 1000000) g;


      From here on I take a different route:



      ANALYZE lexikon;


      Auxiliary table



This solution does not add columns to the original table; it just needs a tiny helper table. I placed it in the schema public; use any schema of your choice.



CREATE TABLE public.lex_freq AS
WITH x AS (
   SELECT DISTINCT ON (f.row_min)
          f.row_min, c.row_ct, c.frequency
   FROM  (
      SELECT frequency, sum(count(*)) OVER (ORDER BY frequency DESC) AS row_ct
      FROM   lexikon
      GROUP  BY 1
      ) c
   JOIN  (  -- list of steps in recursive search
      VALUES (400),(1600),(6400),(25000),(100000),(200000),(400000),(600000),(800000)
      ) f(row_min) ON c.row_ct >= f.row_min  -- match next greater number
   ORDER  BY f.row_min, c.row_ct, c.frequency DESC
   )
 , y AS (
   SELECT DISTINCT ON (frequency)
          row_min, row_ct, frequency AS freq_min
        , lag(frequency) OVER (ORDER BY row_min) AS freq_max
   FROM   x
   ORDER  BY frequency, row_min
   -- if one frequency spans multiple ranges, pick the lowest row_min
   )
SELECT row_min, row_ct, freq_min
     , CASE freq_min <= freq_max
         WHEN TRUE  THEN 'frequency >= ' || freq_min || ' AND frequency < ' || freq_max
         WHEN FALSE THEN 'frequency = '  || freq_min
         ELSE            'frequency >= ' || freq_min
       END AS cond
FROM   y
ORDER  BY row_min;


      Table looks like this:



row_min | row_ct  | freq_min | cond
--------+---------+----------+---------------------------------------
    400 |     400 |     2500 | frequency >= 2500
   1600 |    1600 |      625 | frequency >= 625 AND frequency < 2500
   6400 |    6410 |      156 | frequency >= 156 AND frequency < 625
  25000 |   25000 |       40 | frequency >= 40 AND frequency < 156
 100000 |  100000 |       10 | frequency >= 10 AND frequency < 40
 200000 |  200000 |        5 | frequency >= 5 AND frequency < 10
 400000 |  500000 |        2 | frequency >= 2 AND frequency < 5
 600000 | 1000000 |        1 | frequency = 1


      As the column cond is going to be used in dynamic SQL further down, you have to make this table secure. Always schema-qualify the table if you cannot be sure of an appropriate current search_path, and revoke write privileges from public (and any other untrusted role):



      REVOKE ALL ON public.lex_freq FROM public;
      GRANT SELECT ON public.lex_freq TO public;


      The table lex_freq serves three purposes:




      • Create needed partial indexes automatically.

      • Provide steps for iterative function.

      • Meta information for tuning.


      Indexes



      This DO statement creates all needed indexes:



DO
$$
DECLARE
   _cond text;
BEGIN
   FOR _cond IN
      SELECT cond FROM public.lex_freq
   LOOP
      IF _cond LIKE 'frequency =%' THEN
         EXECUTE 'CREATE INDEX ON lexikon(lset) WHERE ' || _cond;
      ELSE
         EXECUTE 'CREATE INDEX ON lexikon(lset, frequency DESC) WHERE ' || _cond;
      END IF;
   END LOOP;
END
$$;


      All of these partial indexes together span the table once. They are about the same size as one basic index on the whole table:



      SELECT pg_size_pretty(pg_relation_size('lexikon'));       -- 50 MB
      SELECT pg_size_pretty(pg_total_relation_size('lexikon')); -- 71 MB


Only 21 MB of indexes for a 50 MB table so far.
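If you want to double-check what the DO block above actually created, the standard pg_indexes view lists the definitions. This is just a convenience query, not part of the original recipe:

SELECT indexname, indexdef
FROM   pg_indexes
WHERE  schemaname = 'public'
AND    tablename  = 'lexikon'
ORDER  BY indexname;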



I create most of the partial indexes on (lset, frequency DESC). The second column only helps in special cases. But since both columns involved are of type integer, data alignment (MAXALIGN) in PostgreSQL means the second column does not make the index any bigger. It's a small win for hardly any cost.
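You can see that for yourself by comparing a one-column and a two-column index directly. A throwaway sketch; the index names are made up for the demo and the indexes are dropped again afterwards:

-- Two 4-byte int columns fit into the same MAXALIGN'd index tuple as one,
-- so both indexes should come out the same size.
CREATE INDEX lexikon_lset_only ON lexikon (lset);
CREATE INDEX lexikon_lset_freq ON lexikon (lset, frequency DESC);

SELECT pg_size_pretty(pg_relation_size('lexikon_lset_only')) AS one_int_col
     , pg_size_pretty(pg_relation_size('lexikon_lset_freq')) AS two_int_cols;

DROP INDEX lexikon_lset_only, lexikon_lset_freq;  -- clean up the demo indexes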



      There is no point in doing that for partial indexes that only span a single frequency. Those are just on (lset). Created indexes look like this:



      CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2500;
      CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 625 AND frequency < 2500;
      -- ...
      CREATE INDEX ON lexikon(lset, frequency DESC) WHERE frequency >= 2 AND frequency < 5;
CREATE INDEX ON lexikon(lset) WHERE frequency = 1;


      Function



      The function is somewhat similar in style to @Jack's solution:



CREATE OR REPLACE FUNCTION f_search(_lset_min int, _lset_max int, _limit int)
  RETURNS SETOF lexikon AS
$func$
DECLARE
   _n    int;
   _rest int := _limit;  -- init with _limit param
   _cond text;
BEGIN
   FOR _cond IN
      SELECT l.cond FROM public.lex_freq l ORDER BY l.row_min
   LOOP
      -- RAISE NOTICE '_cond: %, _limit: %', _cond, _rest;  -- for debugging
      RETURN QUERY EXECUTE '
         SELECT *
         FROM   public.lexikon
         WHERE  ' || _cond || '
         AND    lset >= $1
         AND    lset <= $2
         ORDER  BY frequency DESC
         LIMIT  $3'
      USING  _lset_min, _lset_max, _rest;

      GET DIAGNOSTICS _n = ROW_COUNT;
      _rest := _rest - _n;
      EXIT WHEN _rest < 1;
   END LOOP;
END
$func$ LANGUAGE plpgsql STABLE;


      Key differences:




• Dynamic SQL with RETURN QUERY EXECUTE.

  As we loop through the steps, a different query plan may be beneficial. The query plan for static SQL is generated once and then reused - which can save some overhead. But in this case the query is simple and the values are very different. Dynamic SQL will be a big win.

• Dynamic LIMIT for every query step.

  This helps in multiple ways: first, rows are only fetched as needed. In combination with dynamic SQL, this may also generate different query plans to begin with. Second, there is no need for an additional LIMIT in the function call to trim the surplus.



      Benchmark



      Setup



      I picked four examples and ran three different tests with each. I took the best of five to compare with warm cache:





      1. The raw SQL query of the form:



        SELECT * 
        FROM lexikon
        WHERE lset >= 20000
        AND lset <= 30000
        ORDER BY frequency DESC
        LIMIT 5;



      2. The same after creating this index



        CREATE INDEX ON lexikon(lset);


        Needs about the same space as all my partial indexes together:



SELECT pg_size_pretty(pg_total_relation_size('lexikon')); -- 93 MB



      3. The function



        SELECT * FROM f_search(20000, 30000, 5);



      Results



      SELECT * FROM f_search(20000, 30000, 5);


      1: Total runtime: 315.458 ms

      2: Total runtime: 36.458 ms

      3: Total runtime: 0.330 ms



      SELECT * FROM f_search(60000, 65000, 100);


      1: Total runtime: 294.819 ms

      2: Total runtime: 18.915 ms

      3: Total runtime: 1.414 ms



      SELECT * FROM f_search(10000, 70000, 100);


      1: Total runtime: 426.831 ms

      2: Total runtime: 217.874 ms

      3: Total runtime: 1.611 ms



      SELECT * FROM f_search(1, 1000000, 5);


      1: Total runtime: 2458.205 ms

      2: Total runtime: 2458.205 ms -- for large ranges of lset, seq scan is faster than index.

      3: Total runtime: 0.266 ms



      Conclusion



      As expected, the benefit from the function grows with bigger ranges of lset and smaller LIMIT.



      With very small ranges of lset, the raw query in combination with the index is actually faster. You'll want to test and maybe branch: raw query for small ranges of lset, else function call. You could even just build that into the function for a "best of both worlds" - that's what I would do.
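A minimal sketch of that "best of both worlds" branching, assuming the plain index on (lset) from benchmark 2 exists and using an arbitrary threshold of 50000 that you would tune from your own measurements:

CREATE OR REPLACE FUNCTION f_search_smart(_lset_min int, _lset_max int, _limit int)
  RETURNS SETOF lexikon AS
$func$
BEGIN
   IF _lset_max - _lset_min < 50000 THEN  -- small range: the raw query wins
      RETURN QUERY
      SELECT *
      FROM   public.lexikon
      WHERE  lset BETWEEN _lset_min AND _lset_max
      ORDER  BY frequency DESC
      LIMIT  _limit;
   ELSE                                   -- wide range: step through lex_freq
      RETURN QUERY
      SELECT * FROM public.f_search(_lset_min, _lset_max, _limit);
   END IF;
END
$func$ LANGUAGE plpgsql STABLE;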



      Depending on your data distribution and typical queries, more steps in lex_freq may help performance. Test to find the sweet spot. With the tools presented here, it should be easy to test.






answered Aug 15 '12 at 3:36 (edited Jun 11 '18 at 13:13) – Erwin Brandstetter

I do not see any reason to include the word column in the index. So this index

CREATE INDEX lexikon_lset_frequency ON lexikon (lset, frequency DESC);

will make your query perform fast.

UPD

Currently there is no way to make a covering index in PostgreSQL. There was a discussion about this feature in the PostgreSQL mailing list: http://archives.postgresql.org/pgsql-performance/2012-06/msg00114.php
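(A side note for readers on much newer versions, not something available when this was written: PostgreSQL 11 added covering indexes via the INCLUDE clause. A hedged sketch that will not run on the 9.x versions discussed here:)

-- PostgreSQL 11+ only; word rides along as payload without being part of the key.
CREATE INDEX lexikon_lset_frequency_covering
    ON lexikon (lset, frequency DESC) INCLUDE (word);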






answered Aug 7 '12 at 10:59 (edited Aug 10 '12 at 8:22) – grayhemp

• It was included to make the index "covering".
  – ypercubeᵀᴹ, Aug 7 '12 at 11:04

• But by not searching for that term in the query decision tree, are you sure that the covering index is helping here?
  – jcolebrand, Aug 9 '12 at 16:26

• Okay, I see now. Currently there is no way to make a covering index in PostgreSQL. There was a discussion about this feature in the mailing list: archives.postgresql.org/pgsql-performance/2012-06/msg00114.php
  – grayhemp, Aug 10 '12 at 8:20

• About "covering indexes" in PostgreSQL, see also Erwin Brandstetter's comment on the question.
  – j.p., Aug 10 '12 at 10:17
















Using a GiST index




              Is there a way to make it perform fast, regardless of whether the range (@Low, @High) is wide or narrow and regardless of whether the top frequency words are luckily in the (narrow) selected range?




It depends on what you mean by fast: you obviously have to visit every row in the range because your query is ORDER BY frequency DESC. Short of that, the query planner already covers this, if I understand the question.



              Here we create a table with 10k rows of (5::int,random()::double precision)



CREATE EXTENSION IF NOT EXISTS btree_gist; -- btree_gist (not btree_gin) provides the GiST opclasses for plain int/float8 columns
CREATE TABLE t AS
SELECT 5::int AS foo, random() AS bar
FROM generate_series(1,1e4) AS gs(x);


              We index it,



              CREATE INDEX ON t USING gist (foo, bar);
              ANALYZE t;


              We query it,



              EXPLAIN ANALYZE
              SELECT *
              FROM t
              WHERE foo BETWEEN 1 AND 6
              ORDER BY bar DESC
              FETCH FIRST ROW ONLY;


              We get a Seq Scan on t. This is simply because our selectivity estimates let pg conclude heap access is faster than scanning an index and rechecking. So we make it more juicy by inserting 1,000,000 more rows of (42::int,random()::double precision) that don't fit our "range".



              INSERT INTO t(foo,bar)
              SELECT 42::int, x
              FROM generate_series(1,1e6) AS gs(x);

              VACUUM ANALYZE t;


              And then we requery,



              EXPLAIN ANALYZE
              SELECT *
              FROM t
              WHERE foo BETWEEN 1 AND 6
              ORDER BY bar DESC
              FETCH FIRST ROW ONLY;


You can see here that we complete in 4.6 ms with an Index Only Scan:



                                                                 QUERY PLAN
---------------------------------------------------------------------------------------------------------------------------------------------
 Limit  (cost=617.64..617.64 rows=1 width=12) (actual time=4.652..4.652 rows=1 loops=1)
   ->  Sort  (cost=617.64..642.97 rows=10134 width=12) (actual time=4.651..4.651 rows=1 loops=1)
         Sort Key: bar DESC
         Sort Method: top-N heapsort  Memory: 25kB
         ->  Index Only Scan using t_foo_bar_idx on t  (cost=0.29..566.97 rows=10134 width=12) (actual time=0.123..3.623 rows=10000 loops=1)
               Index Cond: ((foo >= 1) AND (foo <= 6))
               Heap Fetches: 0
 Planning time: 0.144 ms
 Execution time: 4.678 ms
(9 rows)


Expanding the range to include the whole table produces another seq scan, logically enough; and growing the table with another billion rows would produce another index scan.



              So in summary,




• It'll perform fast for this quantity of data.

• Fast is relative to the alternative: if the range isn't selective enough, a sequential scan may be as fast as you can get.
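If you want to convince yourself that the earlier seq scan (on the small 10k-row table) really was the cheaper plan rather than a missing-index problem, a quick diagnostic is to disable sequential scans for the session and compare. A sketch only, not a setting to leave on in production:

SET enable_seqscan = off;  -- session-local planner setting, for diagnosis only
EXPLAIN ANALYZE
SELECT *
FROM t
WHERE foo BETWEEN 1 AND 6
ORDER BY bar DESC
FETCH FIRST ROW ONLY;
RESET enable_seqscan;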






answered Jun 12 '18 at 18:40 – Evan Carroll