Improve Postgres query performance on Windows with billions of records, TB data size



























Using Postgres 9.6 with pgAdmin 4 (for running the queries) on Windows 10 64-bit, with an Intel Core i7-6700 CPU @ 3.40GHz, 16GB RAM, a 1 TB HDD (C: drive) and a 2 TB HDD (D: drive). The Postgres data directory is on the D: drive.



I'm using Postgres to process a dataset that I receive in CSV format. The steps involved are as follows:





  1. Create the table schema - 4 tables are created, one for each set of CSV files, with the same schema as the fields in the input CSV files.


  2. Import the CSVs into the Postgres DB - around 4 sets of 500 CSV files, each file with around 4 million rows, are imported into their own tables - say table01, table02, table03, table04. I use a batch file to get the list of CSV files in each set and a psql -c "COPY ..." command to import them. Each CSV file takes around 30s to import, so altogether about 1000 minutes or ~17hrs. Each table ends up with ~2 billion rows and roughly 350GB on disk.


  3. Create a subset of valid records with the relevant columns - for each table from step 2, create a new table (say table01_valid, table02_valid, table03_valid, table04_valid) that keeps a subset of the columns and excludes rows that fail certain criteria. After this the row count comes down to ~1.5 billion per table, with each table around 200GB in size. This operation takes ~20hrs for the 4 tables.


  4. Select distinct records and merge - the distinct rows from each table are selected, merged via UNION ALL, and a final set of distinct rows is taken from the merged records. This is saved into a new table, say table_valid_distinct_1234, which has ~450 million rows and is ~35GB in size. This operation takes ~12hrs.


  5. Run transformations on some columns - for each row, two columns are concatenated and selected as an id field, and two other columns are hashed, then concatenated and selected as a name field. A final select is performed on this set to remove any id with more than one unique name. The resulting table is ~50GB with ~450 million rows. This operation takes ~5hrs.


Given the above steps, I'd like feedback on improving processing speed given the number of records, the data size and the hardware I'm using. Ideas on improving performance and disk usage in any of steps 1 to 5 would be helpful - I can provide additional information if needed. If there is some sort of standard process for investigating this, that would also be very helpful. If hardware-related enhancements are required, should I be looking at adding more RAM or something else?



Also, one anomaly I see in pgAdmin (which I use to run the queries) is that it reports the query run time as a few minutes for each of the above operations, whereas in reality each one runs for a number of hours.



Update



As suggested by @filiprem, I'm adding the schema and the query operations below.



-- db tables
my_db_201710=# \d
List of relations
Schema | Name | Type | Owner
--------+----------------------------------------+-------+----------
public | data_final | table | postgres -- ~450 mill rows, ~50 GB
public | data_distinct | table | postgres -- ~450 mill rows, ~35 GB
public | data_valid01 | table | postgres -- ~1.8 bill rows, ~200 GB
public | data_raw01 | table | postgres -- ~2 bill rows, ~350 GB
public | data_valid02 | table | postgres -- ~1.8 bill rows, ~200 GB
public | data_raw02 | table | postgres -- ~2 bill rows, ~350 GB
public | data_valid03 | table | postgres -- ~1.8 bill rows, ~200 GB
public | data_raw03 | table | postgres -- ~2 bill rows, ~350 GB
public | data_valid04 | table | postgres -- ~1.8 bill rows, ~200 GB
public | data_raw04 | table | postgres -- ~2 bill rows, ~350 GB
(10 rows)

-- import csv data into postgres
--------------
@echo off
setlocal
set PGPASSWORD=admin

for %%f in (*.csv) do psql -d my_db_201710 -h localhost -U postgres -p 5432 -c "COPY public.data_raw01 FROM '%%~dpnxf' DELIMITER ',' CSV;"

pause
endlocal
--------------
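
One import variant I've been considering (untested, adding it here for feedback - the script names import_set01.bat and import_set02.bat below are placeholders for per-set copies of the script above): since each COPY statement is single-threaded, launching two or three of the per-set import scripts concurrently might cut the wall-clock import time on the 4-core CPU, provided the single HDD doesn't become the bottleneck.

--------------
@echo off
REM Untested sketch: run two of the per-set import scripts concurrently, one psql/COPY
REM stream each. import_set01.bat and import_set02.bat are assumed to be copies of the
REM loop above, each pointing at its own folder of CSV files and its own data_rawNN table.
start "import01" cmd /c import_set01.bat
start "import02" cmd /c import_set02.bat
--------------

Would 2-3 parallel COPY streams be reasonable here, or would HDD seek overhead cancel out the gain?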

my_db_201710=# \d+ data_raw01
Table "public.data_raw01"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------------------------+-------------------+-----------+----------+--------------+-------------
type | integer | | plain | |
time | bigint | | plain | |
msg_type | integer | | plain | |
si_id | character varying | | extended | |
mi_id | character varying | | extended | |
ti_id | character varying | | extended | |
session_id | character varying | | extended | |
mip_address | character varying | | extended | |
eip_address | character varying | | extended | |
mm_id | integer | | plain | |
en_id | integer | | plain | |

-- select valid records and the relevant columns, with some column transformations
create table data_valid01 as (
select "time" as epoch_time, mi_id, si_id,  -- the bigint epoch column is named "time" in data_raw01
substring(lpad(substring(mi_id, 1,14 ), 14, '0'), 1,8 ) as mi_hdr,
substring(lpad(substring(mi_id, 1,14 ), 14, '0'), 9,6 ) as mi_sn,
substring(si_id, 1,5 ) as si_hdr,
substring(si_id, 6,10 ) as si_sn,
lpad(to_hex(mm_id),8,'0') as mmid,
lpad(to_hex(en_id), 6, '0') as enid
from data_raw01
where mi_id is not null and mm_id is not null and en_id is not null
) ;
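
One variant of this step I've been wondering about (untested, listing it for feedback): since data_valid01 is derived and can always be rebuilt from data_raw01, it could be created as an UNLOGGED table so the bulk CREATE TABLE AS skips WAL writes, at the cost of the table not being crash-safe. The column expressions are the same as above; the epoch column name is taken from the \d+ listing of data_raw01.

create unlogged table data_valid01 as (
select "time" as epoch_time, mi_id, si_id,  -- "time" is the bigint epoch column in data_raw01
substring(lpad(substring(mi_id, 1,14 ), 14, '0'), 1,8 ) as mi_hdr,
substring(lpad(substring(mi_id, 1,14 ), 14, '0'), 9,6 ) as mi_sn,
substring(si_id, 1,5 ) as si_hdr,
substring(si_id, 6,10 ) as si_sn,
lpad(to_hex(mm_id),8,'0') as mmid,
lpad(to_hex(en_id), 6, '0') as enid
from data_raw01
where mi_id is not null and mm_id is not null and en_id is not null
) ;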

my_db_201710=# \d+ data_valid01
Table "public.data_valid01"
Column | Type | Modifiers | Storage | Stats target | Description
-------------+-------------------+-----------+----------+--------------+-------------
epoch_time | bigint | | plain | |
si_id | character varying | | extended | |
mi_id | character varying | | extended | |
si_hdr | text | | extended | |
si_sn | text | | extended | |
mi_hdr | text | | extended | |
mi_sn | text | | extended | |
mmid | text | | extended | |
enid | text | | extended | |

-- merge the 4 valid tables and select the distinct rows from the merged set
create table data_distinct as (
select
distinct mmid, enid, si_hdr, si_sn, mi_hdr, mi_sn
from (
select *
from data_valid01
where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
union all
select *
from data_valid02
where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
union all
select *
from data_valid03
where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
union all
select *
from data_valid04
where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
) as a
);
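
A possible tweak for this step that I have not tried yet: each of the four subqueries scans a full ~200 GB table just to pick out a narrow epoch_time window. Assuming the rows were loaded in roughly time order (I would still have to verify that), a small BRIN index on epoch_time should let Postgres skip most of the blocks outside the window; raising work_mem for the session should also help the large DISTINCT sort. The 1GB value is only a guess for a 16GB machine running nothing else.

create index on data_valid01 using brin (epoch_time);
create index on data_valid02 using brin (epoch_time);
create index on data_valid03 using brin (epoch_time);
create index on data_valid04 using brin (epoch_time);

set work_mem = '1GB';  -- session-only setting, applied just before running the query above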


my_db_201710=# \d+ data_distinct
Table "public.data_distinct"
Column | Type | Modifiers | Storage | Stats target | Description
-------------+------+-----------+----------+--------------+-------------
mmid | text | | extended | |
enid | text | | extended | |
si_hdr | text | | extended | |
si_sn | text | | extended | |
mi_hdr | text | | extended | |
mi_sn | text | | extended | |

-- concatenate some columns; hash and concatenate some other columns
create extension pgcrypto;
create table data_final as (
SELECT id, simi
FROM
(
SELECT id, simi, COUNT(*) OVER (PARTITION BY id) cnt
FROM
(
select
mmid || ',' || enid as Id,
si_hdr || encode(substring(digest(si_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') || ',' ||
mi_hdr || encode(substring(digest(mi_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') as simi
from data_distinct
) t1
) t2
where cnt = 1
);
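
An alternative formulation of this step I've been considering (untested): the "keep only ids that occur exactly once" rule can also be written as a plain GROUP BY ... HAVING count(*) = 1, avoiding the window function. Since a unique id has exactly one simi value, max(simi) returns it unchanged, so the result should be identical; I would test on a subset first to see whether the planner handles it better.

create table data_final as
select id, max(simi) as simi
from (
select
mmid || ',' || enid as id,
si_hdr || encode(substring(digest(si_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') || ',' ||
mi_hdr || encode(substring(digest(mi_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') as simi
from data_distinct
) t
group by id
having count(*) = 1;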

my_db_201710=# \d+ data_final
Table "public.data_final"
Column | Type | Modifiers | Storage | Stats target | Description
----------+------+-----------+----------+--------------+-------------
id | text | | extended | |
simi | text | | extended | |


my_db_201710=#









1 Answer
Interesting question, but also a very open one. I'm putting a list of recommendations here - hope it helps.




• Publish the full schema and data samples if you want more people to work on this question.

• Publish your postgresql.conf.

• Follow the official PostgreSQL performance hints.

• Use parallel processing where possible. On a modern 4-core desktop CPU you can start with 2-3 concurrent streams. Observe system CPU usage (kernel time vs user time) while running the import; COPY is CPU-intensive.

• If you don't need durability and crash safety, follow the Non-Durable Settings guidance (a rough sketch follows this list).

• Consider combining step 2 and step 3 (filter on the fly) if you don't need the invalid data in Postgres.

• Consider a plain UPDATE instead of copying the same data into new tables. It might be faster or slower depending on the structure and constraints.

• Before loading, drop all but the absolutely required indexes. Recreate them after loading.

• Regarding hardware investments, an SSD will have the highest impact-to-cost ratio.
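
For example, a rough sketch of the non-durable direction (starting values only, for a throwaway analytics instance - these settings trade crash safety for load speed, so don't use them on data you cannot reload from the CSVs):

alter system set synchronous_commit = 'off';    -- don't wait for the WAL flush on every commit
alter system set checkpoint_timeout = '30min';  -- fewer, larger checkpoints during bulk loads
alter system set max_wal_size = '16GB';         -- same goal: avoid checkpointing every few seconds
select pg_reload_conf();
-- The docs also cover wal_level = minimal (needs a restart) and fsync / full_page_writes = off;
-- those last two can corrupt the whole cluster after a crash, not just lose recent commits.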































• As suggested, I have added the full schema and the queries here. – user3206440, Nov 26 '17 at 6:47











• Thanks. This looks less like a single-shot question and more like an optimization task that will take a fair amount of time to solve. Try to track your I/O and CPU during the process - that would be very valuable input. – filiprem, Nov 28 '17 at 11:02










