improve postgres query perfromance on windows with billions of records, TB data size












1















using Postgres 9.6, pgAdmin 4 (for running the queries) on Windows 10 64-bit with Intel Core i7-6700 CPU @ 3.40GHz, 16GB RAM, 1 TB C drive HDD , 2 TB D drive HDD. Postgres data is stored in D drive



I'm using Postgres to process some dataset that I receive in csv format- the steps involved are as follows





  1. Create table schema - 4 tables are created, one for each set of csv files with the same schema as per the fields in the input csv files


  2. Import CSVs to Postgres DB - Import around 4 sets of 500 CSV files each of around 4 million rows into own tables - say table01, table02, table03, table04 - I use a batch file to get the list of csv files in each set and use psql -c "copy .. command to import them. This process takes around 30s per csv file to import - so altogether 1000 minutes or ~17hrs . Each table has ~2bill rows and around 350MB per table.


  3. Create a subset of valid records and select relevant columns - For each table in #2 above, create a new table (say table01_valid, table02_valid, table03_valid, table04_valid) which has a subset of columns and exclude some rows based on a criteria - after this operation the number of rows comes down to ~1.5bill per table with each table ~200MB in size - this operation takes ~20hrs to execute for the 4 tables


  4. Select distinct records and merge - in this step the distinct rows from each table is selected, merge them via union all and a final distinct rows from the merged records is selected - this is saved into a new table say table_valid_distinct_1234 - the table size is ~450mill rows and ~35GB size - this operation takes ~12hrs


  5. Run transformations on some columns - in this step for each row, two columns are concatenated and selected as id field , two other columns are hashed, then concatenated and selected as name field. A final select is performed on this set to remove any ids with more than one unique name - the table size is ~50GB with ~450mill rows- this operation takes ~5hrs


Given the above steps, I wanted to collect feedback on improving processing speed given the number of records, data size and the hardware I'm using. Ideas on improving performance and disk usage in any of the steps from 1 to 5 would be helpful. I could provide additional info needed. Also if there is some sort of standard process to investigate this, that would also be very helpful. If there are hardware related enhancements required - should I be looking at adding more memory, more RAM or something else ?



Also one of the anomalies I see in pgAdmin which I use to run the queries is that it reports query run time as a few minutes for each of the above operations where as in reality it runs for a number of hours.



update



As suggested by @filiprem , adding the schema and query operations below



-- db tables
my_db_201710=# d
List of relations
Schema | Name | Type | Owner
--------+----------------------------------------+-------+----------
public | data_final | table | postgres -- ~450 mill rows, ~50 GB
public | data_distinct | table | postgres -- ~450 mill rows, ~35 GB
public | data_valid01 | table | postgres -- ~1.8 bill rows, ~200 GB
public | data_raw01 | table | postgres -- ~2 bill rows, ~350 GB
public | data_valid02 | table | postgres -- ~1.8 bill rows, ~200 GB
public | data_raw02 | table | postgres -- ~2 bill rows, ~350 GB
public | data_valid03 | table | postgres -- ~1.8 bill rows, ~200 GB
public | data_raw03 | table | postgres -- ~2 bill rows, ~350 GB
public | data_valid04 | table | postgres -- ~1.8 bill rows, ~200 GB
public | data_raw04 | table | postgres -- ~2 bill rows, ~350 GB
(10 rows)

-- import csv data into postgres
--------------
@echo off
setlocal
set PGPASSWORD=admin

for %%f in (*.csv) do psql -d my_db_201710 -h localhost -U postgres -p 5432 -c "COPY public.data_raw01 FROM '%%~dpnxf' DELIMITER ',' CSV;"

pause
endlocal
--------------

my_db_201710=# d+ data_raw01
Table "public.data_raw01"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------------------------+-------------------+-----------+----------+--------------+-------------
type | integer | | plain | |
time | bigint | | plain | |
msg_type | integer | | plain | |
si_id | character varying | | extended | |
mi_id | character varying | | extended | |
ti_id | character varying | | extended | |
session_id | character varying | | extended | |
mip_address | character varying | | extended | |
eip_address | character varying | | extended | |
mm_id | integer | | plain | |
en_id | integer | | plain | |

-- select valid records, columns with some column transformations
create table data_valid01 as (
select report_data_time as epoch_time, mi_id, si_id,
substring(lpad(substring(mi_id, 1,14 ), 14, '0'), 1,8 ) as mi_hdr,
substring(lpad(substring(mi_id, 1,14 ), 14, '0'), 9,6 ) as mi_sn,
substring(si_id, 1,5 ) as si_hdr,
substring(si_id, 6,10 ) as si_sn,
lpad(to_hex(mm_id),8,'0') as mmid,
lpad(to_hex(en_id), 6, '0') as enid
from data_raw01
where mi_id is not null and mm_id is not null and en_id is not null
) ;

my_db_201710=# d+ data_valid01
Table "public.data_valid01"
Column | Type | Modifiers | Storage | Stats target | Description
-------------+-------------------+-----------+----------+--------------+-------------
epoch_time | bigint | | plain | |
si_id | character varying | | extended | |
mi_id | character varying | | extended | |
si_hdr | text | | extended | |
si_sn | text | | extended | |
mi_hdr | text | | extended | |
mi_sn | text | | extended | |
mmid | text | | extended | |
enid | text | | extended | |

-- merge distinct from the 4 valid tables and select distinct rows from that
create table data_distinct as (
select
distinct mmid, enid, si_hdr, si_sn, mm_hdr, mm_sn
from (
select *
from data_valid01
where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
union all
select *
from data_valid02
where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
union all
select *
from data_valid03
where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
union all
select *
from data_valid04
where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
) as a
);


my_db_201710=# d+ data_distinct
Table "public.data_distinct"
Column | Type | Modifiers | Storage | Stats target | Description
-------------+------+-----------+----------+--------------+-------------
mmid | text | | extended | |
enid | text | | extended | |
si_hdr | text | | extended | |
si_sn | text | | extended | |
mi_hdr | text | | extended | |
mi_sn | text | | extended | |

-- concatenate some cols, hash * concatenate some other colums
create extension pgcrypto;
create table data_final as (
SELECT id, simi
FROM
(
SELECT id, simi, COUNT(*) OVER (PARTITION BY id) cnt
FROM
(
select
mmid || ',' || enid as Id,
si_hdr || encode(substring(digest(si_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') || ',' ||
mi_hdr || encode(substring(digest(mi_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') as simi
from data_distinct
) t1
) t2
where cnt = 1
);

my_db_201710=# d+ data_final
Table "public.data_final"
Column | Type | Modifiers | Storage | Stats target | Description
----------+------+-----------+----------+--------------+-------------
id | text | | extended | |
simi | text | | extended | |


my_db_201710=#









share|improve this question
















bumped to the homepage by Community 3 mins ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.




















    1















    using Postgres 9.6, pgAdmin 4 (for running the queries) on Windows 10 64-bit with Intel Core i7-6700 CPU @ 3.40GHz, 16GB RAM, 1 TB C drive HDD , 2 TB D drive HDD. Postgres data is stored in D drive



    I'm using Postgres to process some dataset that I receive in csv format- the steps involved are as follows





    1. Create table schema - 4 tables are created, one for each set of csv files with the same schema as per the fields in the input csv files


    2. Import CSVs to Postgres DB - Import around 4 sets of 500 CSV files each of around 4 million rows into own tables - say table01, table02, table03, table04 - I use a batch file to get the list of csv files in each set and use psql -c "copy .. command to import them. This process takes around 30s per csv file to import - so altogether 1000 minutes or ~17hrs . Each table has ~2bill rows and around 350MB per table.


    3. Create a subset of valid records and select relevant columns - For each table in #2 above, create a new table (say table01_valid, table02_valid, table03_valid, table04_valid) which has a subset of columns and exclude some rows based on a criteria - after this operation the number of rows comes down to ~1.5bill per table with each table ~200MB in size - this operation takes ~20hrs to execute for the 4 tables


    4. Select distinct records and merge - in this step the distinct rows from each table is selected, merge them via union all and a final distinct rows from the merged records is selected - this is saved into a new table say table_valid_distinct_1234 - the table size is ~450mill rows and ~35GB size - this operation takes ~12hrs


    5. Run transformations on some columns - in this step for each row, two columns are concatenated and selected as id field , two other columns are hashed, then concatenated and selected as name field. A final select is performed on this set to remove any ids with more than one unique name - the table size is ~50GB with ~450mill rows- this operation takes ~5hrs


    Given the above steps, I wanted to collect feedback on improving processing speed given the number of records, data size and the hardware I'm using. Ideas on improving performance and disk usage in any of the steps from 1 to 5 would be helpful. I could provide additional info needed. Also if there is some sort of standard process to investigate this, that would also be very helpful. If there are hardware related enhancements required - should I be looking at adding more memory, more RAM or something else ?



    Also one of the anomalies I see in pgAdmin which I use to run the queries is that it reports query run time as a few minutes for each of the above operations where as in reality it runs for a number of hours.



    update



    As suggested by @filiprem , adding the schema and query operations below



    -- db tables
    my_db_201710=# d
    List of relations
    Schema | Name | Type | Owner
    --------+----------------------------------------+-------+----------
    public | data_final | table | postgres -- ~450 mill rows, ~50 GB
    public | data_distinct | table | postgres -- ~450 mill rows, ~35 GB
    public | data_valid01 | table | postgres -- ~1.8 bill rows, ~200 GB
    public | data_raw01 | table | postgres -- ~2 bill rows, ~350 GB
    public | data_valid02 | table | postgres -- ~1.8 bill rows, ~200 GB
    public | data_raw02 | table | postgres -- ~2 bill rows, ~350 GB
    public | data_valid03 | table | postgres -- ~1.8 bill rows, ~200 GB
    public | data_raw03 | table | postgres -- ~2 bill rows, ~350 GB
    public | data_valid04 | table | postgres -- ~1.8 bill rows, ~200 GB
    public | data_raw04 | table | postgres -- ~2 bill rows, ~350 GB
    (10 rows)

    -- import csv data into postgres
    --------------
    @echo off
    setlocal
    set PGPASSWORD=admin

    for %%f in (*.csv) do psql -d my_db_201710 -h localhost -U postgres -p 5432 -c "COPY public.data_raw01 FROM '%%~dpnxf' DELIMITER ',' CSV;"

    pause
    endlocal
    --------------

    my_db_201710=# d+ data_raw01
    Table "public.data_raw01"
    Column | Type | Modifiers | Storage | Stats target | Description
    -----------------------------------------+-------------------+-----------+----------+--------------+-------------
    type | integer | | plain | |
    time | bigint | | plain | |
    msg_type | integer | | plain | |
    si_id | character varying | | extended | |
    mi_id | character varying | | extended | |
    ti_id | character varying | | extended | |
    session_id | character varying | | extended | |
    mip_address | character varying | | extended | |
    eip_address | character varying | | extended | |
    mm_id | integer | | plain | |
    en_id | integer | | plain | |

    -- select valid records, columns with some column transformations
    create table data_valid01 as (
    select report_data_time as epoch_time, mi_id, si_id,
    substring(lpad(substring(mi_id, 1,14 ), 14, '0'), 1,8 ) as mi_hdr,
    substring(lpad(substring(mi_id, 1,14 ), 14, '0'), 9,6 ) as mi_sn,
    substring(si_id, 1,5 ) as si_hdr,
    substring(si_id, 6,10 ) as si_sn,
    lpad(to_hex(mm_id),8,'0') as mmid,
    lpad(to_hex(en_id), 6, '0') as enid
    from data_raw01
    where mi_id is not null and mm_id is not null and en_id is not null
    ) ;

    my_db_201710=# d+ data_valid01
    Table "public.data_valid01"
    Column | Type | Modifiers | Storage | Stats target | Description
    -------------+-------------------+-----------+----------+--------------+-------------
    epoch_time | bigint | | plain | |
    si_id | character varying | | extended | |
    mi_id | character varying | | extended | |
    si_hdr | text | | extended | |
    si_sn | text | | extended | |
    mi_hdr | text | | extended | |
    mi_sn | text | | extended | |
    mmid | text | | extended | |
    enid | text | | extended | |

    -- merge distinct from the 4 valid tables and select distinct rows from that
    create table data_distinct as (
    select
    distinct mmid, enid, si_hdr, si_sn, mm_hdr, mm_sn
    from (
    select *
    from data_valid01
    where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
    union all
    select *
    from data_valid02
    where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
    union all
    select *
    from data_valid03
    where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
    union all
    select *
    from data_valid04
    where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
    ) as a
    );


    my_db_201710=# d+ data_distinct
    Table "public.data_distinct"
    Column | Type | Modifiers | Storage | Stats target | Description
    -------------+------+-----------+----------+--------------+-------------
    mmid | text | | extended | |
    enid | text | | extended | |
    si_hdr | text | | extended | |
    si_sn | text | | extended | |
    mi_hdr | text | | extended | |
    mi_sn | text | | extended | |

    -- concatenate some cols, hash * concatenate some other colums
    create extension pgcrypto;
    create table data_final as (
    SELECT id, simi
    FROM
    (
    SELECT id, simi, COUNT(*) OVER (PARTITION BY id) cnt
    FROM
    (
    select
    mmid || ',' || enid as Id,
    si_hdr || encode(substring(digest(si_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') || ',' ||
    mi_hdr || encode(substring(digest(mi_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') as simi
    from data_distinct
    ) t1
    ) t2
    where cnt = 1
    );

    my_db_201710=# d+ data_final
    Table "public.data_final"
    Column | Type | Modifiers | Storage | Stats target | Description
    ----------+------+-----------+----------+--------------+-------------
    id | text | | extended | |
    simi | text | | extended | |


    my_db_201710=#









    share|improve this question
















    bumped to the homepage by Community 3 mins ago


    This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.


















      1












      1








      1








      using Postgres 9.6, pgAdmin 4 (for running the queries) on Windows 10 64-bit with Intel Core i7-6700 CPU @ 3.40GHz, 16GB RAM, 1 TB C drive HDD , 2 TB D drive HDD. Postgres data is stored in D drive



      I'm using Postgres to process some dataset that I receive in csv format- the steps involved are as follows





      1. Create table schema - 4 tables are created, one for each set of csv files with the same schema as per the fields in the input csv files


      2. Import CSVs to Postgres DB - Import around 4 sets of 500 CSV files each of around 4 million rows into own tables - say table01, table02, table03, table04 - I use a batch file to get the list of csv files in each set and use psql -c "copy .. command to import them. This process takes around 30s per csv file to import - so altogether 1000 minutes or ~17hrs . Each table has ~2bill rows and around 350MB per table.


      3. Create a subset of valid records and select relevant columns - For each table in #2 above, create a new table (say table01_valid, table02_valid, table03_valid, table04_valid) which has a subset of columns and exclude some rows based on a criteria - after this operation the number of rows comes down to ~1.5bill per table with each table ~200MB in size - this operation takes ~20hrs to execute for the 4 tables


      4. Select distinct records and merge - in this step the distinct rows from each table is selected, merge them via union all and a final distinct rows from the merged records is selected - this is saved into a new table say table_valid_distinct_1234 - the table size is ~450mill rows and ~35GB size - this operation takes ~12hrs


      5. Run transformations on some columns - in this step for each row, two columns are concatenated and selected as id field , two other columns are hashed, then concatenated and selected as name field. A final select is performed on this set to remove any ids with more than one unique name - the table size is ~50GB with ~450mill rows- this operation takes ~5hrs


      Given the above steps, I wanted to collect feedback on improving processing speed given the number of records, data size and the hardware I'm using. Ideas on improving performance and disk usage in any of the steps from 1 to 5 would be helpful. I could provide additional info needed. Also if there is some sort of standard process to investigate this, that would also be very helpful. If there are hardware related enhancements required - should I be looking at adding more memory, more RAM or something else ?



      Also one of the anomalies I see in pgAdmin which I use to run the queries is that it reports query run time as a few minutes for each of the above operations where as in reality it runs for a number of hours.



      update



      As suggested by @filiprem , adding the schema and query operations below



      -- db tables
      my_db_201710=# d
      List of relations
      Schema | Name | Type | Owner
      --------+----------------------------------------+-------+----------
      public | data_final | table | postgres -- ~450 mill rows, ~50 GB
      public | data_distinct | table | postgres -- ~450 mill rows, ~35 GB
      public | data_valid01 | table | postgres -- ~1.8 bill rows, ~200 GB
      public | data_raw01 | table | postgres -- ~2 bill rows, ~350 GB
      public | data_valid02 | table | postgres -- ~1.8 bill rows, ~200 GB
      public | data_raw02 | table | postgres -- ~2 bill rows, ~350 GB
      public | data_valid03 | table | postgres -- ~1.8 bill rows, ~200 GB
      public | data_raw03 | table | postgres -- ~2 bill rows, ~350 GB
      public | data_valid04 | table | postgres -- ~1.8 bill rows, ~200 GB
      public | data_raw04 | table | postgres -- ~2 bill rows, ~350 GB
      (10 rows)

      -- import csv data into postgres
      --------------
      @echo off
      setlocal
      set PGPASSWORD=admin

      for %%f in (*.csv) do psql -d my_db_201710 -h localhost -U postgres -p 5432 -c "COPY public.data_raw01 FROM '%%~dpnxf' DELIMITER ',' CSV;"

      pause
      endlocal
      --------------

      my_db_201710=# d+ data_raw01
      Table "public.data_raw01"
      Column | Type | Modifiers | Storage | Stats target | Description
      -----------------------------------------+-------------------+-----------+----------+--------------+-------------
      type | integer | | plain | |
      time | bigint | | plain | |
      msg_type | integer | | plain | |
      si_id | character varying | | extended | |
      mi_id | character varying | | extended | |
      ti_id | character varying | | extended | |
      session_id | character varying | | extended | |
      mip_address | character varying | | extended | |
      eip_address | character varying | | extended | |
      mm_id | integer | | plain | |
      en_id | integer | | plain | |

      -- select valid records, columns with some column transformations
      create table data_valid01 as (
      select report_data_time as epoch_time, mi_id, si_id,
      substring(lpad(substring(mi_id, 1,14 ), 14, '0'), 1,8 ) as mi_hdr,
      substring(lpad(substring(mi_id, 1,14 ), 14, '0'), 9,6 ) as mi_sn,
      substring(si_id, 1,5 ) as si_hdr,
      substring(si_id, 6,10 ) as si_sn,
      lpad(to_hex(mm_id),8,'0') as mmid,
      lpad(to_hex(en_id), 6, '0') as enid
      from data_raw01
      where mi_id is not null and mm_id is not null and en_id is not null
      ) ;

      my_db_201710=# d+ data_valid01
      Table "public.data_valid01"
      Column | Type | Modifiers | Storage | Stats target | Description
      -------------+-------------------+-----------+----------+--------------+-------------
      epoch_time | bigint | | plain | |
      si_id | character varying | | extended | |
      mi_id | character varying | | extended | |
      si_hdr | text | | extended | |
      si_sn | text | | extended | |
      mi_hdr | text | | extended | |
      mi_sn | text | | extended | |
      mmid | text | | extended | |
      enid | text | | extended | |

      -- merge distinct from the 4 valid tables and select distinct rows from that
      create table data_distinct as (
      select
      distinct mmid, enid, si_hdr, si_sn, mm_hdr, mm_sn
      from (
      select *
      from data_valid01
      where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
      union all
      select *
      from data_valid02
      where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
      union all
      select *
      from data_valid03
      where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
      union all
      select *
      from data_valid04
      where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
      ) as a
      );


      my_db_201710=# d+ data_distinct
      Table "public.data_distinct"
      Column | Type | Modifiers | Storage | Stats target | Description
      -------------+------+-----------+----------+--------------+-------------
      mmid | text | | extended | |
      enid | text | | extended | |
      si_hdr | text | | extended | |
      si_sn | text | | extended | |
      mi_hdr | text | | extended | |
      mi_sn | text | | extended | |

      -- concatenate some cols, hash * concatenate some other colums
      create extension pgcrypto;
      create table data_final as (
      SELECT id, simi
      FROM
      (
      SELECT id, simi, COUNT(*) OVER (PARTITION BY id) cnt
      FROM
      (
      select
      mmid || ',' || enid as Id,
      si_hdr || encode(substring(digest(si_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') || ',' ||
      mi_hdr || encode(substring(digest(mi_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') as simi
      from data_distinct
      ) t1
      ) t2
      where cnt = 1
      );

      my_db_201710=# d+ data_final
      Table "public.data_final"
      Column | Type | Modifiers | Storage | Stats target | Description
      ----------+------+-----------+----------+--------------+-------------
      id | text | | extended | |
      simi | text | | extended | |


      my_db_201710=#









      share|improve this question
















      using Postgres 9.6, pgAdmin 4 (for running the queries) on Windows 10 64-bit with Intel Core i7-6700 CPU @ 3.40GHz, 16GB RAM, 1 TB C drive HDD , 2 TB D drive HDD. Postgres data is stored in D drive



      I'm using Postgres to process some dataset that I receive in csv format- the steps involved are as follows





      1. Create table schema - 4 tables are created, one for each set of csv files with the same schema as per the fields in the input csv files


      2. Import CSVs to Postgres DB - Import around 4 sets of 500 CSV files each of around 4 million rows into own tables - say table01, table02, table03, table04 - I use a batch file to get the list of csv files in each set and use psql -c "copy .. command to import them. This process takes around 30s per csv file to import - so altogether 1000 minutes or ~17hrs . Each table has ~2bill rows and around 350MB per table.


      3. Create a subset of valid records and select relevant columns - For each table in #2 above, create a new table (say table01_valid, table02_valid, table03_valid, table04_valid) which has a subset of columns and exclude some rows based on a criteria - after this operation the number of rows comes down to ~1.5bill per table with each table ~200MB in size - this operation takes ~20hrs to execute for the 4 tables


      4. Select distinct records and merge - in this step the distinct rows from each table is selected, merge them via union all and a final distinct rows from the merged records is selected - this is saved into a new table say table_valid_distinct_1234 - the table size is ~450mill rows and ~35GB size - this operation takes ~12hrs


      5. Run transformations on some columns - in this step for each row, two columns are concatenated and selected as id field , two other columns are hashed, then concatenated and selected as name field. A final select is performed on this set to remove any ids with more than one unique name - the table size is ~50GB with ~450mill rows- this operation takes ~5hrs


      Given the above steps, I wanted to collect feedback on improving processing speed given the number of records, data size and the hardware I'm using. Ideas on improving performance and disk usage in any of the steps from 1 to 5 would be helpful. I could provide additional info needed. Also if there is some sort of standard process to investigate this, that would also be very helpful. If there are hardware related enhancements required - should I be looking at adding more memory, more RAM or something else ?



      Also one of the anomalies I see in pgAdmin which I use to run the queries is that it reports query run time as a few minutes for each of the above operations where as in reality it runs for a number of hours.



      update



      As suggested by @filiprem , adding the schema and query operations below



      -- db tables
      my_db_201710=# d
      List of relations
      Schema | Name | Type | Owner
      --------+----------------------------------------+-------+----------
      public | data_final | table | postgres -- ~450 mill rows, ~50 GB
      public | data_distinct | table | postgres -- ~450 mill rows, ~35 GB
      public | data_valid01 | table | postgres -- ~1.8 bill rows, ~200 GB
      public | data_raw01 | table | postgres -- ~2 bill rows, ~350 GB
      public | data_valid02 | table | postgres -- ~1.8 bill rows, ~200 GB
      public | data_raw02 | table | postgres -- ~2 bill rows, ~350 GB
      public | data_valid03 | table | postgres -- ~1.8 bill rows, ~200 GB
      public | data_raw03 | table | postgres -- ~2 bill rows, ~350 GB
      public | data_valid04 | table | postgres -- ~1.8 bill rows, ~200 GB
      public | data_raw04 | table | postgres -- ~2 bill rows, ~350 GB
      (10 rows)

      -- import csv data into postgres
      --------------
      @echo off
      setlocal
      set PGPASSWORD=admin

      for %%f in (*.csv) do psql -d my_db_201710 -h localhost -U postgres -p 5432 -c "COPY public.data_raw01 FROM '%%~dpnxf' DELIMITER ',' CSV;"

      pause
      endlocal
      --------------

      my_db_201710=# d+ data_raw01
      Table "public.data_raw01"
      Column | Type | Modifiers | Storage | Stats target | Description
      -----------------------------------------+-------------------+-----------+----------+--------------+-------------
      type | integer | | plain | |
      time | bigint | | plain | |
      msg_type | integer | | plain | |
      si_id | character varying | | extended | |
      mi_id | character varying | | extended | |
      ti_id | character varying | | extended | |
      session_id | character varying | | extended | |
      mip_address | character varying | | extended | |
      eip_address | character varying | | extended | |
      mm_id | integer | | plain | |
      en_id | integer | | plain | |

      -- select valid records, columns with some column transformations
      create table data_valid01 as (
      select report_data_time as epoch_time, mi_id, si_id,
      substring(lpad(substring(mi_id, 1,14 ), 14, '0'), 1,8 ) as mi_hdr,
      substring(lpad(substring(mi_id, 1,14 ), 14, '0'), 9,6 ) as mi_sn,
      substring(si_id, 1,5 ) as si_hdr,
      substring(si_id, 6,10 ) as si_sn,
      lpad(to_hex(mm_id),8,'0') as mmid,
      lpad(to_hex(en_id), 6, '0') as enid
      from data_raw01
      where mi_id is not null and mm_id is not null and en_id is not null
      ) ;

      my_db_201710=# d+ data_valid01
      Table "public.data_valid01"
      Column | Type | Modifiers | Storage | Stats target | Description
      -------------+-------------------+-----------+----------+--------------+-------------
      epoch_time | bigint | | plain | |
      si_id | character varying | | extended | |
      mi_id | character varying | | extended | |
      si_hdr | text | | extended | |
      si_sn | text | | extended | |
      mi_hdr | text | | extended | |
      mi_sn | text | | extended | |
      mmid | text | | extended | |
      enid | text | | extended | |

      -- merge distinct from the 4 valid tables and select distinct rows from that
      create table data_distinct as (
      select
      distinct mmid, enid, si_hdr, si_sn, mm_hdr, mm_sn
      from (
      select *
      from data_valid01
      where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
      union all
      select *
      from data_valid02
      where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
      union all
      select *
      from data_valid03
      where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
      union all
      select *
      from data_valid04
      where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
      ) as a
      );


      my_db_201710=# d+ data_distinct
      Table "public.data_distinct"
      Column | Type | Modifiers | Storage | Stats target | Description
      -------------+------+-----------+----------+--------------+-------------
      mmid | text | | extended | |
      enid | text | | extended | |
      si_hdr | text | | extended | |
      si_sn | text | | extended | |
      mi_hdr | text | | extended | |
      mi_sn | text | | extended | |

      -- concatenate some cols, hash * concatenate some other colums
      create extension pgcrypto;
      create table data_final as (
      SELECT id, simi
      FROM
      (
      SELECT id, simi, COUNT(*) OVER (PARTITION BY id) cnt
      FROM
      (
      select
      mmid || ',' || enid as Id,
      si_hdr || encode(substring(digest(si_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') || ',' ||
      mi_hdr || encode(substring(digest(mi_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') as simi
      from data_distinct
      ) t1
      ) t2
      where cnt = 1
      );

      my_db_201710=# d+ data_final
      Table "public.data_final"
      Column | Type | Modifiers | Storage | Stats target | Description
      ----------+------+-----------+----------+--------------+-------------
      id | text | | extended | |
      simi | text | | extended | |


      my_db_201710=#






      postgresql






      share|improve this question















      share|improve this question













      share|improve this question




      share|improve this question








      edited Nov 26 '17 at 7:02







      user3206440

















      asked Nov 25 '17 at 9:23









      user3206440user3206440

      1062




      1062





      bumped to the homepage by Community 3 mins ago


      This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.







      bumped to the homepage by Community 3 mins ago


      This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
























          1 Answer
          1






          active

          oldest

          votes


















          0














          Interesting question but also very open one. I'm putting a list of recommendations here - hope it helps.




          • Publish the full schema and data samples if you want more people work on this question.


          • publish the postgresql.conf


          • Follow the official PostgreSQL Performance hints.


          • Use parallel processing when possible. On modern 4-core desktop CPU you can start with 2-3 threads. Observe system CPU usage (kernel time vs user time) while running the import. COPY is CPU-intensive.


          • If you don't need durability and crash safety, follow Non-Durable Settings


          • Consider combining Step 2 and step 3 (filter on the fly) if you don't need the invaliid data in postgres.


          • Consider plain UPDATE instead of copying the same data into new tables. It might be faster or slower depending on the structure/constraints.


          • Before loading, remove all but absolutely required indexes. Recreate them after loading.


          • Re hardware investments, purchasing SSD drive will have highest impact/cost ratio.







          share|improve this answer
























          • as suggested I have added the full schema and the queries here.

            – user3206440
            Nov 26 '17 at 6:47











          • Thanks. It looks like not a single shot question but rather an optimization task which will take fair time to solve. Try to track your I/O and CPU during the process - this would be very valuable input.

            – filiprem
            Nov 28 '17 at 11:02











          Your Answer








          StackExchange.ready(function() {
          var channelOptions = {
          tags: "".split(" "),
          id: "182"
          };
          initTagRenderer("".split(" "), "".split(" "), channelOptions);

          StackExchange.using("externalEditor", function() {
          // Have to fire editor after snippets, if snippets enabled
          if (StackExchange.settings.snippets.snippetsEnabled) {
          StackExchange.using("snippets", function() {
          createEditor();
          });
          }
          else {
          createEditor();
          }
          });

          function createEditor() {
          StackExchange.prepareEditor({
          heartbeatType: 'answer',
          autoActivateHeartbeat: false,
          convertImagesToLinks: false,
          noModals: true,
          showLowRepImageUploadWarning: true,
          reputationToPostImages: null,
          bindNavPrevention: true,
          postfix: "",
          imageUploader: {
          brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
          contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
          allowUrls: true
          },
          onDemand: true,
          discardSelector: ".discard-answer"
          ,immediatelyShowMarkdownHelp:true
          });


          }
          });














          draft saved

          draft discarded


















          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f191703%2fimprove-postgres-query-perfromance-on-windows-with-billions-of-records-tb-data%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown

























          1 Answer
          1






          active

          oldest

          votes








          1 Answer
          1






          active

          oldest

          votes









          active

          oldest

          votes






          active

          oldest

          votes









          0














          Interesting question but also very open one. I'm putting a list of recommendations here - hope it helps.




          • Publish the full schema and data samples if you want more people work on this question.


          • publish the postgresql.conf


          • Follow the official PostgreSQL Performance hints.


          • Use parallel processing when possible. On modern 4-core desktop CPU you can start with 2-3 threads. Observe system CPU usage (kernel time vs user time) while running the import. COPY is CPU-intensive.


          • If you don't need durability and crash safety, follow Non-Durable Settings


          • Consider combining Step 2 and step 3 (filter on the fly) if you don't need the invaliid data in postgres.


          • Consider plain UPDATE instead of copying the same data into new tables. It might be faster or slower depending on the structure/constraints.


          • Before loading, remove all but absolutely required indexes. Recreate them after loading.


          • Re hardware investments, purchasing SSD drive will have highest impact/cost ratio.







          share|improve this answer
























          • as suggested I have added the full schema and the queries here.

            – user3206440
            Nov 26 '17 at 6:47











          • Thanks. It looks like not a single shot question but rather an optimization task which will take fair time to solve. Try to track your I/O and CPU during the process - this would be very valuable input.

            – filiprem
            Nov 28 '17 at 11:02
















          0














          Interesting question but also very open one. I'm putting a list of recommendations here - hope it helps.




          • Publish the full schema and data samples if you want more people work on this question.


          • publish the postgresql.conf


          • Follow the official PostgreSQL Performance hints.


          • Use parallel processing when possible. On modern 4-core desktop CPU you can start with 2-3 threads. Observe system CPU usage (kernel time vs user time) while running the import. COPY is CPU-intensive.


          • If you don't need durability and crash safety, follow Non-Durable Settings


          • Consider combining Step 2 and step 3 (filter on the fly) if you don't need the invaliid data in postgres.


          • Consider plain UPDATE instead of copying the same data into new tables. It might be faster or slower depending on the structure/constraints.


          • Before loading, remove all but absolutely required indexes. Recreate them after loading.


          • Re hardware investments, purchasing SSD drive will have highest impact/cost ratio.







          share|improve this answer
























          • as suggested I have added the full schema and the queries here.

            – user3206440
            Nov 26 '17 at 6:47











          • Thanks. It looks like not a single shot question but rather an optimization task which will take fair time to solve. Try to track your I/O and CPU during the process - this would be very valuable input.

            – filiprem
            Nov 28 '17 at 11:02














          0












          0








          0







          Interesting question but also very open one. I'm putting a list of recommendations here - hope it helps.




          • Publish the full schema and data samples if you want more people work on this question.


          • publish the postgresql.conf


          • Follow the official PostgreSQL Performance hints.


          • Use parallel processing when possible. On modern 4-core desktop CPU you can start with 2-3 threads. Observe system CPU usage (kernel time vs user time) while running the import. COPY is CPU-intensive.


          • If you don't need durability and crash safety, follow Non-Durable Settings


          • Consider combining Step 2 and step 3 (filter on the fly) if you don't need the invaliid data in postgres.


          • Consider plain UPDATE instead of copying the same data into new tables. It might be faster or slower depending on the structure/constraints.


          • Before loading, remove all but absolutely required indexes. Recreate them after loading.


          • Re hardware investments, purchasing SSD drive will have highest impact/cost ratio.







          share|improve this answer













          Interesting question but also very open one. I'm putting a list of recommendations here - hope it helps.




          • Publish the full schema and data samples if you want more people work on this question.


          • publish the postgresql.conf


          • Follow the official PostgreSQL Performance hints.


          • Use parallel processing when possible. On modern 4-core desktop CPU you can start with 2-3 threads. Observe system CPU usage (kernel time vs user time) while running the import. COPY is CPU-intensive.


          • If you don't need durability and crash safety, follow Non-Durable Settings


          • Consider combining Step 2 and step 3 (filter on the fly) if you don't need the invaliid data in postgres.


          • Consider plain UPDATE instead of copying the same data into new tables. It might be faster or slower depending on the structure/constraints.


          • Before loading, remove all but absolutely required indexes. Recreate them after loading.


          • Re hardware investments, purchasing SSD drive will have highest impact/cost ratio.








          share|improve this answer












          share|improve this answer



          share|improve this answer










          answered Nov 25 '17 at 15:43









          filipremfiliprem

          3,26111226




          3,26111226













          • as suggested I have added the full schema and the queries here.

            – user3206440
            Nov 26 '17 at 6:47











          • Thanks. It looks like not a single shot question but rather an optimization task which will take fair time to solve. Try to track your I/O and CPU during the process - this would be very valuable input.

            – filiprem
            Nov 28 '17 at 11:02



















          • as suggested I have added the full schema and the queries here.

            – user3206440
            Nov 26 '17 at 6:47











          • Thanks. It looks like not a single shot question but rather an optimization task which will take fair time to solve. Try to track your I/O and CPU during the process - this would be very valuable input.

            – filiprem
            Nov 28 '17 at 11:02

















          as suggested I have added the full schema and the queries here.

          – user3206440
          Nov 26 '17 at 6:47





          as suggested I have added the full schema and the queries here.

          – user3206440
          Nov 26 '17 at 6:47













          Thanks. It looks like not a single shot question but rather an optimization task which will take fair time to solve. Try to track your I/O and CPU during the process - this would be very valuable input.

          – filiprem
          Nov 28 '17 at 11:02





          Thanks. It looks like not a single shot question but rather an optimization task which will take fair time to solve. Try to track your I/O and CPU during the process - this would be very valuable input.

          – filiprem
          Nov 28 '17 at 11:02


















          draft saved

          draft discarded




















































          Thanks for contributing an answer to Database Administrators Stack Exchange!


          • Please be sure to answer the question. Provide details and share your research!

          But avoid



          • Asking for help, clarification, or responding to other answers.

          • Making statements based on opinion; back them up with references or personal experience.


          To learn more, see our tips on writing great answers.




          draft saved


          draft discarded














          StackExchange.ready(
          function () {
          StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f191703%2fimprove-postgres-query-perfromance-on-windows-with-billions-of-records-tb-data%23new-answer', 'question_page');
          }
          );

          Post as a guest















          Required, but never shown





















































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown

































          Required, but never shown














          Required, but never shown












          Required, but never shown







          Required, but never shown







          Popular posts from this blog

          الفوسفات في المغرب

          Four equal circles intersect: What is the area of the small shaded portion and its height

          بطل الاتحاد السوفيتي