Improve Postgres query performance on Windows with billions of records, TB data size
Using Postgres 9.6 and pgAdmin 4 (for running the queries) on Windows 10 64-bit, with an Intel Core i7-6700 CPU @ 3.40GHz, 16GB RAM, a 1 TB C drive (HDD) and a 2 TB D drive (HDD). The Postgres data directory is on the D drive.
I'm using Postgres to process a dataset that I receive in CSV format. The steps involved are as follows:
Create table schema - 4 tables are created, one for each set of CSV files, with the same schema as the fields in the input CSV files.
Import CSVs to Postgres DB - Import around 4 sets of 500 CSV files (each of around 4 million rows) into their own tables - say table01, table02, table03, table04. I use a batch file to get the list of CSV files in each set and a psql -c "COPY ..." command to import them. This takes around 30 s per CSV file, so altogether 1000 minutes or ~17 hrs. Each table ends up with ~2 billion rows and around 350 GB in size.
Create a subset of valid records and select relevant columns - For each table in #2 above, create a new table (say table01_valid, table02_valid, table03_valid, table04_valid) that has a subset of the columns and excludes some rows based on a criterion. After this operation the row count comes down to ~1.5 billion per table, with each table ~200 GB in size. This operation takes ~20 hrs for the 4 tables.
Select distinct records and merge - In this step the rows from each table are merged via UNION ALL and the distinct rows of the merged records are selected. This is saved into a new table - say table_valid_distinct_1234. The table has ~450 million rows and is ~35 GB in size; this operation takes ~12 hrs.
Run transformations on some columns - In this step, for each row, two columns are concatenated and selected as the id field, and two other columns are hashed, then concatenated and selected as the name field. A final select is performed on this set to remove any ids with more than one unique name. The resulting table is ~50 GB with ~450 million rows; this operation takes ~5 hrs.
Given the above steps, I wanted to collect feedback on improving processing speed given the number of records, the data size and the hardware I'm using. Ideas on improving performance and disk usage in any of steps 1 to 5 would be helpful, and I can provide additional info if needed. If there is some sort of standard process for investigating this, that would also be very helpful. If hardware-related enhancements are required, should I be looking at adding more RAM or something else?
Also, one of the anomalies I see in pgAdmin (which I use to run the queries) is that it reports the query run time as a few minutes for each of the above operations, whereas in reality each runs for a number of hours.
Update
As suggested by @filiprem, I'm adding the schema and query operations below.
-- db tables
my_db_201710=# \d
List of relations
Schema | Name | Type | Owner
--------+----------------------------------------+-------+----------
public | data_final | table | postgres -- ~450 mill rows, ~50 GB
public | data_distinct | table | postgres -- ~450 mill rows, ~35 GB
public | data_valid01 | table | postgres -- ~1.8 bill rows, ~200 GB
public | data_raw01 | table | postgres -- ~2 bill rows, ~350 GB
public | data_valid02 | table | postgres -- ~1.8 bill rows, ~200 GB
public | data_raw02 | table | postgres -- ~2 bill rows, ~350 GB
public | data_valid03 | table | postgres -- ~1.8 bill rows, ~200 GB
public | data_raw03 | table | postgres -- ~2 bill rows, ~350 GB
public | data_valid04 | table | postgres -- ~1.8 bill rows, ~200 GB
public | data_raw04 | table | postgres -- ~2 bill rows, ~350 GB
(10 rows)
-- import csv data into postgres
--------------
@echo off
setlocal
set PGPASSWORD=admin
for %%f in (*.csv) do psql -d my_db_201710 -h localhost -U postgres -p 5432 -c "COPY public.data_raw01 FROM '%%~dpnxf' DELIMITER ',' CSV;"
pause
endlocal
--------------
my_db_201710=# \d+ data_raw01
Table "public.data_raw01"
Column | Type | Modifiers | Storage | Stats target | Description
-----------------------------------------+-------------------+-----------+----------+--------------+-------------
type | integer | | plain | |
time | bigint | | plain | |
msg_type | integer | | plain | |
si_id | character varying | | extended | |
mi_id | character varying | | extended | |
ti_id | character varying | | extended | |
session_id | character varying | | extended | |
mip_address | character varying | | extended | |
eip_address | character varying | | extended | |
mm_id | integer | | plain | |
en_id | integer | | plain | |
-- select valid records and columns, with some column transformations
create table data_valid01 as (
select "time" as epoch_time, mi_id, si_id,
substring(lpad(substring(mi_id, 1, 14), 14, '0'), 1, 8) as mi_hdr,
substring(lpad(substring(mi_id, 1, 14), 14, '0'), 9, 6) as mi_sn,
substring(si_id, 1, 5) as si_hdr,
substring(si_id, 6, 10) as si_sn,
lpad(to_hex(mm_id), 8, '0') as mmid,
lpad(to_hex(en_id), 6, '0') as enid
from data_raw01
where mi_id is not null and mm_id is not null and en_id is not null
);
my_db_201710=# \d+ data_valid01
Table "public.data_valid01"
Column | Type | Modifiers | Storage | Stats target | Description
-------------+-------------------+-----------+----------+--------------+-------------
epoch_time | bigint | | plain | |
si_id | character varying | | extended | |
mi_id | character varying | | extended | |
si_hdr | text | | extended | |
si_sn | text | | extended | |
mi_hdr | text | | extended | |
mi_sn | text | | extended | |
mmid | text | | extended | |
enid | text | | extended | |
-- merge the 4 valid tables via union all and select the distinct rows from the result
create table data_distinct as (
select
distinct mmid, enid, si_hdr, si_sn, mi_hdr, mi_sn
from (
select *
from data_valid01
where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
union all
select *
from data_valid02
where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
union all
select *
from data_valid03
where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
union all
select *
from data_valid04
where epoch_time >= 1510202400000 and epoch_time <= 1510224000000
) as a
);
my_db_201710=# \d+ data_distinct
Table "public.data_distinct"
Column | Type | Modifiers | Storage | Stats target | Description
-------------+------+-----------+----------+--------------+-------------
mmid | text | | extended | |
enid | text | | extended | |
si_hdr | text | | extended | |
si_sn | text | | extended | |
mi_hdr | text | | extended | |
mi_sn | text | | extended | |
-- concatenate some columns; hash and concatenate some other columns
create extension pgcrypto;
create table data_final as (
SELECT id, simi
FROM
(
SELECT id, simi, COUNT(*) OVER (PARTITION BY id) cnt
FROM
(
select
mmid || ',' || enid as Id,
si_hdr || encode(substring(digest(si_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') || ',' ||
mi_hdr || encode(substring(digest(mi_sn || 'Maria Sharapovan' , 'sha256') from 17), 'base64') as simi
from data_distinct
) t1
) t2
where cnt = 1
);
my_db_201710=# \d+ data_final
Table "public.data_final"
Column | Type | Modifiers | Storage | Stats target | Description
----------+------+-----------+----------+--------------+-------------
id | text | | extended | |
simi | text | | extended | |
my_db_201710=#
postgresql
asked Nov 25 '17 at 9:23 by user3206440, edited Nov 26 '17 at 7:02
1 Answer
Interesting question, but also a very open one. I'm putting a list of recommendations here - hope it helps.
Publish the full schema and data samples if you want more people to work on this question.
Publish your postgresql.conf.
Follow the official PostgreSQL Performance hints.
Use parallel processing when possible. On a modern 4-core desktop CPU you can start with 2-3 concurrent sessions. Observe system CPU usage (kernel time vs user time) while running the import; COPY is CPU-intensive.
If you don't need durability and crash safety, follow the Non-Durable Settings documentation (a sketch follows this list).
Consider combining step 2 and step 3 (filter on the fly) if you don't need the invalid data in Postgres (see the second sketch after this list).
Consider a plain UPDATE instead of copying the same data into new tables. It might be faster or slower depending on the structure/constraints.
Before loading, remove all but the absolutely required indexes. Recreate them after loading.
Re hardware investments, purchasing an SSD drive will have the highest impact/cost ratio.
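answered Nov 25 '17 at 15:43 by filiprem

To make the non-durable-settings point concrete, here is a minimal sketch, assuming the whole database is disposable (everything can be re-imported from the source CSVs if the server crashes). The specific values are illustrative, not tuned recommendations.

-- relax durability for a throw-away bulk-load database (values are examples only)
-- WARNING: fsync = off and full_page_writes = off can corrupt the cluster on a crash;
-- only use them when the data can be rebuilt from the source CSVs
alter system set synchronous_commit = off;     -- don't wait for WAL flush at commit
alter system set fsync = off;
alter system set full_page_writes = off;
alter system set checkpoint_timeout = '30min'; -- fewer, larger checkpoints during the load
alter system set max_wal_size = '10GB';
select pg_reload_conf();                       -- the settings above take effect on reload
-- loading into UNLOGGED tables additionally skips WAL for the bulk writes:
-- create unlogged table public.data_raw01 ( ... same columns as above ... );

And a sketch of combining step 2 and step 3 with file_fdw (a standard contrib module), so each CSV is filtered while it is read instead of first being materialized in a ~350 GB raw table. The server name, foreign table name, file path, and the assumption that data_valid01 already exists as an empty (ideally unlogged) table are all illustrative, not taken from the original post.

-- read a CSV directly as a foreign table and filter it on the fly
create extension if not exists file_fdw;
create server csv_server foreign data wrapper file_fdw;
create foreign table data_raw01_csv (
  type        integer,
  "time"      bigint,
  msg_type    integer,
  si_id       character varying,
  mi_id       character varying,
  ti_id       character varying,
  session_id  character varying,
  mip_address character varying,
  eip_address character varying,
  mm_id       integer,
  en_id       integer
) server csv_server
  options (filename 'D:\csv\set01\file_0001.csv', format 'csv', delimiter ',');
-- populate the valid table directly, skipping the raw copy; repoint the foreign table
-- at the next file with: alter foreign table data_raw01_csv options (set filename '...');
insert into data_valid01 (epoch_time, mi_id, si_id, mi_hdr, mi_sn, si_hdr, si_sn, mmid, enid)
select "time", mi_id, si_id,
  substring(lpad(substring(mi_id, 1, 14), 14, '0'), 1, 8),
  substring(lpad(substring(mi_id, 1, 14), 14, '0'), 9, 6),
  substring(si_id, 1, 5),
  substring(si_id, 6, 10),
  lpad(to_hex(mm_id), 8, '0'),
  lpad(to_hex(en_id), 6, '0')
from data_raw01_csv
where mi_id is not null and mm_id is not null and en_id is not null;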
as suggested I have added the full schema and the queries here. – user3206440 Nov 26 '17 at 6:47
Thanks. It looks like not a single-shot question but rather an optimization task which will take a fair amount of time to solve. Try to track your I/O and CPU during the process - this would be very valuable input. – filiprem Nov 28 '17 at 11:02