Removing Outliers Grouping by Minute

I have a very simple table storing the following fields per http request:




  • id serial

  • url varchar

  • date varchar (yyyy-mm-dd)

  • time varchar (hh:mm)

  • latency integer
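
For reference, here is a rough DDL sketch matching the list above (the column names mdate and mtime follow the query below, so treat this as an assumption, not the actual schema):

-- Hypothetical definition of the table described above.
create table mtable (
    id      serial primary key,
    url     varchar,
    mdate   varchar(10),  -- 'yyyy-mm-dd'
    mtime   varchar(5),   -- 'hh:mm'
    latency integer       -- per the question, apparently in seconds
);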


I am trying to calculate the average latency per minute; however, if I have 10 requests (9 of 1 s and 1 of 100 s) the average becomes useless. To work around that, I was considering removing the outliers.



So, basically, what I want is to calculate the average excluding the requests whose latency falls above the 99.7% cut-off, and naturally I want to group those requests per minute in order to follow the "average" latency minute by minute.



My query is as follows:



select m.mdate, m.mtime, avg(m.latency)
from mtable m
join (
    select mdate, mtime, avg(latency) + 3 * stddev(latency) as uband
    from mtable
    where url like '%ThePartialUrl%'
    group by mdate, mtime
) as t
  on m.mdate = t.mdate
 and m.mtime = t.mtime
 and m.latency < t.uband
where m.url like '%ThePartialUrl%'
group by m.mdate, m.mtime
order by m.mdate, m.mtime


However, it is taking far too long.
Things I already know / have done:




  • url like '%...%' requires a full scan, regardless of indexes

  • mdate and mtime could be merged into a single timestamp column (see the sketch after this list)

  • I have indexes on almost everything (mdate, mtime, latency)

  • I have tried adjusting the work_mem parameter
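
For illustration, a sketch of how the two varchar columns could be combined into a real timestamp column (the column name mts is made up here; the formats are assumed to be exactly yyyy-mm-dd and hh:mm):

-- One-off migration sketch; on ~100M rows the UPDATE rewrites the whole table,
-- so it should be run in batches or during a quiet window.
alter table mtable add column mts timestamp;

update mtable
set mts = to_timestamp(mdate || ' ' || mtime, 'YYYY-MM-DD HH24:MI')::timestamp;

-- An index on the new column then supports per-minute grouping and range scans.
create index on mtable (mts);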


The table itself has around 100M rows in total.



Looking at the execution plan, it seems the nested loop (join) is what takes most of the time: the subquery returns about 40K rows (1 per minute over a whole month) and has to be joined against roughly 10M rows matching '%ThePartialUrl%', resulting in around 4 billion loop iterations.
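
For what it is worth, a single-pass variant of the same 3-sigma threshold using window functions (untested here; same table and filter assumed) avoids the self-join entirely:

-- Compute the per-minute cut-off with window functions over the filtered rows,
-- then drop the outliers and aggregate; only one scan of the ~10M matching rows.
select mdate, mtime, avg(latency)
from (
    select mdate, mtime, latency,
           avg(latency)        over (partition by mdate, mtime)
         + 3 * stddev(latency) over (partition by mdate, mtime) as uband
    from mtable
    where url like '%ThePartialUrl%'
) s
where latency < uband
group by mdate, mtime
order by mdate, mtime;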



Anything that could help me improve it?



I am using postgres 10.



Thanks and Regards


postgresql query

edited Aug 5 '18 at 17:40 by giscard.faria
asked Aug 5 '18 at 17:32 by giscard.faria

  • You're looking for what's called a trimmed mean: stats.stackexchange.com/questions/16198/…. Also have a look at: stackoverflow.com/questions/2934372/…

    – Linas
    Aug 5 '18 at 18:31











  • Please show us the execution plan, perhaps using explain.depesz.com to host it.

    – jjanes
    Aug 5 '18 at 19:16

1 Answer

It seems there is no way to solve the problem without using a subquery, which is a performance constraint in my case, so I decided to move to another approach.



I gave up on the trimmed mean (thanks @Linas) and moved to a percentile_disc indicator. Since all I wanted was a single latency measure per minute, I used a simple query as follows:



select
    mdate,
    mtime,
    percentile_disc(0.997) within group (order by latency) as response
from mtable
where url like '%MyPartialUrl%'
group by mdate, mtime
order by mdate, mtime


This way I get the largest latency it took to answer a request for 99.7% of all users. In fact, I am grabbing several percentiles, following a six-sigma approach, to show how much slowness per million requests we are getting in production.
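
For illustration, several percentiles can be pulled in the same pass; the levels below are just example cut-offs, not necessarily the ones used:

-- One row per minute with a few percentile levels of the latency distribution.
select
    mdate,
    mtime,
    percentile_disc(0.5)   within group (order by latency) as p50,
    percentile_disc(0.95)  within group (order by latency) as p95,
    percentile_disc(0.997) within group (order by latency) as p997
from mtable
where url like '%MyPartialUrl%'
group by mdate, mtime
order by mdate, mtime;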



The query takes no more than 8 seconds to execute; with the trimmed-mean approach it was around 90 minutes.



Thanks all.
Regards

answered Aug 6 '18 at 0:13 by giscard.faria