Removing Outliers Grouping by Minute
I have a very simple table storing the following fields per HTTP request:
- id serial
- url varchar
- date varchar (yyyy-mm-dd)
- time varchar (hh:mm)
- latency integer
I am trying to calculate the average latency per minute. However, if I have 10 requests (9 taking 1 s and 1 taking 100 s), the average becomes useless, so I was wondering about removing outliers.
So, basically what I want is to calculate the average excluding requests whose latency is above the 99.7% threshold (i.e. mean + 3 standard deviations). Naturally, I want to group those requests per minute in order to follow the "average" latency per minute.
My query is as follows:
select m.mdate, m.mtime, avg(m.latency)
from mtable m
join (
    select mdate, mtime, avg(latency) + 3 * stddev(latency) as "uband"
    from mtable
    where url like '%ThePartialUrl%'
    group by mdate, mtime
) as t
  on m.mdate = t.mdate
 and m.mtime = t.mtime
 and m.latency < t.uband
where m.url like '%ThePartialUrl%'
group by m.mdate, m.mtime
order by m.mdate, m.mtime
However, it is taking too much time.
Things I already know or have done:
- url LIKE '%...%' requires a full scan regardless of a b-tree index (but see the trigram-index sketch below)
- mdate and mtime could be one column using a timestamp value
- I have an index for almost everything (mdate, mtime, latency)
- adjusting the work_mem parameter
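That said, the full scan is not strictly unavoidable: PostgreSQL's pg_trgm extension provides trigram GIN indexes that can serve LIKE '%...%' predicates. A minimal sketch (the index name is illustrative):

create extension if not exists pg_trgm;
-- gin_trgm_ops lets the planner use this index for leading-wildcard LIKE
create index mtable_url_trgm_idx on mtable using gin (url gin_trgm_ops);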
The table itself has around 100M rows in total.
Looking at the execution plan, it seems the nested loop (join) is what takes most of the time: the subquery returns 40K rows (1 per minute over a whole month), and it has to be joined against around 10M rows matching '%ThePartialUrl%', resulting in billions of loop iterations.
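One rewrite I am considering (an untested sketch, same table and filter as above) computes the per-minute threshold with window aggregates instead of a join, so the ~10M matching rows are scanned once rather than re-probed per minute:

select mdate, mtime, avg(latency)
from (
    -- compute mean + 3*stddev per minute as window aggregates
    select mdate, mtime, latency,
           avg(latency) over (partition by mdate, mtime)
           + 3 * stddev(latency) over (partition by mdate, mtime) as uband
    from mtable
    where url like '%ThePartialUrl%'
) s
where latency < uband
group by mdate, mtime
order by mdate, mtime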
Is there anything that could help me improve it?
I am using PostgreSQL 10.
Thanks and Regards
Tags: postgresql, query
edited Aug 5 '18 at 17:40 · asked Aug 5 '18 at 17:32 by giscard.faria
You're looking for what's called a trimmed mean: stats.stackexchange.com/questions/16198/…. Also have a look at: stackoverflow.com/questions/2934372/… – Linas, Aug 5 '18 at 18:31
Please show us the execution plan, perhaps using explain.depesz.com to host it. – jjanes, Aug 5 '18 at 19:16
1 Answer
It seems there is no way to solve the problem without a subquery, which is a performance constraint in my case, so I decided to move to another approach.
I gave up on the trimmed mean (thanks @Linas) and moved to a percentile_disc indicator. Since all I wanted was a single latency measure per minute, I wrote a simple query as follows:
select
    mdate,
    mtime,
    percentile_disc(0.997) within group (order by latency) as "response"
from mtable
where url like '%MyPartialUrl%'
group by mdate, mtime
order by mdate, mtime
This way I get the highest latency experienced by 99.7% of all requests. In fact, I am grabbing several percentiles, following a six-sigma approach, to show how much slowness per million requests we are getting in production.
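To grab several percentiles in a single pass, percentile_disc also accepts an array of fractions and returns an array of latencies (the fractions below are just examples):

select
    mdate,
    mtime,
    -- one array element per requested fraction, in order
    percentile_disc(array[0.5, 0.9, 0.997])
        within group (order by latency) as "responses"
from mtable
where url like '%MyPartialUrl%'
group by mdate, mtime
order by mdate, mtime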
The query takes no more than 8 seconds to execute; the trimmed-mean version before took around 90 minutes.
Thanks all.
Regards
answered Aug 6 '18 at 0:13 by giscard.faria