Removing Outliers Grouping by Minute












2















I have a very simple table storing the following fields per http request:




  • id serial

  • url varchar

  • date varchar (yyyy-mm-dd)

  • time varchar (hh:mm)

  • latency integer


And I am trying to calculate average latency per minute, however if i have 10 requests (9 of 1s and 1 of 100s) then average becomes useless. In order to solve that I was wondering about remove outliers.



So, basically what I want is to calculate the average excluding request which latency is bigger then 99.7%, naturally I want to group those requests per minute in order to follow "average" latency per minute.



My query is as follows:



select mdate,mtime,avg(latency)
from mtable
join (
select mdate,mtime,avg(latency) + 3*stddev(latency) as "uband"
from mtable
where url like '%ThePartialUrl%'
group by, mdate, mtime
) as t
on mdate=t.mdate and mtime=t.mtime and latency < t.uband
where url like '%ThePartialUrl%'
group by mdate,mtime
order by mdate, mtime


However it is taking too much time.
Things I've already know/done:




  • url like requires a full scan regardless of index

  • mdate and mtime could be one column using datetime value

  • I have index for almost everything (mdate,mtime,latency)

  • ajusting work_mem parameters


The table itself has around 100M rows at total.



Looking at the execution plan it seems the nested loop (join) is the one taking most time since the subquery returns 40K rows (1 per minute during a whole month) and it shall be joined around to 10M rows for the specific '%ThePartialUrl%', resulting in 4 a billion loop.



Anything that could help me improve it?



I am using postgres 10.



Thanks and Regards










share|improve this question
















bumped to the homepage by Community 3 mins ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
















  • You're looking for what's called a trimmed mean: stats.stackexchange.com/questions/16198/…. Also have a look at: stackoverflow.com/questions/2934372/…

    – Linas
    Aug 5 '18 at 18:31











  • Please show us the execution plan, perhaps using explain.depesz.com to host it.

    – jjanes
    Aug 5 '18 at 19:16
















2















I have a very simple table storing the following fields per http request:




  • id serial

  • url varchar

  • date varchar (yyyy-mm-dd)

  • time varchar (hh:mm)

  • latency integer


And I am trying to calculate average latency per minute, however if i have 10 requests (9 of 1s and 1 of 100s) then average becomes useless. In order to solve that I was wondering about remove outliers.



So, basically what I want is to calculate the average excluding request which latency is bigger then 99.7%, naturally I want to group those requests per minute in order to follow "average" latency per minute.



My query is as follows:



select mdate,mtime,avg(latency)
from mtable
join (
select mdate,mtime,avg(latency) + 3*stddev(latency) as "uband"
from mtable
where url like '%ThePartialUrl%'
group by, mdate, mtime
) as t
on mdate=t.mdate and mtime=t.mtime and latency < t.uband
where url like '%ThePartialUrl%'
group by mdate,mtime
order by mdate, mtime


However it is taking too much time.
Things I've already know/done:




  • url like requires a full scan regardless of index

  • mdate and mtime could be one column using datetime value

  • I have index for almost everything (mdate,mtime,latency)

  • ajusting work_mem parameters


The table itself has around 100M rows at total.



Looking at the execution plan it seems the nested loop (join) is the one taking most time since the subquery returns 40K rows (1 per minute during a whole month) and it shall be joined around to 10M rows for the specific '%ThePartialUrl%', resulting in 4 a billion loop.



Anything that could help me improve it?



I am using postgres 10.



Thanks and Regards










share|improve this question
















bumped to the homepage by Community 3 mins ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.
















  • You're looking for what's called a trimmed mean: stats.stackexchange.com/questions/16198/…. Also have a look at: stackoverflow.com/questions/2934372/…

    – Linas
    Aug 5 '18 at 18:31











  • Please show us the execution plan, perhaps using explain.depesz.com to host it.

    – jjanes
    Aug 5 '18 at 19:16














2












2








2








I have a very simple table storing the following fields per http request:




  • id serial

  • url varchar

  • date varchar (yyyy-mm-dd)

  • time varchar (hh:mm)

  • latency integer


And I am trying to calculate average latency per minute, however if i have 10 requests (9 of 1s and 1 of 100s) then average becomes useless. In order to solve that I was wondering about remove outliers.



So, basically what I want is to calculate the average excluding request which latency is bigger then 99.7%, naturally I want to group those requests per minute in order to follow "average" latency per minute.



My query is as follows:



select mdate,mtime,avg(latency)
from mtable
join (
select mdate,mtime,avg(latency) + 3*stddev(latency) as "uband"
from mtable
where url like '%ThePartialUrl%'
group by, mdate, mtime
) as t
on mdate=t.mdate and mtime=t.mtime and latency < t.uband
where url like '%ThePartialUrl%'
group by mdate,mtime
order by mdate, mtime


However it is taking too much time.
Things I've already know/done:




  • url like requires a full scan regardless of index

  • mdate and mtime could be one column using datetime value

  • I have index for almost everything (mdate,mtime,latency)

  • ajusting work_mem parameters


The table itself has around 100M rows at total.



Looking at the execution plan it seems the nested loop (join) is the one taking most time since the subquery returns 40K rows (1 per minute during a whole month) and it shall be joined around to 10M rows for the specific '%ThePartialUrl%', resulting in 4 a billion loop.



Anything that could help me improve it?



I am using postgres 10.



Thanks and Regards










share|improve this question
















I have a very simple table storing the following fields per http request:




  • id serial

  • url varchar

  • date varchar (yyyy-mm-dd)

  • time varchar (hh:mm)

  • latency integer


And I am trying to calculate average latency per minute, however if i have 10 requests (9 of 1s and 1 of 100s) then average becomes useless. In order to solve that I was wondering about remove outliers.



So, basically what I want is to calculate the average excluding request which latency is bigger then 99.7%, naturally I want to group those requests per minute in order to follow "average" latency per minute.



My query is as follows:



select mdate,mtime,avg(latency)
from mtable
join (
select mdate,mtime,avg(latency) + 3*stddev(latency) as "uband"
from mtable
where url like '%ThePartialUrl%'
group by, mdate, mtime
) as t
on mdate=t.mdate and mtime=t.mtime and latency < t.uband
where url like '%ThePartialUrl%'
group by mdate,mtime
order by mdate, mtime


However it is taking too much time.
Things I've already know/done:




  • url like requires a full scan regardless of index

  • mdate and mtime could be one column using datetime value

  • I have index for almost everything (mdate,mtime,latency)

  • ajusting work_mem parameters


The table itself has around 100M rows at total.



Looking at the execution plan it seems the nested loop (join) is the one taking most time since the subquery returns 40K rows (1 per minute during a whole month) and it shall be joined around to 10M rows for the specific '%ThePartialUrl%', resulting in 4 a billion loop.



Anything that could help me improve it?



I am using postgres 10.



Thanks and Regards







postgresql query






share|improve this question















share|improve this question













share|improve this question




share|improve this question








edited Aug 5 '18 at 17:40







giscard.faria

















asked Aug 5 '18 at 17:32









giscard.fariagiscard.faria

1113




1113





bumped to the homepage by Community 3 mins ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.







bumped to the homepage by Community 3 mins ago


This question has answers that may be good or bad; the system has marked it active so that they can be reviewed.















  • You're looking for what's called a trimmed mean: stats.stackexchange.com/questions/16198/…. Also have a look at: stackoverflow.com/questions/2934372/…

    – Linas
    Aug 5 '18 at 18:31











  • Please show us the execution plan, perhaps using explain.depesz.com to host it.

    – jjanes
    Aug 5 '18 at 19:16



















  • You're looking for what's called a trimmed mean: stats.stackexchange.com/questions/16198/…. Also have a look at: stackoverflow.com/questions/2934372/…

    – Linas
    Aug 5 '18 at 18:31











  • Please show us the execution plan, perhaps using explain.depesz.com to host it.

    – jjanes
    Aug 5 '18 at 19:16

















You're looking for what's called a trimmed mean: stats.stackexchange.com/questions/16198/…. Also have a look at: stackoverflow.com/questions/2934372/…

– Linas
Aug 5 '18 at 18:31





You're looking for what's called a trimmed mean: stats.stackexchange.com/questions/16198/…. Also have a look at: stackoverflow.com/questions/2934372/…

– Linas
Aug 5 '18 at 18:31













Please show us the execution plan, perhaps using explain.depesz.com to host it.

– jjanes
Aug 5 '18 at 19:16





Please show us the execution plan, perhaps using explain.depesz.com to host it.

– jjanes
Aug 5 '18 at 19:16










1 Answer
1






active

oldest

votes


















0














It seems there is no alternative to solve the problem without using a sub query, what is a performance constraint in my case, so I decide move to another approach.



I gave up about the trimmed means (thanks @Linas) and move to a percentile_disc indicator. Once all I wanted was to give a measure of latency for each request I did a simple query as follow:



select 
mdate,
mtime,
percentile_disc(0.997) withing group (order by latency) as "response"
from mtable
where url like '%MyPartialUrl%
group by mdate,mtime
order by mdate,mtime


As this way I grab the biggest latency took to answer a request for 99.7% of all users. In fact I am grabbing several percentiles following six sigma approach to show how much slowness per million requests we are getting on production.



Query took no more than 8 seconds to be executed, using the trimmed means before it was around 90 minutes.



Thanks all.
Regards






share|improve this answer























    Your Answer








    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "182"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });














    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f214101%2fremoving-outliers-grouping-by-minute%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    0














    It seems there is no alternative to solve the problem without using a sub query, what is a performance constraint in my case, so I decide move to another approach.



    I gave up about the trimmed means (thanks @Linas) and move to a percentile_disc indicator. Once all I wanted was to give a measure of latency for each request I did a simple query as follow:



    select 
    mdate,
    mtime,
    percentile_disc(0.997) withing group (order by latency) as "response"
    from mtable
    where url like '%MyPartialUrl%
    group by mdate,mtime
    order by mdate,mtime


    As this way I grab the biggest latency took to answer a request for 99.7% of all users. In fact I am grabbing several percentiles following six sigma approach to show how much slowness per million requests we are getting on production.



    Query took no more than 8 seconds to be executed, using the trimmed means before it was around 90 minutes.



    Thanks all.
    Regards






    share|improve this answer




























      0














      It seems there is no alternative to solve the problem without using a sub query, what is a performance constraint in my case, so I decide move to another approach.



      I gave up about the trimmed means (thanks @Linas) and move to a percentile_disc indicator. Once all I wanted was to give a measure of latency for each request I did a simple query as follow:



      select 
      mdate,
      mtime,
      percentile_disc(0.997) withing group (order by latency) as "response"
      from mtable
      where url like '%MyPartialUrl%
      group by mdate,mtime
      order by mdate,mtime


      As this way I grab the biggest latency took to answer a request for 99.7% of all users. In fact I am grabbing several percentiles following six sigma approach to show how much slowness per million requests we are getting on production.



      Query took no more than 8 seconds to be executed, using the trimmed means before it was around 90 minutes.



      Thanks all.
      Regards






      share|improve this answer


























        0












        0








        0







        It seems there is no alternative to solve the problem without using a sub query, what is a performance constraint in my case, so I decide move to another approach.



        I gave up about the trimmed means (thanks @Linas) and move to a percentile_disc indicator. Once all I wanted was to give a measure of latency for each request I did a simple query as follow:



        select 
        mdate,
        mtime,
        percentile_disc(0.997) withing group (order by latency) as "response"
        from mtable
        where url like '%MyPartialUrl%
        group by mdate,mtime
        order by mdate,mtime


        As this way I grab the biggest latency took to answer a request for 99.7% of all users. In fact I am grabbing several percentiles following six sigma approach to show how much slowness per million requests we are getting on production.



        Query took no more than 8 seconds to be executed, using the trimmed means before it was around 90 minutes.



        Thanks all.
        Regards






        share|improve this answer













        It seems there is no alternative to solve the problem without using a sub query, what is a performance constraint in my case, so I decide move to another approach.



        I gave up about the trimmed means (thanks @Linas) and move to a percentile_disc indicator. Once all I wanted was to give a measure of latency for each request I did a simple query as follow:



        select 
        mdate,
        mtime,
        percentile_disc(0.997) withing group (order by latency) as "response"
        from mtable
        where url like '%MyPartialUrl%
        group by mdate,mtime
        order by mdate,mtime


        As this way I grab the biggest latency took to answer a request for 99.7% of all users. In fact I am grabbing several percentiles following six sigma approach to show how much slowness per million requests we are getting on production.



        Query took no more than 8 seconds to be executed, using the trimmed means before it was around 90 minutes.



        Thanks all.
        Regards







        share|improve this answer












        share|improve this answer



        share|improve this answer










        answered Aug 6 '18 at 0:13









        giscard.fariagiscard.faria

        1113




        1113






























            draft saved

            draft discarded




















































            Thanks for contributing an answer to Database Administrators Stack Exchange!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fdba.stackexchange.com%2fquestions%2f214101%2fremoving-outliers-grouping-by-minute%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            الفوسفات في المغرب

            Four equal circles intersect: What is the area of the small shaded portion and its height

            بطل الاتحاد السوفيتي