What if my linear regression data contains several co-mingled linear relationships?












2












$begingroup$


Let's say I am studying how daffodils respond to various soil conditions. I have collected data on the pH of the soil versus the mature height of the daffodil. I'm expecting a linear relationship, so I go about running a linear regression.



However, I didn't realize when I started my study that the population actually contains two varieties of daffodil, each of which responds very differently to soil pH. So the graph contains two distinct linear relationships:



soil pH vs flower height (cm)



I can eyeball it and separate it manually, of course. But I wonder if there is a more rigorous approach.



Questions:




  1. Is there a statistical test to determine whether a data set would be better fit by a single line or by N lines?


  2. How would I run a linear regression to fit the N lines? In other words, how do I disentangle the co-mingled data?



I can think of some combinatorial approaches, but they seem computationally expensive.










share|cite|improve this question







New contributor




SlowMagic is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$












  • $begingroup$
    related stats.stackexchange.com/questions/245902/…
    $endgroup$
    – rep_ho
    2 hours ago






  • 1




    $begingroup$
    do you know which dafodil is which varaiety? If so, then you can just include that information into your model
    $endgroup$
    – rep_ho
    2 hours ago






  • 1




    $begingroup$
    This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
    $endgroup$
    – rolando2
    1 hour ago
















2












$begingroup$


Let's say I am studying how daffodils respond to various soil conditions. I have collected data on the pH of the soil versus the mature height of the daffodil. I'm expecting a linear relationship, so I go about running a linear regression.



However, I didn't realize when I started my study that the population actually contains two varieties of daffodil, each of which responds very differently to soil pH. So the graph contains two distinct linear relationships:



soil pH vs flower height (cm)



I can eyeball it and separate it manually, of course. But I wonder if there is a more rigorous approach.



Questions:




  1. Is there a statistical test to determine whether a data set would be better fit by a single line or by N lines?


  2. How would I run a linear regression to fit the N lines? In other words, how do I disentangle the co-mingled data?



I can think of some combinatorial approaches, but they seem computationally expensive.










share|cite|improve this question







New contributor




SlowMagic is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$












  • $begingroup$
    related stats.stackexchange.com/questions/245902/…
    $endgroup$
    – rep_ho
    2 hours ago






  • 1




    $begingroup$
    do you know which dafodil is which varaiety? If so, then you can just include that information into your model
    $endgroup$
    – rep_ho
    2 hours ago






  • 1




    $begingroup$
    This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
    $endgroup$
    – rolando2
    1 hour ago














2












2








2





$begingroup$


Let's say I am studying how daffodils respond to various soil conditions. I have collected data on the pH of the soil versus the mature height of the daffodil. I'm expecting a linear relationship, so I go about running a linear regression.



However, I didn't realize when I started my study that the population actually contains two varieties of daffodil, each of which responds very differently to soil pH. So the graph contains two distinct linear relationships:



soil pH vs flower height (cm)



I can eyeball it and separate it manually, of course. But I wonder if there is a more rigorous approach.



Questions:




  1. Is there a statistical test to determine whether a data set would be better fit by a single line or by N lines?


  2. How would I run a linear regression to fit the N lines? In other words, how do I disentangle the co-mingled data?



I can think of some combinatorial approaches, but they seem computationally expensive.










share|cite|improve this question







New contributor




SlowMagic is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.







$endgroup$




Let's say I am studying how daffodils respond to various soil conditions. I have collected data on the pH of the soil versus the mature height of the daffodil. I'm expecting a linear relationship, so I go about running a linear regression.



However, I didn't realize when I started my study that the population actually contains two varieties of daffodil, each of which responds very differently to soil pH. So the graph contains two distinct linear relationships:



soil pH vs flower height (cm)



I can eyeball it and separate it manually, of course. But I wonder if there is a more rigorous approach.



Questions:




  1. Is there a statistical test to determine whether a data set would be better fit by a single line or by N lines?


  2. How would I run a linear regression to fit the N lines? In other words, how do I disentangle the co-mingled data?



I can think of some combinatorial approaches, but they seem computationally expensive.







regression linear-model dataset






share|cite|improve this question







New contributor




SlowMagic is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.











share|cite|improve this question







New contributor




SlowMagic is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









share|cite|improve this question




share|cite|improve this question






New contributor




SlowMagic is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.









asked 2 hours ago









SlowMagicSlowMagic

1111




1111




New contributor




SlowMagic is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.





New contributor





SlowMagic is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.






SlowMagic is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.












  • $begingroup$
    related stats.stackexchange.com/questions/245902/…
    $endgroup$
    – rep_ho
    2 hours ago






  • 1




    $begingroup$
    do you know which dafodil is which varaiety? If so, then you can just include that information into your model
    $endgroup$
    – rep_ho
    2 hours ago






  • 1




    $begingroup$
    This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
    $endgroup$
    – rolando2
    1 hour ago


















  • $begingroup$
    related stats.stackexchange.com/questions/245902/…
    $endgroup$
    – rep_ho
    2 hours ago






  • 1




    $begingroup$
    do you know which dafodil is which varaiety? If so, then you can just include that information into your model
    $endgroup$
    – rep_ho
    2 hours ago






  • 1




    $begingroup$
    This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
    $endgroup$
    – rolando2
    1 hour ago
















$begingroup$
related stats.stackexchange.com/questions/245902/…
$endgroup$
– rep_ho
2 hours ago




$begingroup$
related stats.stackexchange.com/questions/245902/…
$endgroup$
– rep_ho
2 hours ago




1




1




$begingroup$
do you know which dafodil is which varaiety? If so, then you can just include that information into your model
$endgroup$
– rep_ho
2 hours ago




$begingroup$
do you know which dafodil is which varaiety? If so, then you can just include that information into your model
$endgroup$
– rep_ho
2 hours ago




1




1




$begingroup$
This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
$endgroup$
– rolando2
1 hour ago




$begingroup$
This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
$endgroup$
– rolando2
1 hour ago










1 Answer
1






active

oldest

votes


















3












$begingroup$

If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.



Your model will look like



$$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$



Here is an example in R. I've generated some data that looks like this:



enter image description here



Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.



library(tidyverse)

#Simulate the data
N = 1000
ph = runif(N,5,8)
species = rbinom(N,1,0.5)

y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)
y = as.numeric(y)

d = data_frame(ph = ph, species = species, y = y)

#Estimate the model
model = lm(y~species*ph, data = d)
summary(model)


And the result is



> summary(model)

Call:
lm(formula = y ~ species * ph, data = d)

Residuals:
Min 1Q Median 3Q Max
-1.61884 -0.31976 -0.00226 0.33521 1.46428

Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.85850 0.17484 113.58 <2e-16 ***
species 20.31363 0.24626 82.49 <2e-16 ***
ph 1.01599 0.02671 38.04 <2e-16 ***
species:ph -3.03174 0.03756 -80.72 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

Residual standard error: 0.4997 on 996 degrees of freedom
Multiple R-squared: 0.8844, Adjusted R-squared: 0.8841
F-statistic: 2541 on 3 and 996 DF, p-value: < 2.2e-16


For species labeled 0, the line is approximately



$$ y = 19 + 1cdot text{PH}$$



For species labeled 1, the line is approximately



$$ y = 40 - 2 cdot text{PH} $$






share|cite|improve this answer











$endgroup$













    Your Answer





    StackExchange.ifUsing("editor", function () {
    return StackExchange.using("mathjaxEditing", function () {
    StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
    StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
    });
    });
    }, "mathjax-editing");

    StackExchange.ready(function() {
    var channelOptions = {
    tags: "".split(" "),
    id: "65"
    };
    initTagRenderer("".split(" "), "".split(" "), channelOptions);

    StackExchange.using("externalEditor", function() {
    // Have to fire editor after snippets, if snippets enabled
    if (StackExchange.settings.snippets.snippetsEnabled) {
    StackExchange.using("snippets", function() {
    createEditor();
    });
    }
    else {
    createEditor();
    }
    });

    function createEditor() {
    StackExchange.prepareEditor({
    heartbeatType: 'answer',
    autoActivateHeartbeat: false,
    convertImagesToLinks: false,
    noModals: true,
    showLowRepImageUploadWarning: true,
    reputationToPostImages: null,
    bindNavPrevention: true,
    postfix: "",
    imageUploader: {
    brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
    contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
    allowUrls: true
    },
    onDemand: true,
    discardSelector: ".discard-answer"
    ,immediatelyShowMarkdownHelp:true
    });


    }
    });






    SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.










    draft saved

    draft discarded


















    StackExchange.ready(
    function () {
    StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f389545%2fwhat-if-my-linear-regression-data-contains-several-co-mingled-linear-relationshi%23new-answer', 'question_page');
    }
    );

    Post as a guest















    Required, but never shown

























    1 Answer
    1






    active

    oldest

    votes








    1 Answer
    1






    active

    oldest

    votes









    active

    oldest

    votes






    active

    oldest

    votes









    3












    $begingroup$

    If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.



    Your model will look like



    $$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$



    Here is an example in R. I've generated some data that looks like this:



    enter image description here



    Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.



    library(tidyverse)

    #Simulate the data
    N = 1000
    ph = runif(N,5,8)
    species = rbinom(N,1,0.5)

    y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)
    y = as.numeric(y)

    d = data_frame(ph = ph, species = species, y = y)

    #Estimate the model
    model = lm(y~species*ph, data = d)
    summary(model)


    And the result is



    > summary(model)

    Call:
    lm(formula = y ~ species * ph, data = d)

    Residuals:
    Min 1Q Median 3Q Max
    -1.61884 -0.31976 -0.00226 0.33521 1.46428

    Coefficients:
    Estimate Std. Error t value Pr(>|t|)
    (Intercept) 19.85850 0.17484 113.58 <2e-16 ***
    species 20.31363 0.24626 82.49 <2e-16 ***
    ph 1.01599 0.02671 38.04 <2e-16 ***
    species:ph -3.03174 0.03756 -80.72 <2e-16 ***
    ---
    Signif. codes:
    0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

    Residual standard error: 0.4997 on 996 degrees of freedom
    Multiple R-squared: 0.8844, Adjusted R-squared: 0.8841
    F-statistic: 2541 on 3 and 996 DF, p-value: < 2.2e-16


    For species labeled 0, the line is approximately



    $$ y = 19 + 1cdot text{PH}$$



    For species labeled 1, the line is approximately



    $$ y = 40 - 2 cdot text{PH} $$






    share|cite|improve this answer











    $endgroup$


















      3












      $begingroup$

      If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.



      Your model will look like



      $$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$



      Here is an example in R. I've generated some data that looks like this:



      enter image description here



      Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.



      library(tidyverse)

      #Simulate the data
      N = 1000
      ph = runif(N,5,8)
      species = rbinom(N,1,0.5)

      y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)
      y = as.numeric(y)

      d = data_frame(ph = ph, species = species, y = y)

      #Estimate the model
      model = lm(y~species*ph, data = d)
      summary(model)


      And the result is



      > summary(model)

      Call:
      lm(formula = y ~ species * ph, data = d)

      Residuals:
      Min 1Q Median 3Q Max
      -1.61884 -0.31976 -0.00226 0.33521 1.46428

      Coefficients:
      Estimate Std. Error t value Pr(>|t|)
      (Intercept) 19.85850 0.17484 113.58 <2e-16 ***
      species 20.31363 0.24626 82.49 <2e-16 ***
      ph 1.01599 0.02671 38.04 <2e-16 ***
      species:ph -3.03174 0.03756 -80.72 <2e-16 ***
      ---
      Signif. codes:
      0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

      Residual standard error: 0.4997 on 996 degrees of freedom
      Multiple R-squared: 0.8844, Adjusted R-squared: 0.8841
      F-statistic: 2541 on 3 and 996 DF, p-value: < 2.2e-16


      For species labeled 0, the line is approximately



      $$ y = 19 + 1cdot text{PH}$$



      For species labeled 1, the line is approximately



      $$ y = 40 - 2 cdot text{PH} $$






      share|cite|improve this answer











      $endgroup$
















        3












        3








        3





        $begingroup$

        If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.



        Your model will look like



        $$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$



        Here is an example in R. I've generated some data that looks like this:



        enter image description here



        Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.



        library(tidyverse)

        #Simulate the data
        N = 1000
        ph = runif(N,5,8)
        species = rbinom(N,1,0.5)

        y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)
        y = as.numeric(y)

        d = data_frame(ph = ph, species = species, y = y)

        #Estimate the model
        model = lm(y~species*ph, data = d)
        summary(model)


        And the result is



        > summary(model)

        Call:
        lm(formula = y ~ species * ph, data = d)

        Residuals:
        Min 1Q Median 3Q Max
        -1.61884 -0.31976 -0.00226 0.33521 1.46428

        Coefficients:
        Estimate Std. Error t value Pr(>|t|)
        (Intercept) 19.85850 0.17484 113.58 <2e-16 ***
        species 20.31363 0.24626 82.49 <2e-16 ***
        ph 1.01599 0.02671 38.04 <2e-16 ***
        species:ph -3.03174 0.03756 -80.72 <2e-16 ***
        ---
        Signif. codes:
        0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        Residual standard error: 0.4997 on 996 degrees of freedom
        Multiple R-squared: 0.8844, Adjusted R-squared: 0.8841
        F-statistic: 2541 on 3 and 996 DF, p-value: < 2.2e-16


        For species labeled 0, the line is approximately



        $$ y = 19 + 1cdot text{PH}$$



        For species labeled 1, the line is approximately



        $$ y = 40 - 2 cdot text{PH} $$






        share|cite|improve this answer











        $endgroup$



        If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.



        Your model will look like



        $$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$



        Here is an example in R. I've generated some data that looks like this:



        enter image description here



        Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.



        library(tidyverse)

        #Simulate the data
        N = 1000
        ph = runif(N,5,8)
        species = rbinom(N,1,0.5)

        y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)
        y = as.numeric(y)

        d = data_frame(ph = ph, species = species, y = y)

        #Estimate the model
        model = lm(y~species*ph, data = d)
        summary(model)


        And the result is



        > summary(model)

        Call:
        lm(formula = y ~ species * ph, data = d)

        Residuals:
        Min 1Q Median 3Q Max
        -1.61884 -0.31976 -0.00226 0.33521 1.46428

        Coefficients:
        Estimate Std. Error t value Pr(>|t|)
        (Intercept) 19.85850 0.17484 113.58 <2e-16 ***
        species 20.31363 0.24626 82.49 <2e-16 ***
        ph 1.01599 0.02671 38.04 <2e-16 ***
        species:ph -3.03174 0.03756 -80.72 <2e-16 ***
        ---
        Signif. codes:
        0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1

        Residual standard error: 0.4997 on 996 degrees of freedom
        Multiple R-squared: 0.8844, Adjusted R-squared: 0.8841
        F-statistic: 2541 on 3 and 996 DF, p-value: < 2.2e-16


        For species labeled 0, the line is approximately



        $$ y = 19 + 1cdot text{PH}$$



        For species labeled 1, the line is approximately



        $$ y = 40 - 2 cdot text{PH} $$







        share|cite|improve this answer














        share|cite|improve this answer



        share|cite|improve this answer








        edited 1 hour ago

























        answered 2 hours ago









        Demetri PananosDemetri Pananos

        1,069317




        1,069317






















            SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.










            draft saved

            draft discarded


















            SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.













            SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.












            SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.
















            Thanks for contributing an answer to Cross Validated!


            • Please be sure to answer the question. Provide details and share your research!

            But avoid



            • Asking for help, clarification, or responding to other answers.

            • Making statements based on opinion; back them up with references or personal experience.


            Use MathJax to format equations. MathJax reference.


            To learn more, see our tips on writing great answers.




            draft saved


            draft discarded














            StackExchange.ready(
            function () {
            StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f389545%2fwhat-if-my-linear-regression-data-contains-several-co-mingled-linear-relationshi%23new-answer', 'question_page');
            }
            );

            Post as a guest















            Required, but never shown





















































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown

































            Required, but never shown














            Required, but never shown












            Required, but never shown







            Required, but never shown







            Popular posts from this blog

            SQL Server 17 - Attemping to backup to remote NAS but Access is denied

            Always On Availability groups resolving state after failover - Remote harden of transaction...

            Restoring from pg_dump with foreign key constraints