What if my linear regression data contains several co-mingled linear relationships?

Let's say I am studying how daffodils respond to various soil conditions. I have collected data on the pH of the soil versus the mature height of the daffodil. I'm expecting a linear relationship, so I go about running a linear regression.

However, I didn't realize when I started my study that the population actually contains two varieties of daffodil, each of which responds very differently to soil pH. So the graph contains two distinct linear relationships:

soil pH vs flower height (cm)

I can eyeball it and separate it manually, of course. But I wonder if there is a more rigorous approach.

Questions:

Is there a statistical test to determine whether a data set would be better fit by a single line or by N lines?

How would I run a linear regression to fit the N lines? In other words, how do I disentangle the co-mingled data?

I can think of some combinatorial approaches, but they seem computationally expensive.

asked 2 hours ago

SlowMagic

1111

New contributor

$begingroup$
related stats.stackexchange.com/questions/245902/…
$endgroup$
– rep_ho
2 hours ago

1

$begingroup$
do you know which dafodil is which varaiety? If so, then you can just include that information into your model
$endgroup$
– rep_ho
2 hours ago

1

$begingroup$
This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
$endgroup$
– rolando2
1 hour ago

add a comment |

soil pH vs flower height (cm)

I can eyeball it and separate it manually, of course. But I wonder if there is a more rigorous approach.

Questions:

Is there a statistical test to determine whether a data set would be better fit by a single line or by N lines?

How would I run a linear regression to fit the N lines? In other words, how do I disentangle the co-mingled data?

I can think of some combinatorial approaches, but they seem computationally expensive.

asked 2 hours ago

SlowMagic

1111

New contributor

$begingroup$
related stats.stackexchange.com/questions/245902/…
$endgroup$
– rep_ho
2 hours ago

1

$begingroup$
do you know which dafodil is which varaiety? If so, then you can just include that information into your model
$endgroup$
– rep_ho
2 hours ago

1

$begingroup$
This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
$endgroup$
– rolando2
1 hour ago

add a comment |

soil pH vs flower height (cm)

I can eyeball it and separate it manually, of course. But I wonder if there is a more rigorous approach.

Questions:

Is there a statistical test to determine whether a data set would be better fit by a single line or by N lines?

How would I run a linear regression to fit the N lines? In other words, how do I disentangle the co-mingled data?

I can think of some combinatorial approaches, but they seem computationally expensive.

asked 2 hours ago

SlowMagic

1111

New contributor

soil pH vs flower height (cm)

I can eyeball it and separate it manually, of course. But I wonder if there is a more rigorous approach.

Questions:

Is there a statistical test to determine whether a data set would be better fit by a single line or by N lines?

How would I run a linear regression to fit the N lines? In other words, how do I disentangle the co-mingled data?

I can think of some combinatorial approaches, but they seem computationally expensive.

regression linear-model dataset

asked 2 hours ago

SlowMagic

1111

New contributor

asked 2 hours ago

SlowMagic

1111

New contributor

asked 2 hours ago

SlowMagic

1111

New contributor

asked 2 hours ago

SlowMagic

1111

asked 2 hours ago

SlowMagic

1111

New contributor

SlowMagic is a new contributor to this site. Take care in asking for clarification, commenting, and answering.
Check out our Code of Conduct.

$begingroup$
related stats.stackexchange.com/questions/245902/…
$endgroup$
– rep_ho
2 hours ago

1

$begingroup$
do you know which dafodil is which varaiety? If so, then you can just include that information into your model
$endgroup$
– rep_ho
2 hours ago

1

$begingroup$
This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
$endgroup$
– rolando2
1 hour ago

add a comment |

$begingroup$
related stats.stackexchange.com/questions/245902/…
$endgroup$
– rep_ho
2 hours ago

1

$begingroup$
do you know which dafodil is which varaiety? If so, then you can just include that information into your model
$endgroup$
– rep_ho
2 hours ago

1

$begingroup$
This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
$endgroup$
– rolando2
1 hour ago

related stats.stackexchange.com/questions/245902/…

– rep_ho
2 hours ago

do you know which dafodil is which varaiety? If so, then you can just include that information into your model

– rep_ho
2 hours ago

This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.

– rolando2
1 hour ago

add a comment |

1 Answer
1

active

oldest

votes

If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.

Your model will look like

$$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$

Here is an example in R. I've generated some data that looks like this:

enter image description here

Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.

library(tidyverse)



#Simulate the data

N = 1000

ph = runif(N,5,8)

species = rbinom(N,1,0.5)



y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)

y = as.numeric(y)



d = data_frame(ph = ph, species = species, y = y)



#Estimate the model

model = lm(y~species*ph, data = d)

summary(model)

And the result is

> summary(model)



Call:

lm(formula = y ~ species * ph, data = d)



Residuals:

     Min       1Q   Median       3Q      Max 

-1.61884 -0.31976 -0.00226  0.33521  1.46428 



Coefficients:

            Estimate Std. Error t value Pr(>|t|)    

(Intercept) 19.85850    0.17484  113.58   <2e-16 ***

species     20.31363    0.24626   82.49   <2e-16 ***

ph           1.01599    0.02671   38.04   <2e-16 ***

species:ph  -3.03174    0.03756  -80.72   <2e-16 ***

---

Signif. codes:  

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



Residual standard error: 0.4997 on 996 degrees of freedom

Multiple R-squared:  0.8844,    Adjusted R-squared:  0.8841 

F-statistic:  2541 on 3 and 996 DF,  p-value: < 2.2e-16

For species labeled 0, the line is approximately

$$ y = 19 + 1cdot text{PH}$$

For species labeled 1, the line is approximately

$$ y = 40 - 2 cdot text{PH} $$

edited 1 hour ago

answered 2 hours ago

Demetri Pananos

1,069317

add a comment |

Your Answer

StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\$","\$"]]);
});
});
}, "mathjax-editing");

StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "65"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);

StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});

function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});

}
});

SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f389545%2fwhat-if-my-linear-regression-data-contains-several-co-mingled-linear-relationshi%23new-answer', 'question_page');
}
);

Post as a guest

Name

Required, but never shown

1 Answer
1

active

oldest

votes

1 Answer
1

active

oldest

votes

If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.

Your model will look like

$$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$

Here is an example in R. I've generated some data that looks like this:

enter image description here

Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.

library(tidyverse)



#Simulate the data

N = 1000

ph = runif(N,5,8)

species = rbinom(N,1,0.5)



y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)

y = as.numeric(y)



d = data_frame(ph = ph, species = species, y = y)



#Estimate the model

model = lm(y~species*ph, data = d)

summary(model)

And the result is

> summary(model)



Call:

lm(formula = y ~ species * ph, data = d)



Residuals:

     Min       1Q   Median       3Q      Max 

-1.61884 -0.31976 -0.00226  0.33521  1.46428 



Coefficients:

            Estimate Std. Error t value Pr(>|t|)    

(Intercept) 19.85850    0.17484  113.58   <2e-16 ***

species     20.31363    0.24626   82.49   <2e-16 ***

ph           1.01599    0.02671   38.04   <2e-16 ***

species:ph  -3.03174    0.03756  -80.72   <2e-16 ***

---

Signif. codes:  

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



Residual standard error: 0.4997 on 996 degrees of freedom

Multiple R-squared:  0.8844,    Adjusted R-squared:  0.8841 

F-statistic:  2541 on 3 and 996 DF,  p-value: < 2.2e-16

For species labeled 0, the line is approximately

$$ y = 19 + 1cdot text{PH}$$

For species labeled 1, the line is approximately

$$ y = 40 - 2 cdot text{PH} $$

edited 1 hour ago

answered 2 hours ago

Demetri Pananos

1,069317

add a comment |

If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.

Your model will look like

$$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$

Here is an example in R. I've generated some data that looks like this:

enter image description here

Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.

library(tidyverse)



#Simulate the data

N = 1000

ph = runif(N,5,8)

species = rbinom(N,1,0.5)



y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)

y = as.numeric(y)



d = data_frame(ph = ph, species = species, y = y)



#Estimate the model

model = lm(y~species*ph, data = d)

summary(model)

And the result is

> summary(model)



Call:

lm(formula = y ~ species * ph, data = d)



Residuals:

     Min       1Q   Median       3Q      Max 

-1.61884 -0.31976 -0.00226  0.33521  1.46428 



Coefficients:

            Estimate Std. Error t value Pr(>|t|)    

(Intercept) 19.85850    0.17484  113.58   <2e-16 ***

species     20.31363    0.24626   82.49   <2e-16 ***

ph           1.01599    0.02671   38.04   <2e-16 ***

species:ph  -3.03174    0.03756  -80.72   <2e-16 ***

---

Signif. codes:  

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



Residual standard error: 0.4997 on 996 degrees of freedom

Multiple R-squared:  0.8844,    Adjusted R-squared:  0.8841 

F-statistic:  2541 on 3 and 996 DF,  p-value: < 2.2e-16

For species labeled 0, the line is approximately

$$ y = 19 + 1cdot text{PH}$$

For species labeled 1, the line is approximately

$$ y = 40 - 2 cdot text{PH} $$

edited 1 hour ago

answered 2 hours ago

Demetri Pananos

1,069317

add a comment |

If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.

Your model will look like

$$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$

Here is an example in R. I've generated some data that looks like this:

enter image description here

Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.

library(tidyverse)



#Simulate the data

N = 1000

ph = runif(N,5,8)

species = rbinom(N,1,0.5)



y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)

y = as.numeric(y)



d = data_frame(ph = ph, species = species, y = y)



#Estimate the model

model = lm(y~species*ph, data = d)

summary(model)

And the result is

> summary(model)



Call:

lm(formula = y ~ species * ph, data = d)



Residuals:

     Min       1Q   Median       3Q      Max 

-1.61884 -0.31976 -0.00226  0.33521  1.46428 



Coefficients:

            Estimate Std. Error t value Pr(>|t|)    

(Intercept) 19.85850    0.17484  113.58   <2e-16 ***

species     20.31363    0.24626   82.49   <2e-16 ***

ph           1.01599    0.02671   38.04   <2e-16 ***

species:ph  -3.03174    0.03756  -80.72   <2e-16 ***

---

Signif. codes:  

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



Residual standard error: 0.4997 on 996 degrees of freedom

Multiple R-squared:  0.8844,    Adjusted R-squared:  0.8841 

F-statistic:  2541 on 3 and 996 DF,  p-value: < 2.2e-16

For species labeled 0, the line is approximately

$$ y = 19 + 1cdot text{PH}$$

For species labeled 1, the line is approximately

$$ y = 40 - 2 cdot text{PH} $$

edited 1 hour ago

answered 2 hours ago

Demetri Pananos

1,069317

If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.

Your model will look like

$$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$

Here is an example in R. I've generated some data that looks like this:

enter image description here

Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.

library(tidyverse)



#Simulate the data

N = 1000

ph = runif(N,5,8)

species = rbinom(N,1,0.5)



y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)

y = as.numeric(y)



d = data_frame(ph = ph, species = species, y = y)



#Estimate the model

model = lm(y~species*ph, data = d)

summary(model)

And the result is

> summary(model)



Call:

lm(formula = y ~ species * ph, data = d)



Residuals:

     Min       1Q   Median       3Q      Max 

-1.61884 -0.31976 -0.00226  0.33521  1.46428 



Coefficients:

            Estimate Std. Error t value Pr(>|t|)    

(Intercept) 19.85850    0.17484  113.58   <2e-16 ***

species     20.31363    0.24626   82.49   <2e-16 ***

ph           1.01599    0.02671   38.04   <2e-16 ***

species:ph  -3.03174    0.03756  -80.72   <2e-16 ***

---

Signif. codes:  

0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1



Residual standard error: 0.4997 on 996 degrees of freedom

Multiple R-squared:  0.8844,    Adjusted R-squared:  0.8841 

F-statistic:  2541 on 3 and 996 DF,  p-value: < 2.2e-16

For species labeled 0, the line is approximately

$$ y = 19 + 1cdot text{PH}$$

For species labeled 1, the line is approximately

$$ y = 40 - 2 cdot text{PH} $$

edited 1 hour ago

answered 2 hours ago

Demetri Pananos

1,069317

edited 1 hour ago

answered 2 hours ago

Demetri Pananos

1,069317

answered 2 hours ago

Demetri Pananos

1,069317

answered 2 hours ago

Demetri Pananos

1,069317

add a comment |

SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.

draft saved

draft discarded

SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.

Thanks for contributing an answer to Cross Validated!

Please be sure to answer the question. Provide details and share your research!

But avoid …

Asking for help, clarification, or responding to other answers.

Making statements based on opinion; back them up with references or personal experience.

Use MathJax to format equations. MathJax reference.

To learn more, see our tips on writing great answers.

draft saved

draft discarded

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Sign up or log in

StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});

Post as a guest

Name

Required, but never shown

Name

Required, but never shown

Name

Required, but never shown

This page is only for reference, If you need detailed information, please check here

pWB5OO

搜尋此網誌

Xrhrft