What if my linear regression data contains several co-mingled linear relationships?
$begingroup$
Let's say I am studying how daffodils respond to various soil conditions. I have collected data on the pH of the soil versus the mature height of the daffodil. I'm expecting a linear relationship, so I go about running a linear regression.
However, I didn't realize when I started my study that the population actually contains two varieties of daffodil, each of which responds very differently to soil pH. So the graph contains two distinct linear relationships:
I can eyeball it and separate it manually, of course. But I wonder if there is a more rigorous approach.
Questions:
Is there a statistical test to determine whether a data set would be better fit by a single line or by N lines?
How would I run a linear regression to fit the N lines? In other words, how do I disentangle the co-mingled data?
I can think of some combinatorial approaches, but they seem computationally expensive.
regression linear-model dataset
New contributor
$endgroup$
add a comment |
$begingroup$
Let's say I am studying how daffodils respond to various soil conditions. I have collected data on the pH of the soil versus the mature height of the daffodil. I'm expecting a linear relationship, so I go about running a linear regression.
However, I didn't realize when I started my study that the population actually contains two varieties of daffodil, each of which responds very differently to soil pH. So the graph contains two distinct linear relationships:
I can eyeball it and separate it manually, of course. But I wonder if there is a more rigorous approach.
Questions:
Is there a statistical test to determine whether a data set would be better fit by a single line or by N lines?
How would I run a linear regression to fit the N lines? In other words, how do I disentangle the co-mingled data?
I can think of some combinatorial approaches, but they seem computationally expensive.
regression linear-model dataset
New contributor
$endgroup$
$begingroup$
related stats.stackexchange.com/questions/245902/…
$endgroup$
– rep_ho
2 hours ago
1
$begingroup$
do you know which dafodil is which varaiety? If so, then you can just include that information into your model
$endgroup$
– rep_ho
2 hours ago
1
$begingroup$
This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
$endgroup$
– rolando2
1 hour ago
add a comment |
$begingroup$
Let's say I am studying how daffodils respond to various soil conditions. I have collected data on the pH of the soil versus the mature height of the daffodil. I'm expecting a linear relationship, so I go about running a linear regression.
However, I didn't realize when I started my study that the population actually contains two varieties of daffodil, each of which responds very differently to soil pH. So the graph contains two distinct linear relationships:
I can eyeball it and separate it manually, of course. But I wonder if there is a more rigorous approach.
Questions:
Is there a statistical test to determine whether a data set would be better fit by a single line or by N lines?
How would I run a linear regression to fit the N lines? In other words, how do I disentangle the co-mingled data?
I can think of some combinatorial approaches, but they seem computationally expensive.
regression linear-model dataset
New contributor
$endgroup$
Let's say I am studying how daffodils respond to various soil conditions. I have collected data on the pH of the soil versus the mature height of the daffodil. I'm expecting a linear relationship, so I go about running a linear regression.
However, I didn't realize when I started my study that the population actually contains two varieties of daffodil, each of which responds very differently to soil pH. So the graph contains two distinct linear relationships:
I can eyeball it and separate it manually, of course. But I wonder if there is a more rigorous approach.
Questions:
Is there a statistical test to determine whether a data set would be better fit by a single line or by N lines?
How would I run a linear regression to fit the N lines? In other words, how do I disentangle the co-mingled data?
I can think of some combinatorial approaches, but they seem computationally expensive.
regression linear-model dataset
regression linear-model dataset
New contributor
New contributor
New contributor
asked 2 hours ago
SlowMagicSlowMagic
1111
1111
New contributor
New contributor
$begingroup$
related stats.stackexchange.com/questions/245902/…
$endgroup$
– rep_ho
2 hours ago
1
$begingroup$
do you know which dafodil is which varaiety? If so, then you can just include that information into your model
$endgroup$
– rep_ho
2 hours ago
1
$begingroup$
This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
$endgroup$
– rolando2
1 hour ago
add a comment |
$begingroup$
related stats.stackexchange.com/questions/245902/…
$endgroup$
– rep_ho
2 hours ago
1
$begingroup$
do you know which dafodil is which varaiety? If so, then you can just include that information into your model
$endgroup$
– rep_ho
2 hours ago
1
$begingroup$
This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
$endgroup$
– rolando2
1 hour ago
$begingroup$
related stats.stackexchange.com/questions/245902/…
$endgroup$
– rep_ho
2 hours ago
$begingroup$
related stats.stackexchange.com/questions/245902/…
$endgroup$
– rep_ho
2 hours ago
1
1
$begingroup$
do you know which dafodil is which varaiety? If so, then you can just include that information into your model
$endgroup$
– rep_ho
2 hours ago
$begingroup$
do you know which dafodil is which varaiety? If so, then you can just include that information into your model
$endgroup$
– rep_ho
2 hours ago
1
1
$begingroup$
This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
$endgroup$
– rolando2
1 hour ago
$begingroup$
This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
$endgroup$
– rolando2
1 hour ago
add a comment |
1 Answer
1
active
oldest
votes
$begingroup$
If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.
Your model will look like
$$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$
Here is an example in R. I've generated some data that looks like this:
Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.
library(tidyverse)
#Simulate the data
N = 1000
ph = runif(N,5,8)
species = rbinom(N,1,0.5)
y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)
y = as.numeric(y)
d = data_frame(ph = ph, species = species, y = y)
#Estimate the model
model = lm(y~species*ph, data = d)
summary(model)
And the result is
> summary(model)
Call:
lm(formula = y ~ species * ph, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.61884 -0.31976 -0.00226 0.33521 1.46428
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.85850 0.17484 113.58 <2e-16 ***
species 20.31363 0.24626 82.49 <2e-16 ***
ph 1.01599 0.02671 38.04 <2e-16 ***
species:ph -3.03174 0.03756 -80.72 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4997 on 996 degrees of freedom
Multiple R-squared: 0.8844, Adjusted R-squared: 0.8841
F-statistic: 2541 on 3 and 996 DF, p-value: < 2.2e-16
For species labeled 0, the line is approximately
$$ y = 19 + 1cdot text{PH}$$
For species labeled 1, the line is approximately
$$ y = 40 - 2 cdot text{PH} $$
$endgroup$
add a comment |
Your Answer
StackExchange.ifUsing("editor", function () {
return StackExchange.using("mathjaxEditing", function () {
StackExchange.MarkdownEditor.creationCallbacks.add(function (editor, postfix) {
StackExchange.mathjaxEditing.prepareWmdForMathJax(editor, postfix, [["$", "$"], ["\\(","\\)"]]);
});
});
}, "mathjax-editing");
StackExchange.ready(function() {
var channelOptions = {
tags: "".split(" "),
id: "65"
};
initTagRenderer("".split(" "), "".split(" "), channelOptions);
StackExchange.using("externalEditor", function() {
// Have to fire editor after snippets, if snippets enabled
if (StackExchange.settings.snippets.snippetsEnabled) {
StackExchange.using("snippets", function() {
createEditor();
});
}
else {
createEditor();
}
});
function createEditor() {
StackExchange.prepareEditor({
heartbeatType: 'answer',
autoActivateHeartbeat: false,
convertImagesToLinks: false,
noModals: true,
showLowRepImageUploadWarning: true,
reputationToPostImages: null,
bindNavPrevention: true,
postfix: "",
imageUploader: {
brandingHtml: "Powered by u003ca class="icon-imgur-white" href="https://imgur.com/"u003eu003c/au003e",
contentPolicyHtml: "User contributions licensed under u003ca href="https://creativecommons.org/licenses/by-sa/3.0/"u003ecc by-sa 3.0 with attribution requiredu003c/au003e u003ca href="https://stackoverflow.com/legal/content-policy"u003e(content policy)u003c/au003e",
allowUrls: true
},
onDemand: true,
discardSelector: ".discard-answer"
,immediatelyShowMarkdownHelp:true
});
}
});
SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f389545%2fwhat-if-my-linear-regression-data-contains-several-co-mingled-linear-relationshi%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
1 Answer
1
active
oldest
votes
1 Answer
1
active
oldest
votes
active
oldest
votes
active
oldest
votes
$begingroup$
If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.
Your model will look like
$$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$
Here is an example in R. I've generated some data that looks like this:
Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.
library(tidyverse)
#Simulate the data
N = 1000
ph = runif(N,5,8)
species = rbinom(N,1,0.5)
y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)
y = as.numeric(y)
d = data_frame(ph = ph, species = species, y = y)
#Estimate the model
model = lm(y~species*ph, data = d)
summary(model)
And the result is
> summary(model)
Call:
lm(formula = y ~ species * ph, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.61884 -0.31976 -0.00226 0.33521 1.46428
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.85850 0.17484 113.58 <2e-16 ***
species 20.31363 0.24626 82.49 <2e-16 ***
ph 1.01599 0.02671 38.04 <2e-16 ***
species:ph -3.03174 0.03756 -80.72 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4997 on 996 degrees of freedom
Multiple R-squared: 0.8844, Adjusted R-squared: 0.8841
F-statistic: 2541 on 3 and 996 DF, p-value: < 2.2e-16
For species labeled 0, the line is approximately
$$ y = 19 + 1cdot text{PH}$$
For species labeled 1, the line is approximately
$$ y = 40 - 2 cdot text{PH} $$
$endgroup$
add a comment |
$begingroup$
If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.
Your model will look like
$$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$
Here is an example in R. I've generated some data that looks like this:
Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.
library(tidyverse)
#Simulate the data
N = 1000
ph = runif(N,5,8)
species = rbinom(N,1,0.5)
y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)
y = as.numeric(y)
d = data_frame(ph = ph, species = species, y = y)
#Estimate the model
model = lm(y~species*ph, data = d)
summary(model)
And the result is
> summary(model)
Call:
lm(formula = y ~ species * ph, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.61884 -0.31976 -0.00226 0.33521 1.46428
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.85850 0.17484 113.58 <2e-16 ***
species 20.31363 0.24626 82.49 <2e-16 ***
ph 1.01599 0.02671 38.04 <2e-16 ***
species:ph -3.03174 0.03756 -80.72 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4997 on 996 degrees of freedom
Multiple R-squared: 0.8844, Adjusted R-squared: 0.8841
F-statistic: 2541 on 3 and 996 DF, p-value: < 2.2e-16
For species labeled 0, the line is approximately
$$ y = 19 + 1cdot text{PH}$$
For species labeled 1, the line is approximately
$$ y = 40 - 2 cdot text{PH} $$
$endgroup$
add a comment |
$begingroup$
If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.
Your model will look like
$$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$
Here is an example in R. I've generated some data that looks like this:
Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.
library(tidyverse)
#Simulate the data
N = 1000
ph = runif(N,5,8)
species = rbinom(N,1,0.5)
y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)
y = as.numeric(y)
d = data_frame(ph = ph, species = species, y = y)
#Estimate the model
model = lm(y~species*ph, data = d)
summary(model)
And the result is
> summary(model)
Call:
lm(formula = y ~ species * ph, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.61884 -0.31976 -0.00226 0.33521 1.46428
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.85850 0.17484 113.58 <2e-16 ***
species 20.31363 0.24626 82.49 <2e-16 ***
ph 1.01599 0.02671 38.04 <2e-16 ***
species:ph -3.03174 0.03756 -80.72 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4997 on 996 degrees of freedom
Multiple R-squared: 0.8844, Adjusted R-squared: 0.8841
F-statistic: 2541 on 3 and 996 DF, p-value: < 2.2e-16
For species labeled 0, the line is approximately
$$ y = 19 + 1cdot text{PH}$$
For species labeled 1, the line is approximately
$$ y = 40 - 2 cdot text{PH} $$
$endgroup$
If you know which sample comes from which variety of daffodil, you can estimate an interaction between variety and soil PH.
Your model will look like
$$ y = beta_0 + beta_1 text{variety} + beta_2text{PH} + beta_3text{variety}cdottext{PH} $$
Here is an example in R. I've generated some data that looks like this:
Clearly two different lines, and the lines correspond to two species. Here is how to estimate the lines using linear regression.
library(tidyverse)
#Simulate the data
N = 1000
ph = runif(N,5,8)
species = rbinom(N,1,0.5)
y = model.matrix(~ph*species)%*% c(20,1,20,-3) + rnorm(N, 0, 0.5)
y = as.numeric(y)
d = data_frame(ph = ph, species = species, y = y)
#Estimate the model
model = lm(y~species*ph, data = d)
summary(model)
And the result is
> summary(model)
Call:
lm(formula = y ~ species * ph, data = d)
Residuals:
Min 1Q Median 3Q Max
-1.61884 -0.31976 -0.00226 0.33521 1.46428
Coefficients:
Estimate Std. Error t value Pr(>|t|)
(Intercept) 19.85850 0.17484 113.58 <2e-16 ***
species 20.31363 0.24626 82.49 <2e-16 ***
ph 1.01599 0.02671 38.04 <2e-16 ***
species:ph -3.03174 0.03756 -80.72 <2e-16 ***
---
Signif. codes:
0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1
Residual standard error: 0.4997 on 996 degrees of freedom
Multiple R-squared: 0.8844, Adjusted R-squared: 0.8841
F-statistic: 2541 on 3 and 996 DF, p-value: < 2.2e-16
For species labeled 0, the line is approximately
$$ y = 19 + 1cdot text{PH}$$
For species labeled 1, the line is approximately
$$ y = 40 - 2 cdot text{PH} $$
edited 1 hour ago
answered 2 hours ago
Demetri PananosDemetri Pananos
1,069317
1,069317
add a comment |
add a comment |
SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.
SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.
SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.
SlowMagic is a new contributor. Be nice, and check out our Code of Conduct.
Thanks for contributing an answer to Cross Validated!
- Please be sure to answer the question. Provide details and share your research!
But avoid …
- Asking for help, clarification, or responding to other answers.
- Making statements based on opinion; back them up with references or personal experience.
Use MathJax to format equations. MathJax reference.
To learn more, see our tips on writing great answers.
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
StackExchange.ready(
function () {
StackExchange.openid.initPostLogin('.new-post-login', 'https%3a%2f%2fstats.stackexchange.com%2fquestions%2f389545%2fwhat-if-my-linear-regression-data-contains-several-co-mingled-linear-relationshi%23new-answer', 'question_page');
}
);
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Sign up or log in
StackExchange.ready(function () {
StackExchange.helpers.onClickDraftSave('#login-link');
});
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Sign up using Google
Sign up using Facebook
Sign up using Email and Password
Post as a guest
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
Required, but never shown
$begingroup$
related stats.stackexchange.com/questions/245902/…
$endgroup$
– rep_ho
2 hours ago
1
$begingroup$
do you know which dafodil is which varaiety? If so, then you can just include that information into your model
$endgroup$
– rep_ho
2 hours ago
1
$begingroup$
This seems a classic case of statistical interaction, as in @Demetri Pananos's answer.
$endgroup$
– rolando2
1 hour ago