How can you handle unstable β estimates in linear regression with high multicollinearity?

Say that in a linear regression, the variables x1 and x2 have high multicollinearity (their correlation is around 0.9).

We are concerned about the stability of the coefficient β, so we have to deal with the multicollinearity.

The textbook solution would be to just throw away one of the variables.

But we don't want to lose useful information by simply throwing variables away.

Any suggestions?

Luna
Have you tried some kind of regularization scheme (e.g., ridge regression)?
Néstor

Answers:


You can try the ridge regression approach in the case where the correlation matrix is nearly singular (i.e., the variables have high correlations). It will provide a robust estimate of β.
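For reference (this is the standard definition of the ridge estimator, not something stated explicitly in this answer): ridge regression adds a penalty λ to the normal equations,

$$\hat{\beta}_{\text{ridge}} = (X^{T}X + \lambda I)^{-1} X^{T} y,$$

so a nearly singular $X^{T}X$ becomes well conditioned as soon as $\lambda > 0$, which is what stabilizes the coefficient estimates.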

The only question is how to choose the regularization parameter λ. That is not a simple problem, though I suggest trying different values.

Hope this helps!

Paulo
Cross-validation is the usual thing to do to choose λ ;-).
Néstor
Indeed (+1 for the answer and Néstor's comment), and if you perform the calculations in "canonical form" (using an eigendecomposition of $X^{T}X$), you can find the λ that minimizes the leave-one-out cross-validation error very cheaply via Newton's method.
Dikran Marsupial
Thanks a lot! Any tutorials/notes on how to do this, including the cross-validation, in R?
Luna
Check out chapter 3 of this book: stanford.edu/~hastie/local.ftp/Springer/ESLII_print5.pdf. Ridge regression is implemented in R by some of the authors (Google is your friend!).
Néstor
You can use the lm.ridge routine in the MASS package. If you pass it a range of values for λ, e.g., a call like foo <- lm.ridge(y~x1+x2, lambda=seq(0,10,by=0.1)), you get back the generalized cross-validation statistics in foo$GCV and can plot them against λ with plot(foo$GCV~foo$lambda) to pick the minimum.
jbowman
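To make the workflow from the comments above concrete, here is a minimal R sketch (the simulated x1, x2, y and the λ grid are purely illustrative, not from the thread):

```r
library(MASS)

# Simulated data with two highly correlated predictors (illustration only)
set.seed(1)
n  <- 100
x1 <- rnorm(n)
x2 <- 0.9 * x1 + sqrt(1 - 0.9^2) * rnorm(n)  # correlation with x1 around 0.9
y  <- 1 + 2 * x1 + 2 * x2 + rnorm(n)

# Ridge regression over a grid of lambda values
foo <- lm.ridge(y ~ x1 + x2, lambda = seq(0, 10, by = 0.1))

# Plot the generalized cross-validation score against lambda
plot(foo$GCV ~ foo$lambda, type = "l",
     xlab = expression(lambda), ylab = "GCV")

# Lambda that minimizes GCV, and the coefficients at that value
best <- which.min(foo$GCV)
foo$lambda[best]
coef(foo)[best, ]
```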

Well, there is an ad hoc method I've used before. I'm not sure whether this procedure has a name, but it makes intuitive sense.

Suppose your goal is to fit the model

$$Y_i = \beta_0 + \beta_1 X_i + \beta_2 Z_i + \varepsilon_i$$

where the two predictors, $X_i, Z_i$, are highly correlated. As you've pointed out, using them in the same model can do strange things to the coefficient estimates and $p$-values. An alternative is to fit the model

$$Z_i = \alpha_0 + \alpha_1 X_i + \eta_i$$

Then the residual $\eta_i$ will be uncorrelated with $X_i$ and can, in some sense, be thought of as the part of $Z_i$ that is not subsumed by its linear relationship with $X_i$. You can then proceed to fit the model

$$Y_i = \theta_0 + \theta_1 X_i + \theta_2 \eta_i + \nu_i$$

which will capture all of the effects of the first model (and will, in fact, have exactly the same $R^2$ as the first model), but the predictors are no longer collinear.
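A minimal R sketch of this two-step fit (the simulated x, z, y are only for illustration; they are not part of the original answer):

```r
# Simulated data: z is highly correlated with x (illustration only)
set.seed(1)
n <- 200
x <- rnorm(n)
z <- 0.9 * x + sqrt(1 - 0.9^2) * rnorm(n)
y <- 1 + 2 * x + 3 * z + rnorm(n)

# Step 1: regress z on x; the residuals are the part of z not explained
# by its linear relationship with x
eta <- residuals(lm(z ~ x))

# Step 2: fit y on x and the residualized z
fit_collinear    <- lm(y ~ x + z)    # original model with collinear predictors
fit_residualized <- lm(y ~ x + eta)  # same fit, but predictors are uncorrelated

summary(fit_collinear)$r.squared     # identical R^2 ...
summary(fit_residualized)$r.squared  # ... in both models
cor(x, eta)                          # essentially zero
```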

Edit: The OP asked for an explanation of why the residuals do not, by definition, have zero sample correlation with the predictor when you omit the intercept, the way they do when the intercept is included. Since this is too long to post in the comments, I've made an edit here. This derivation is not particularly enlightening (unfortunately I couldn't come up with a reasonable intuitive argument), but it does show what the OP requested:

When the intercept is omitted in simple linear regression, $\hat{\beta} = \frac{\sum x_i y_i}{\sum x_i^2}$, so $e_i = y_i - x_i \frac{\sum x_i y_i}{\sum x_i^2}$. The sample correlation between $x_i$ and $e_i$ is proportional to $\overline{xe} - \bar{x}\bar{e}$, where the bars denote sample averages.

First we have

$$\overline{xe} = \frac{1}{n}\left( \sum x_i y_i - \sum x_i^2 \cdot \frac{\sum x_i y_i}{\sum x_i^2} \right) = \overline{xy}\left(1 - \frac{\sum x_i^2}{\sum x_i^2}\right) = 0$$

but

$$\bar{x}\bar{e} = \bar{x}\left(\bar{y} - \bar{x}\,\frac{\overline{xy}}{\overline{x^2}}\right) = \bar{x}\bar{y} - \bar{x}^2\,\frac{\overline{xy}}{\overline{x^2}}$$

so in order for the $e_i$ and $x_i$ to have a sample correlation of exactly 0, we need $\bar{x}\bar{e}$ to be 0. That is, we need

$$\bar{y} = \bar{x}\,\frac{\overline{xy}}{\overline{x^2}}$$

which does not hold in general for two arbitrary sets of data $x, y$.
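A quick numerical check of this in R (arbitrary simulated x and y, just for illustration):

```r
# With an intercept, the residuals are uncorrelated with x by construction;
# without an intercept, they generally are not.
set.seed(1)
x <- rnorm(50, mean = 2)          # nonzero mean so the effect is visible
y <- 1 + 0.5 * x + rnorm(50)

cor(x, residuals(lm(y ~ x)))      # essentially 0 (numerical noise only)
cor(x, residuals(lm(y ~ x - 1)))  # generally nonzero when the intercept is dropped
```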

Macro
This reminds me of partial regression plots.
Andy W
This sounds like an approximation to replacing (X,Z) by their principal components.
whuber
One thing I had in mind is that PCA generalizes easily to more than two variables. Another is that it treats X and Z symmetrically, whereas your proposal appears arbitrarily to single out one of these variables. Another thought was that PCA provides a disciplined way to reduce the number of variables (although one must be cautious about that, because a small principal component may be highly correlated with the dependent variable).
whuber
Hi Macro, Thank you for the excellent proof. Yeah now I understand it. When we talk about sample correlation between x and residuals, it requires the intercept term to be included for the sample correlation to be 0. On the other hand, when we talk about orthogonality between x and residuals, it doesn't require the intercept term to be included, for the orthogonality to hold.
Luna
@Luna, I don't particularly disagree with using ridge regression - this was just what first occurred to me (I answered before that was suggested). One thing I can say is that ridge regression estimates are biased, so, in some sense, you're actually estimating a slightly different (shrunken) quantity than you are with ordinary regression, making the interpretation of the coefficients perhaps more challenging (as gung alludes to). Also, what I've described here only requires an understanding of basic linear regression and may be more intuitively appealing to some.
Macro

I like both of the answers given thus far. Let me add a few things.

Another option is that you can also combine the variables. This is done by standardizing both (i.e., turning them into z-scores), averaging them, and then fitting your model with only the composite variable. This would be a good approach when you believe they are two different measures of the same underlying construct. In that case, you have two measurements that are contaminated with error. The most likely true value for the variable you really care about is in between them, thus averaging them gives a more accurate estimate. You standardize them first to put them on the same scale, so that nominal issues don't contaminate the result (e.g., you wouldn't want to average several temperature measurements if some are Fahrenheit and some are Celsius). Of course, if they are already on the same scale (e.g., several highly-correlated public opinion polls), you can skip that step. If you think one of your variables might be more accurate than the other, you could do a weighted average (perhaps using the reciprocals of the measurement errors).
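Here is a minimal R sketch of this composite-variable approach (the simulated construct and the two noisy measurements of it are purely illustrative):

```r
# Two noisy, differently scaled measurements of the same underlying construct
# (simulated for illustration only).
set.seed(1)
n      <- 100
latent <- rnorm(n)                          # the construct we actually care about
x1     <- latent + rnorm(n, sd = 0.3)
x2     <- 2 * latent + rnorm(n, sd = 0.6)   # same construct, different scale
y      <- 1 + latent + rnorm(n)

# Standardize both measurements (z-scores), average them, and fit the composite
composite <- (scale(x1) + scale(x2)) / 2
fit <- lm(y ~ composite)
summary(fit)
```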

If your variables are just different measures of the same construct, and are sufficiently highly correlated, you really could just throw one out without losing much information. As an example, I was actually in a situation once, where I wanted to use a covariate to absorb some of the error variance and boost power, but where I didn't care about that covariate--it wasn't germane substantively. I had several options available and they were all correlated with each other r>.98. I basically picked one at random and moved on, and it worked fine. I suspect I would have lost power burning two extra degrees of freedom if I had included the others as well by using some other strategy. Of course, I could have combined them, but why bother? However, this depends critically on the fact that your variables are correlated because they are two different versions of the same thing; if there's a different reason they are correlated, this could be totally inappropriate.

As that implies, I suggest you think about what lies behind your correlated variables. That is, you need a theory of why they're so highly correlated to do the best job of picking which strategy to use. In addition to different measures of the same latent variable, some other possibilities are a causal chain (i.e., X1 → X2 → Y) and more complicated situations in which your variables are the result of multiple causal forces, some of which are the same for both. Perhaps the most extreme case is that of a suppressor variable, which @whuber describes in his comment below. @Macro's suggestion, for instance, assumes that you are primarily interested in X and wonder about the additional contribution of Z after having accounted for X's contribution. Thus, thinking about why your variables are correlated and what you want to know will help you decide which (i.e., x1 or x2) should be treated as X and which Z. The key is to use theoretical insight to inform your choice.

I agree that ridge regression is arguably better, because it allows you to use the variables you had originally intended and is likely to yield betas that are very close to their true values (although they will be biased--see here or here for more information). Nonetheless, I think it also has two potential downsides: it is more complicated (requiring more statistical sophistication), and the resulting model is more difficult to interpret, in my opinion.

I gather that perhaps the ultimate approach would be to fit a structural equation model. That's because it would allow you to formulate the exact set of relationships you believe to be operative, including latent variables. However, I don't know SEM well enough to say anything about it here, other than to mention the possibility. (I also suspect it would be overkill in the situation you describe with just two covariates.)

gung - Reinstate Monica
Re the first point: Let vector X1 have a range of values and let vector e have small values completely uncorrelated with X1, so that X2 = X1 + e is highly correlated with X1. Set Y = e. In the regression of Y against either X1 or X2 you will see no significant or important results. In the regression of Y against X1 and X2 you will get an extremely good fit, because Y = X2 − X1. Thus, if you throw out either of X1 or X2, you will have lost essentially all information about Y. Whence, "highly correlated" does not mean "have equivalent information about Y".
whuber
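A quick simulation of the construction in the comment above (illustrative numbers only):

```r
# X2 = X1 + e with e small and uncorrelated with X1, and Y = e.
set.seed(1)
n  <- 100
x1 <- rnorm(n, sd = 5)       # X1 has a wide range of values
e  <- rnorm(n, sd = 0.1)     # small values, uncorrelated with X1
x2 <- x1 + e                 # highly correlated with X1
y  <- e

cor(x1, x2)                          # close to 1
summary(lm(y ~ x1))$r.squared        # essentially nothing
summary(lm(y ~ x2))$r.squared        # essentially nothing
summary(lm(y ~ x1 + x2))$r.squared   # essentially 1, since y = x2 - x1
```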
Thanks a lot Gung! Q1. Why does this approach work: "This is done by standardizing both (i.e., turning them into z-scores), averaging them, and then fitting your model with only the composite variable."? Q2. Why would Ridge Regression be better? Q3. Why would SEM be better? Anybody please shed some light on this? Thank you!
Luna
Hi Luna, glad to help. I'm actually going to re-edit this; @whuber was more right than I had initially realized. I'll try to put in more to help w/ your additional questions, but it'll take a lot, so it might be a while. We'll see how it goes.
gung - Reinstate Monica