The first sentence of this wiki page states that "In econometrics, an endogeneity problem occurs when an explanatory variable is correlated with the error term."
My question is: how can this ever happen? Isn't the regression beta chosen so that the error term is orthogonal to the column space of the design matrix?
regression
habitante do norte
Answers:
You are conflating two types of "error" term. Wikipedia actually has an article devoted to this distinction between errors and residuals.
In an OLS regression, the residuals $\hat\varepsilon$ (your estimates of the error or disturbance term) are indeed guaranteed to be uncorrelated with the predictor variables, assuming the regression contains an intercept term.
But the "true" errors $\varepsilon$ may well be correlated with them, and this is what counts as endogeneity.
To keep things simple, consider the regression model (you might see this described as the underlying "data generating process" or "DGP", the theoretical model that we assume generates the value of $y$):
$$y_i = \beta_1 + \beta_2 x_i + \varepsilon_i$$
There is no reason, in principle, why $x$ can't be correlated with $\varepsilon$ in our model, however much we would prefer not to breach the standard OLS assumptions in this way. For example, it might be that $y$ depends on another variable that has been omitted from our model, and this has been absorbed into the disturbance term (the $\varepsilon$ is where we lump together everything other than $x$ that affects $y$). If this omitted variable is also correlated with $x$, then $\varepsilon$ will in turn be correlated with $x$ and we have endogeneity (in particular, omitted-variable bias).
When you estimate your regression model on the available data, you get
$$y_i = \hat\beta_1 + \hat\beta_2 x_i + \hat\varepsilon_i$$
Because of the way OLS works (∗), the residuals $\hat\varepsilon$ will be uncorrelated with $x$. But that doesn't mean we have avoided endogeneity. It just means that we can't detect it by analysing the correlation between $\hat\varepsilon$ and $x$, which will be (up to numerical error) zero. And because the OLS assumptions have been breached, we are no longer guaranteed the nice properties, such as unbiasedness, that we like so much about OLS. Our estimate $\hat\beta_2$ will be biased.
(∗) The fact that $\hat\varepsilon$ is uncorrelated with $x$ follows immediately from the "normal equations" we use to choose our best estimates for the coefficients.
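A quick simulation can make the distinction between errors and residuals concrete. The sketch below is only an illustration: the coefficient values, the omitted variable `z`, and the sample size are all made-up assumptions. It builds a DGP in which the true error contains `z`, fits OLS with an intercept, and compares the correlation of `x` with the true errors against its correlation with the residuals.

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000

# Hypothetical DGP: y = beta1 + beta2 * x + eps, where eps absorbs an omitted variable z
beta1, beta2 = 1.0, 2.0
z = rng.normal(size=n)               # omitted variable
x = 0.8 * z + rng.normal(size=n)     # x is correlated with z
eps = 1.5 * z + rng.normal(size=n)   # true error contains z, so it is correlated with x
y = beta1 + beta2 * x + eps

# OLS fit of y on a constant and x
X = np.column_stack([np.ones(n), x])
beta_hat, *_ = np.linalg.lstsq(X, y, rcond=None)
resid = y - X @ beta_hat

corr_true = np.corrcoef(x, eps)[0, 1]     # clearly non-zero: endogeneity
corr_resid = np.corrcoef(x, resid)[0, 1]  # zero up to floating-point error

print(f"corr(x, true errors) = {corr_true:.3f}")
print(f"corr(x, residuals)   = {corr_resid:.2e}")
print(f"slope estimate       = {beta_hat[1]:.3f} (true beta2 = {beta2})")
```

Under these assumptions the slope estimate settles near $\beta_2 + \operatorname{Cov}(x,\varepsilon)/\operatorname{Var}(x) = 2 + 1.2/1.64$, not near $\beta_2$, even though the residuals look perfectly uncorrelated with `x`.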
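If you are not used to the matrix setting, and I stick to the bivariate model used in my example above, then the sum of squared residuals is $S(b_1, b_2) = \sum_{i=1}^n \varepsilon_i^2 = \sum_{i=1}^n (y_i - b_1 - b_2 x_i)^2$, and to find the optimal $b_1 = \hat\beta_1$ and $b_2 = \hat\beta_2$ that minimise this we find the normal equations, firstly the first-order condition for the estimated intercept:
$$\frac{\partial S}{\partial b_1} = \sum_{i=1}^n -2(y_i - b_1 - b_2 x_i) = -2\sum_{i=1}^n \hat\varepsilon_i = 0$$
which shows that the sum (and hence mean) of the residuals is zero, so the formula for the covariance between $\hat\varepsilon$ and any variable $x$ then reduces to $\frac{1}{n-1}\sum_{i=1}^n x_i \hat\varepsilon_i$. We see this is zero by considering the first-order condition for the estimated slope, which is that
$$\frac{\partial S}{\partial b_2} = \sum_{i=1}^n -2x_i(y_i - b_1 - b_2 x_i) = -2\sum_{i=1}^n x_i \hat\varepsilon_i = 0$$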
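If you are used to working with matrices, we can generalise this to multiple regression by defining $S(b) = \varepsilon'\varepsilon = (y - Xb)'(y - Xb)$; the first-order condition to minimise $S(b)$ at optimal $b = \hat\beta$ is:
$$\frac{dS}{db}(\hat\beta) = \frac{d}{db}\left(y'y - b'X'y - y'Xb + b'X'Xb\right)\bigg|_{b=\hat\beta} = -2X'y + 2X'X\hat\beta = -2X'(y - X\hat\beta) = -2X'\hat\varepsilon = 0$$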
This implies each row of $X'$, and hence each column of $X$, is orthogonal to $\hat\varepsilon$. Then if the design matrix $X$ has a column of ones (which happens if your model has an intercept term), we must have $\sum_{i=1}^n \hat\varepsilon_i = 0$, so the residuals have zero sum and zero mean. The covariance between $\hat\varepsilon$ and any variable $x$ is again $\frac{1}{n-1}\sum_{i=1}^n x_i \hat\varepsilon_i$, and for any variable $x$ included in our model we know this sum is zero, because $\hat\varepsilon$ is orthogonal to every column of the design matrix. Hence there is zero covariance, and zero correlation, between $\hat\varepsilon$ and any predictor variable $x$.
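The orthogonality condition $X'\hat\varepsilon = 0$ is easy to check numerically. The following is a minimal sketch with arbitrary simulated data (the dimensions and the random `y` are assumptions chosen purely for illustration). It solves the normal equations $X'X\hat\beta = X'y$ directly and confirms that the residuals are orthogonal to every column of `X`, including the column of ones.

```python
import numpy as np

rng = np.random.default_rng(1)
n = 500

# Design matrix with an intercept column and three arbitrary predictors
X = np.column_stack([np.ones(n), rng.normal(size=(n, 3))])
y = rng.normal(size=n)  # any y will do: the orthogonality is purely algebraic

# Solve the normal equations X'X b = X'y
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
resid = y - X @ beta_hat

orth = X.T @ resid                        # one inner product per column of X
cov_with_x1 = np.cov(X[:, 1], resid)[0, 1]

print("X' resid          =", orth)
print("sum of residuals  =", resid.sum())
print("cov(x1, resid)    =", cov_with_x1)
```

Note that `y` here is pure noise, unrelated to `X`. The residuals are still orthogonal to the predictors, which underlines that this property comes from the estimation method and not from the model being correct.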
If you prefer a more geometric view of things, our desire that $\hat y$ lies as close as possible to $y$ in a Pythagorean kind of way, and the fact that $\hat y$ is constrained to the column space of the design matrix $X$, dictate that $\hat y$ should be the orthogonal projection of the observed $y$ onto that column space. Hence the vector of residuals $\hat\varepsilon = y - \hat y$ is orthogonal to every column of $X$, including the vector of ones $\mathbf{1}_n$ if an intercept term is included in the model. As before, this implies the sum of residuals is zero, whence the residual vector's orthogonality with the other columns of $X$ ensures it is uncorrelated with each of those predictors.
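The geometric picture can also be sketched in code. This example assumes a small random design matrix (all values are illustrative) and forms the projection matrix $P = X(X'X)^{-1}X'$ explicitly, which is fine for a demonstration but not how one would fit a large regression in practice. It checks that $P$ is a symmetric idempotent projection, that the Pythagorean decomposition holds, and that some other point in the column space is farther from $y$ than $\hat y$ is.

```python
import numpy as np

rng = np.random.default_rng(2)
n = 50

X = np.column_stack([np.ones(n), rng.normal(size=(n, 2))])
y = rng.normal(size=n)

# Orthogonal projection onto the column space of X
P = X @ np.linalg.solve(X.T @ X, X.T)
y_hat = P @ y
resid = y - y_hat

# An arbitrary alternative coefficient vector (not the OLS solution)
beta_hat = np.linalg.solve(X.T @ X, X.T @ y)
b_other = beta_hat + rng.normal(size=3)
dist_ols = np.linalg.norm(resid)
dist_other = np.linalg.norm(y - X @ b_other)

print("P idempotent:     ", np.allclose(P @ P, P))
print("P symmetric:      ", np.allclose(P, P.T))
print("Pythagoras holds: ", np.isclose(y @ y, y_hat @ y_hat + resid @ resid))
print("OLS is closest:   ", dist_ols < dist_other)
```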
But nothing we have done here says anything about the true errors $\varepsilon$. Assuming there is an intercept term in our model, the residuals $\hat\varepsilon$ are only uncorrelated with $x$ as a mathematical consequence of the manner in which we chose to estimate the regression coefficients $\hat\beta$. The way we selected our $\hat\beta$ affects our predicted values $\hat y$ and hence our residuals $\hat\varepsilon = y - \hat y$. If we choose $\hat\beta$ by OLS, we must solve the normal equations, and these enforce that our estimated residuals $\hat\varepsilon$ are uncorrelated with $x$. Our choice of $\hat\beta$ affects $\hat y$ but not $E(y)$, and hence imposes no conditions on the true errors $\varepsilon = y - E(y)$. It would be a mistake to think that $\hat\varepsilon$ has somehow "inherited" its uncorrelatedness with $x$ from the OLS assumption that $\varepsilon$ should be uncorrelated with $x$. The uncorrelatedness arises from the normal equations.
Simple example:
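The data generating process is:
$$y_i = a + b_1 x_{i,1} + b_2 x_{i,2} + \epsilon_i$$
where $y_i$ is the amount spent on visit $i$ to the grocery store, $x_{i,1}$ and $x_{i,2}$ are the numbers of burgers and buns bought on that visit, and $b_1$ and $b_2$ are the prices of a burger and a bun.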
If we ran that regression, we would get estimates $\hat a$, $\hat b_1$, and $\hat b_2$, and with enough data, they would converge on $a$, $b_1$, and $b_2$ respectively.
(Technical note: We need a little randomness so we don't buy exactly one bun for each burger we buy at every visit to the grocery store. If we did this, $x_1$ and $x_2$ would be collinear.)
An example of omitted variable bias:
Now let's consider the model:
$$y_i = a + b_1 x_{i,1} + u_i$$
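Observe that $u_i = b_2 x_{i,2} + \epsilon_i$. Hence, assuming $\epsilon$ is uncorrelated with $x_1$ in the full model,
$$\operatorname{Cov}(x_1, u) = \operatorname{Cov}(x_1, b_2 x_2 + \epsilon) = b_2 \operatorname{Cov}(x_1, x_2) + \operatorname{Cov}(x_1, \epsilon) = b_2 \operatorname{Cov}(x_1, x_2)$$
Is this zero? Almost certainly not! The purchase of burgers $x_1$ and the purchase of buns $x_2$ are almost certainly correlated! Hence $u$ and $x_1$ are correlated!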
What happens if you tried to run the regression?
If you tried to run:
$$y_i = \hat a + \hat b_1 x_{i,1} + \hat u_i$$
Your estimate $\hat b_1$ would almost certainly be a poor estimate of $b_1$, because the OLS regression estimates $\hat a$, $\hat b_1$, $\hat u$ would be constructed so that $\hat u$ and $x_1$ are uncorrelated in your sample. But the actual $u$ is correlated with $x_1$ in the population!
What would happen in practice if you did this? Your estimate $\hat b_1$ of the price of burgers would ALSO pick up the price of buns. Let's say every time you bought a \$1 burger you tended to buy a \$0.50 bun (but not all the time). Your estimate of the price of burgers might be \$1.40. You'd be picking up the burger channel and the bun channel in your estimate of the burger price.
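To see numbers like these come out of a regression, here is a rough simulation of the burger and bun story. Everything specific in it is an assumption made for illustration: the Poisson burger counts, the rule that each burger comes with a bun 80% of the time, the baseline spending `a`, and the noise level. With that 80% rule, the short regression's burger coefficient should land near 1 + 0.5 × 0.8 = 1.40.

```python
import numpy as np

rng = np.random.default_rng(3)
n = 200_000

# Assumed prices and baseline spending (hypothetical values)
a, b1, b2 = 5.0, 1.0, 0.5

x1 = rng.poisson(2.0, size=n)       # burgers bought per visit
x2 = rng.binomial(x1, 0.8)          # buns: one per burger, 80% of the time
eps = rng.normal(size=n)            # other spending, unrelated to x1 and x2
y = a + b1 * x1 + b2 * x2 + eps

def ols(X, y):
    """Least-squares coefficients for design matrix X."""
    return np.linalg.lstsq(X, y, rcond=None)[0]

ones = np.ones(n)
a_full, b1_full, b2_full = ols(np.column_stack([ones, x1, x2]), y)
a_short, b1_short = ols(np.column_stack([ones, x1]), y)

# Slope from regressing buns on burgers: Cov(x1, x2) / Var(x1)
delta = np.cov(x1, x2)[0, 1] / np.var(x1, ddof=1)

print(f"full model:  b1_hat = {b1_full:.3f}, b2_hat = {b2_full:.3f}")
print(f"short model: b1_hat = {b1_short:.3f}")
print(f"b1_full + b2_full * delta = {b1_full + b2_full * delta:.3f}")
```

The last line reflects the usual omitted-variable bias identity: in any sample, the short-regression slope equals the full-regression slope on burgers plus the bun coefficient times the slope of buns on burgers.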
Suppose that we're building a regression of the weight of an animal on its height. Clearly, the weight of a dolphin would be measured differently (by a different procedure and with different instruments) from the weight of an elephant or a snake. This means that the model errors will depend on the height, i.e. on the explanatory variable. They could be dependent in many different ways. For instance, maybe we tend to slightly overestimate the elephants' weights and slightly underestimate the snakes', etc.
So, here we have established that it is easy to end up in a situation where the errors are correlated with the explanatory variables. Now, if we ignore this and proceed with the regression as usual, we'll notice that the regression residuals are not correlated with the design matrix. This is because, by design, the regression forces the residuals to be uncorrelated. Note also that residuals are not the errors; they are estimates of the errors. So, regardless of whether the errors themselves are correlated with the independent variables, the error estimates (residuals) will be uncorrelated with them by construction of the regression solution.