In Kahneman and Deaton (2010)†, the authors write the following:
"This regression explains 37% of the variance, with a root mean square error (RMSE) of 0.67852. To eliminate outliers and implausible income reports, we dropped observations in which the absolute value of the difference between log income and its prediction exceeded 2.5 times the RMSE."
Is this common practice? What is the intuition behind it? It seems a bit odd to define an outlier based on a model that may not be well specified in the first place. Shouldn't the determination of outliers rest on some theoretical grounds for what constitutes a plausible value, rather than on how well your model predicts the actual values?
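For concreteness, the rule in the quoted passage can be sketched as follows. This is a minimal illustration on simulated data, not the authors' code: the function name `trim_by_residual`, the single predictor, and the simulated coefficients are all assumptions, and the RMSE here divides by n rather than by the residual degrees of freedom.

```python
import numpy as np

def trim_by_residual(x, y, k=2.5):
    """Fit OLS of y on x, then flag rows with |residual| <= k * RMSE.

    Hypothetical helper for illustration only.
    """
    X1 = np.column_stack([np.ones(len(y)), x])      # add an intercept column
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)   # OLS coefficients
    resid = y - X1 @ beta
    rmse = np.sqrt(np.mean(resid ** 2))
    keep = np.abs(resid) <= k * rmse                # True = observation is kept
    return keep, rmse

# Simulated stand-in for log income (all numbers are made up).
rng = np.random.default_rng(0)
n = 10_000
x = rng.normal(size=n)
log_income = 10.0 + 0.5 * x + rng.normal(scale=0.68, size=n)

keep, rmse = trim_by_residual(x, log_income)
print(f"RMSE = {rmse:.3f}, share dropped = {1 - keep.mean():.2%}")
```

If the errors were exactly normal, about 1.2% of observations would be expected to lie beyond 2.5 standard deviations, so a rule like this would be expected to discard some data even when nothing is wrong with it.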
†: Daniel Kahneman, Angus Deaton (2010): High income improves evaluation of life but not emotional well-being. Proceedings of the National Academy of Sciences Sep 2010, 107 (38) 16489-16493; DOI: 10.1073/pnas.1011492107
Answers:
The reason for dropping this data is stated right there in the quote: namely, to "eliminate outliers and implausible income reports". The fact that they refer to both of these things in conjunction means that they are conceding that at least some of their outliers are not implausible values, and in any case, they give no argument for why values with a high residual should be considered "implausible" income values. By doing this, they are effectively removing data points because the residuals are higher than what is expected in their regression model. As I have stated in other answers here, this is tantamount to requiring reality to conform to your model assumptions, and ignoring parts of reality that are non-compliant with those assumptions.
Whether or not this is a common practice, it is a terrible practice. It occurs because the outlying data points are hard to deal with, and the analyst is unwilling to model them properly (e.g., by using a model that allows higher kurtosis in the error terms), so they just remove the parts of reality that don't conform to their ability to undertake statistical modelling. This practice is statistically undesirable, and it leads to inferences that systematically underestimate variance and kurtosis in the error terms. The authors of this paper report that they dropped 3.22% of their data due to the removal of these outliers (p. 16490). Since most of these data points would have been very high incomes, this casts substantial doubt on their ability to draw robust conclusions about the effect of high incomes (which is the goal of their paper).
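The point about underestimated variance and kurtosis can be checked on simulated data. The sketch below is an illustration under assumed conditions (Student-t errors with 5 degrees of freedom, one predictor, made-up coefficients), not a reanalysis of the paper's data; the helper names `ols_residuals` and `excess_kurtosis` are hypothetical.

```python
import numpy as np

def ols_residuals(x, y):
    """Residuals from an OLS fit of y on x with an intercept."""
    X1 = np.column_stack([np.ones(len(y)), x])
    beta, *_ = np.linalg.lstsq(X1, y, rcond=None)
    return y - X1 @ beta

def excess_kurtosis(r):
    """Sample excess kurtosis (0 for a normal distribution)."""
    z = (r - r.mean()) / r.std()
    return np.mean(z ** 4) - 3.0

# Assumed data-generating process with heavy-tailed errors.
rng = np.random.default_rng(1)
n = 50_000
x = rng.normal(size=n)
y = 10.0 + 0.5 * x + 0.5 * rng.standard_t(df=5, size=n)

# Step 1: fit on all data and apply the 2.5 * RMSE rule.
resid_full = ols_residuals(x, y)
rmse = np.sqrt(np.mean(resid_full ** 2))
keep = np.abs(resid_full) <= 2.5 * rmse

# Step 2: refit on the trimmed sample only.
resid_trim = ols_residuals(x[keep], y[keep])

print(f"share dropped:          {1 - keep.mean():.2%}")
print(f"residual SD   full/trim: {resid_full.std():.3f} / {resid_trim.std():.3f}")
print(f"excess kurt.  full/trim: {excess_kurtosis(resid_full):.2f} / "
      f"{excess_kurtosis(resid_trim):.2f}")
```

In a setup like this, the trimmed fit reports a smaller residual spread and lighter tails than the process that actually generated the data, which is the sense in which the remaining sample is made to look better behaved than reality.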