There has been some confusion in my head about two types of estimators of the population value of the Pearson correlation coefficient.
A. Fisher (1915) showed that for a bivariate normal population the sample r is a negatively biased estimator of ρ, although the bias can be of a practically considerable amount only for small sample sizes (n < 30). The sample r underestimates ρ in the sense that it is closer to 0 than ρ. (Except when the latter is 0 or ±1, for then r is unbiased.) Several nearly unbiased estimators of ρ have been proposed, the best probably being the Olkin and Pratt (1958) corrected r:

r_unbiased = r [1 + (1 - r^2) / (2(n - 3))]
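As a minimal numeric sketch of the approximate (rather than the exact hypergeometric) form of this correction:

```python
def olkin_pratt_approx(r, n):
    """Approximate Olkin-Pratt nearly unbiased estimator of rho.

    Expands the sample r away from 0; the correction vanishes as n grows
    and when r is 0 or +/-1 (where r is already unbiased).
    """
    return r * (1.0 + (1.0 - r ** 2) / (2.0 * (n - 3)))

# A sample correlation of 0.50 from n = 10 observations is nudged upward:
print(olkin_pratt_approx(0.5, 10))   # roughly 0.5268
```

Note that the adjustment moves r *away* from 0, counteracting the underestimation described above.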
B. On the other hand, r² is positively biased relative to ρ², meaning, in absolute value: r² tends to be farther from 0 than ρ² (is that statement true?). The texts say it is the same problem as the over-estimation of the standard deviation parameter by its sample value. There exist many formulas to "adjust" the observed R² closer to its population parameter, the most well-known (but not the best) being the usual adjusted R² with k predictors:

R²_adj = 1 - (1 - R²)(n - 1)/(n - k - 1)

The square root of such adjusted R² is called the shrunken r:

r_shrunk = √(R²_adj)
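A sketch of these two quantities in code (using the common adjusted-R² formula above, with k = 1 for the simple-regression case):

```python
import math

def adjusted_r2(r2, n, k=1):
    """Common adjusted R^2 with k predictors (k = 1 for simple regression)."""
    return 1.0 - (1.0 - r2) * (n - 1) / (n - k - 1)

def shrunken_r(r, n):
    """'Shrunken' r: square root of the adjusted R^2, carrying the sign
    of r; clipped at 0 if the adjustment goes negative."""
    adj = adjusted_r2(r ** 2, n, k=1)
    return math.copysign(math.sqrt(max(adj, 0.0)), r)

# The same r = 0.50, n = 10 is pulled toward 0:
print(shrunken_r(0.5, 10))   # roughly 0.395
```

Note the direction: for r = 0.5 and n = 10 the shrunken value falls below 0.5, while the Olkin-Pratt correction from A pushes the same r above 0.5, which is exactly the tension the question raises.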
So here are two different estimators of ρ. Very different: the first one inflates r, the second deflates it. How to reconcile them? Where should one use/report the one, and where the other?
In particular, can it be that the "shrunken" estimator is (nearly) unbiased too, like the "unbiased" one, but only in a different context: the asymmetrical context of regression? For in OLS regression we consider the values on one side (the predictor) as fixed, so they carry no random error from sample to sample. (And to add here, regression does not need bivariate normality.)
Answers:
Regarding the bias in the correlation: when sample sizes are small enough for bias to have any practical significance (e.g., the n < 30 you suggested), then bias is likely to be the least of your worries, because the imprecision of the estimate is terrible.
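To put a number on that imprecision (a sketch using the standard Fisher z-transform interval; the 1.96 multiplier gives an approximate 95% interval):

```python
import math

def pearson_ci(r, n, z=1.96):
    """Approximate 95% confidence interval for rho via Fisher's z-transform."""
    fz = math.atanh(r)
    se = 1.0 / math.sqrt(n - 3)
    return math.tanh(fz - z * se), math.tanh(fz + z * se)

# Even at n = 30, the interval around an observed r = 0.30 is very wide:
lo, hi = pearson_ci(0.30, 30)
print(round(lo, 2), round(hi, 2))   # roughly -0.07 to 0.60
```

An interval that spans from slightly negative to strongly positive dwarfs a bias of a few hundredths.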
Regarding the bias of R² in multiple regression, there are many different adjustments that pertain to unbiased population estimation vs. unbiased estimation in an independent sample of equal size. See Yin, P. & Fan, X. (2001). Estimating R² shrinkage in multiple regression: A comparison of analytical methods. The Journal of Experimental Education, 69, 203-224.
Modern-day regression methods also address the shrinkage of regression coefficients, and of R² as a consequence -- e.g., the elastic net with k-fold cross validation, see http://web.stanford.edu/~hastie/Papers/elasticnet.pdf.
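For flavor, ridge regression (the pure-L2 end of the elastic net) has a closed form that makes coefficient shrinkage easy to see. This is an illustrative sketch on synthetic data with an arbitrary penalty value, not the elastic-net procedure from the paper:

```python
import numpy as np

rng = np.random.default_rng(42)
n, p = 50, 5
X = rng.normal(size=(n, p))
true_beta = np.array([1.0, 0.5, 0.0, 0.0, 0.0])
y = X @ true_beta + rng.normal(size=n)

def ridge(X, y, lam):
    """Closed-form ridge solution: (X'X + lam*I)^{-1} X'y.
    lam = 0 recovers ordinary least squares."""
    return np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)

ols = ridge(X, y, 0.0)
penalized = ridge(X, y, 10.0)
# The norm of the coefficient vector is monotone decreasing in lam:
print(np.linalg.norm(penalized) < np.linalg.norm(ols))   # True
```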
I think the answer lies in the contrast between simple and multiple regression. In simple regression with one IV and one DV, R² is not positively biased, and in fact may be negatively biased, given that r is negatively biased. But in multiple regression with several IVs, which may themselves be correlated, R² may be positively biased because of any "suppression" that may be happening. Thus, my take is that the observed R² overestimates the corresponding population R², but only in multiple regression.
"R² is not positively biased, and in fact may be negatively biased"
Interesting. Can you show it, or give a reference? In a bivariate normal population, can the observed sample R² statistic be a negatively biased estimator of ρ²?
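One way to probe this comment's question is by simulation (an illustrative sketch, not a proof; sample size and correlations are arbitrary):

```python
import numpy as np

def mean_sample_r2(rho, n, nsim=5000, seed=0):
    """Monte Carlo estimate of E[r^2] for bivariate normal samples
    of size n with population correlation rho."""
    rng = np.random.default_rng(seed)
    cov = [[1.0, rho], [rho, 1.0]]
    r2 = np.empty(nsim)
    for i in range(nsim):
        x, y = rng.multivariate_normal([0.0, 0.0], cov, size=n).T
        r2[i] = np.corrcoef(x, y)[0, 1] ** 2
    return r2.mean()

# Compare E[r^2] against rho^2 at a weak and a strong correlation:
for rho in (0.1, 0.9):
    print(rho ** 2, mean_sample_r2(rho, n=10))
```

Comparing the two printed columns at different values of rho shows in which regimes the sample r² sits above or below ρ².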