Pearson correlation of data sets with possibly zero standard deviation?

12

I am having trouble computing the Pearson correlation coefficient of data sets with possibly zero standard deviation (i.e. all data have the same value).

Suppose that I have the following two data sets:

float x[] = {2, 2, 2, 3, 2};
float y[] = {2, 2, 2, 2, 2};

The correlation coefficient "r" would be computed using the following equation:

float r = covariance(x, y) / (std_dev(x) * std_dev(y));

However, because all data in data set "y" have the same value, the standard deviation std_dev(y) would be zero and "r" would be undefined.
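As a quick check of the arithmetic for the two data sets above, written with raw sums of squares so that the choice of the 1/N or 1/(N-1) convention does not matter (both cancel in the ratio):

$\overline{x}=2.2,\qquad \overline{y}=2$

$\sum_{i}(x_i-\overline{x})^2=4(0.2)^2+(0.8)^2=0.8,\qquad \sum_{i}(y_i-\overline{y})^2=0,\qquad \sum_{i}(x_i-\overline{x})(y_i-\overline{y})=0$

$r=\frac{\sum_{i}(x_i-\overline{x})(y_i-\overline{y})}{\sqrt{\sum_{i}(x_i-\overline{x})^2\,\sum_{i}(y_i-\overline{y})^2}}=\frac{0}{\sqrt{0.8\cdot 0}}=\frac{0}{0}$

So both the numerator and the denominator vanish, which is why no value of r follows from the formula.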

Is there any solution to this problem? Or should I use other methods to measure the relationship between the data in this case?

correlation Andree

There is no "data relationship" in this example because y does not vary. Assigning any numerical value to r would be a mistake.

whuber

1

@whuber - it is true that $r$ is undefined, but not necessarily that the "true" unknown correlation $\rho$ cannot be estimated. You just have to use something different to estimate it.

probabilityislogic

@probability You presuppose that this is an estimation problem and not merely one of characterization. But accepting that, what estimator would you propose in the example? No answer can be universally correct because it depends on how the estimator will be used (a loss function, in effect). In many applications, such as PCA, it seems likely that using any procedure that imputes a value to $\rho$ would be worse than other procedures that recognize that $\rho$ cannot be identified.

whuber

1

@whuber - "estimate" was a bad choice of words on my part (you may have noticed that I am not the best wordsmith). What I meant was that although $\rho$ may not be uniquely identified, this does not mean the data are useless in informing us about $\rho$. My answer gives an (ugly) demonstration of this from an algebraic point of view.

probabilityislogic

@probability It seems your analysis is contradictory: if y is indeed modeled with a normal distribution, a sample of five 2's shows that this model is inappropriate. Ultimately, you don't get something for nothing: your results depend strongly on the assumptions made about the priors. The original problems in identifying $\rho$ are still there, but they have been hidden by all these additional assumptions. IMHO that just obscures the issues rather than clarifying them.

whuber

9

The "sampling theory" people will tell you that no such estimate exists. But you can get one; you just need to be reasonable about your prior information and do much harder mathematical work.

If you specify a Bayesian method of estimation and the posterior is the same as the prior, then you can say that the data say nothing about the parameter. Because things may get "singular" on us, we cannot use infinite parameter spaces. I am assuming that, because you use the Pearson correlation, you have a bivariate normal likelihood:

$p(D|\mu_x,\mu_y,\sigma_x,\sigma_y,\rho)=\left(\sigma_x\sigma_y\sqrt{2\pi(1-\rho^2)}\right)^{-N}\exp\left(-\frac{\sum_{i}Q_i}{2(1-\rho^2)}\right)$

where

$Q_i=\frac{(x_i-\mu_x)^2}{\sigma_x^2}+\frac{(y_i-\mu_y)^2}{\sigma_y^2}-2\rho\frac{(x_i-\mu_x)(y_i-\mu_y)}{\sigma_x\sigma_y}$

Now, to indicate that one data set may have the same value throughout, write $y_i=y$, and we get:

$\sum_{i}Q_i=N\left[\frac{(y-\mu_y)^2}{\sigma_y^2}+\frac{s_x^2 + (\overline{x}-\mu_x)^2}{\sigma_x^2}-2\rho\frac{(\overline{x}-\mu_x)(y-\mu_y)}{\sigma_x\sigma_y}\right]$

where

$s_x^2=\frac{1}{N}\sum_{i}(x_i-\overline{x})^2$

And so your likelihood depends on four numbers, $s_x^2,y,\overline{x},N$. You want an estimate of $\rho$, so you need to multiply by a prior and integrate out the nuisance parameters $\mu_x,\mu_y,\sigma_x,\sigma_y$. Now, to prepare for the integration, we "complete the square" in $\mu_y$:

$\frac{\sum_{i}Q_i}{1-\rho^2}=N\left[\frac{\left(\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]\right)^2}{\sigma_y^2(1-\rho^{2})}+\frac{s_x^2}{\sigma_{x}^{2}(1-\rho^{2})} + \frac{(\overline{x}-\mu_x)^2}{\sigma_x^2}\right]$

Now we should err on the side of caution and ensure a properly normalized probability. That way we can't get into trouble. One such option is to use a weakly informative prior, which just places a restriction on the range of each parameter. So we have $L_{\mu}<\mu_x,\mu_y<U_{\mu}$ for the means with a flat prior, and $L_{\sigma}<\sigma_x,\sigma_y<U_{\sigma}$ for the standard deviations with a Jeffreys prior. These limits are easy to set with a bit of "common sense" thinking about the problem. I will take an unspecified prior for $\rho$ (uniform should work OK; if not, truncate the singularity at $\pm 1$), and so we get:

$p(\rho,\mu_x,\mu_y,\sigma_x,\sigma_y)=\frac{p(\rho)}{A\sigma_x\sigma_y}$

where $A=2(U_{\mu}-L_{\mu})^{2}[\log(U_{\sigma})-\log(L_{\sigma})]^{2}$. This gives a posterior of:

$p(\rho|D)\propto\int p(\rho,\mu_x,\mu_y,\sigma_x,\sigma_y)\,p(D|\mu_x,\mu_y,\sigma_x,\sigma_y,\rho)\,d\mu_y\,d\mu_x\,d\sigma_x\,d\sigma_y$

$=\frac{p(\rho)}{A[2\pi(1-\rho^2)]^{\frac{N}{2}}}\int_{L_{\sigma}}^{U_{\sigma}}\int_{L_{\sigma}}^{U_{\sigma}}\left(\sigma_x\sigma_y\right)^{-N-1}\exp\left(-\frac{N s_x^2}{2\sigma_{x}^{2}(1-\rho^{2})}\right) \times$

$\int_{L_{\mu}}^{U_{\mu}}\exp\left(-\frac{N(\overline{x}-\mu_x)^2}{2\sigma_x^2}\right)\int_{L_{\mu}}^{U_{\mu}}\exp\left(-\frac{N\left(\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]\right)^2}{2\sigma_y^2(1-\rho^{2})}\right)d\mu_y\,d\mu_x\,d\sigma_x\,d\sigma_y$

Now the first integration over $\mu_y$ can be done by making a change of variables $z=\sqrt{N}\frac{\mu_y-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]}{\sigma_y\sqrt{1-\rho^{2}}}\implies dz=\frac{\sqrt{N}}{\sigma_y\sqrt{1-\rho^{2}}}d\mu_y$ and the first integral over $\mu_y$ becomes:

$\frac{\sigma_y\sqrt{2\pi(1-\rho^{2})}}{\sqrt{N}}\left[\Phi\left( \frac{U_{\mu}-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]}{\frac{\sigma_y}{\sqrt{N}}\sqrt{1-\rho^{2}}} \right)-\Phi\left( \frac{L_{\mu}-\left[y-(\overline{x}-\mu_x)\frac{\rho\sigma_y}{\sigma_x}\right]}{\frac{\sigma_y}{\sqrt{N}}\sqrt{1-\rho^{2}}} \right)\right]$

And you can see from here, no analytic solutions are possible. However, it is also worthwhile to note that the value $\rho$ has not dropped out of the equations. This means that the data and prior information still have something to say about the true correlation. If the data said nothing about the correlation, then we would be simply left with $p(\rho)$ as the only function of $\rho$ in these equations.

It also shows that passing to the limit of infinite bounds for $\mu_y$ "throws away" some of the information about $\rho$, which is contained in the complicated-looking normal CDF function $\Phi(.)$. Now if you have a lot of data, then passing to the limit is fine, you don't lose much, but if you have very scarce information, such as in your case, it is important to keep every scrap you have. It means ugly maths, but this example is not too hard to do numerically. So we can evaluate the integrated likelihood for $\rho$ at values of, say, $-0.99,-0.98,\dots,0.98,0.99$ fairly easily. Just replace the integrals by summations over small enough intervals, so you have a triple summation.

probabilityislogic

@probabilityislogic: Wow. Simply wow. After seeing some of your answers I really wonder: what should a doofus like me do to reach such a flexible Bayesian state of mind?

steffen

1

@steffen - lol. It's not that difficult, you just need to practice. And always always always remember that the product and sum rules of probability are the only rules you will ever need. They will extract whatever information is there - whether you see it or not. So you apply product and sum rules, then just do the maths. That is all I have done here.

probabilityislogic

@steffen - and the other rule - more a mathematical one than stats one - don't pass to an infinite limit too early in your calculations, your results may become arbitrary, or little details may get thrown out. Measurement error models are a perfect example of this (as is this question).

probabilityislogic

@probabilityislogic: Thank you, I'll keep this in mind... as soon as I am done working through my "Bayesian Analysis"-copy ;).

steffen

@probabilityislogic: If you could humor a nonmathematical statistician/researcher...would it be possible to summarize or translate your answer to a group of dentists or high school principals or introductory statistics students?

rolando2

6

I agree with sesqu that the correlation is undefined in this case. Depending on your type of application you could, e.g., calculate the Gower similarity between both vectors, which is: $gower(v1,v2)=\frac{\sum_{i=1}^{n}\delta(v1_i,v2_i)}{n}$ where $\delta$ represents the Kronecker delta, applied as a function to $v1,v2$.

So for instance if all values are equal, gower(.,.)=1. If on the other hand they differ in only one dimension, gower(.,.)=0.9 (for vectors of length 10). If they differ in every dimension, gower(.,.)=0, and so on.

Of course this is no measure of correlation, but it allows you to calculate how close the vector with s>0 is to the one with s=0. Of course you can apply other metrics, too, if they serve your purpose better.
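A minimal C sketch of the formula above, assuming exact equality is the right notion of "same value" for the data at hand (with noisy floating-point measurements a tolerance would probably be needed). The function name `gower_similarity` is made up for illustration.

```c
#include <assert.h>
#include <stddef.h>

/* Fraction of positions at which the two vectors hold equal values.
   The equality test plays the role of the Kronecker delta. */
double gower_similarity(const float *v1, const float *v2, size_t n)
{
    assert(v1 != NULL && v2 != NULL && n > 0);
    size_t matches = 0;
    for (size_t i = 0; i < n; ++i) {
        if (v1[i] == v2[i]) {
            ++matches;
        }
    }
    return (double)matches / (double)n;
}
```

For the two arrays in the question, four of the five positions agree, so the similarity is 4/5.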

steffen

+1 That's a creative idea. It sounds like the "Gower Similarity" is a scaled Hamming distance.

whuber

@whuber: Indeed it is!

steffen

0

The correlation is undefined in that case. If you must define it, I would define it as 0, but consider a simple mean absolute difference instead.
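The mean absolute difference mentioned above might look like this in C. It is a sketch only, and the name `mean_abs_diff` is hypothetical.

```c
#include <assert.h>
#include <math.h>
#include <stddef.h>

/* Average of |x_i - y_i| over the n paired observations. */
double mean_abs_diff(const float *x, const float *y, size_t n)
{
    assert(x != NULL && y != NULL && n > 0);
    double sum = 0.0;
    for (size_t i = 0; i < n; ++i) {
        sum += fabs((double)x[i] - (double)y[i]);
    }
    return sum / (double)n;
}
```

For the arrays in the question only one pair differs, by 1, so the result is 1/5. Unlike r, this quantity is on the scale of the data and says nothing about linear association.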

sesqu

0

This question is coming from programmers, so I'd suggest plugging in zero. There's no evidence of a correlation, and the null hypothesis would be zero (no correlation). There might be other context knowledge that would provide a "typical" correlation in one context, but the code might be re-used in another context.

zbicyclist

2

There's no evidence of lack of correlation either, so why not plug in 1? Or -1? Or anything in between? They all lead to re-usable code!

whuber

@whuber - you plug in zero because the data is "less constrained" when it is independent - this is why maxent distributions are independent unless you explicitly specify correlations in the constraints. Independence can be viewed as a conservative assumption when you know of no such correlations - effectively you are averaging over all possible correlations.

probabilityislogic

1

@prob I question why it makes sense as a generic procedure to average over all correlations. In effect this procedure substitutes the definite and possibly quite wrong answer "zero!" for the correct answer "the data don't tell us." That difference can be important for decision making.

whuber

Just because the question might be from a programmer, does not mean you should convert an undefined value to zero. Zero means something specific in a correlation calculation. Throw an exception. Let the caller decide what should happen. Your function should calculate a correlation, not decide what to do if one cannot be computed.

Jared Becksfort
