data points: (7,1), (3,4), (1,5), (5,8), (1,3), (7,8), (8,2), (5,9), (8,0)
l = 2 // oversampling factor
k = 3 // no. of desired clusters
Step 1:
Suppose the first centroid is C = {c1} = {(8,0)}. Then X = {x1, x2, x3, x4, x5, x6, x7, x8} = {(7,1), (3,4), (1,5), (5,8), (1,3), (7,8), (8,2), (5,9)}.
Step 2:
φ_X(C) is the sum of all the smallest 2-norm distances (Euclidean distance) from all the points of the set X to the points of C. In other words, for each point in X find the distance to the closest point in C, and in the end compute the sum of all these minimum distances, one for each point of X.
Denote with d²_C(x_i) the distance from x_i to the closest point in C. We then have ψ = ∑_{i=1}^{n} d²_C(x_i).
At step 2, C contains a single element (see step 1) and X is the set of all the elements. Thus, at this step, d²_C(x_i) is simply the distance between the point in C and x_i. Thus φ = ∑_{i=1}^{n} ||x_i − c||².
ψ = ∑_{i=1}^{n} d²(x_i, c1) = 1.41 + 6.4 + 8.6 + 8.54 + 7.61 + 8.06 + 2 + 9.4 = 52.128
log(ψ) = log(52.128) = 3.95 ≈ 4 (rounded)
Note, however, that in step 3 the general formula is applied, since C will then contain more than one point.
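To make these numbers easy to reproduce, here is a minimal self-contained sketch in plain Java (the class and helper names are made up for this illustration; they are not from the paper):

// minimal sketch reproducing the psi and log(psi) values above
public class PsiExample {
    // Euclidean (2-norm) distance between two 2D points
    static double norm2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }
    public static void main(String[] args) {
        // X from step 1 (all points except the chosen centroid c1 = (8,0))
        double[][] X = {{7,1},{3,4},{1,5},{5,8},{1,3},{7,8},{8,2},{5,9}};
        double[] c1 = {8, 0};
        double psi = 0;
        for (double[] x : X) psi += norm2(x, c1); // only one centroid so far
        System.out.println(psi);           // ~52.128
        System.out.println(Math.log(psi)); // ~3.95 -> 4 rounds (rounded)
    }
}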
Step 3:
The for loop is executed for the previously computed log(ψ) (4 iterations in this example).
The draws are not as you understood them. The draws are independent, which means you will execute one draw for each point in X. So, for each point in X, denoted as x_i, compute a probability p_x = l · d²(x, C) / φ_X(C). Here l is a factor given as a parameter, d²(x, C) is the distance to the closest center, and φ_X(C) is the quantity explained in step 2.
The algorithm is simply:
- iterate over all points x_i in X
- for each x_i compute the probability p_{x_i}
- draw a uniform random number in [0, 1]; if it is smaller than p_{x_i}, select x_i to form C′
- add all the points selected in C′ to C
Note that each time step 3 is executed within an iteration (line 3 of the original algorithm), you expect to select l points from X (this is easily shown by writing out the formula for the expectation directly).
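Writing that expectation out explicitly (using the notation from step 2): since the draws are independent, the expected number of points selected in one round is E = ∑_{i=1}^{n} p_{x_i} = ∑_{i=1}^{n} l · d²(x_i, C) / φ_X(C) = l · φ_X(C) / φ_X(C) = l.

In code, step 3 then looks like this: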
Random rnd = new Random();
for(int round=0; round<4; round++) { // log(psi) = 4 rounds
    // compute d2, the distance from each x_i to its closest centroid in C
    double[] d2 = new double[X.size()];
    for(int i=0; i<X.size(); i++) {
        double min = Double.POSITIVE_INFINITY;
        for(int j=0; j<C.size(); j++) {
            if(min > norm2(X.get(i), C.get(j))) min = norm2(X.get(i), C.get(j));
        }
        d2[i] = min;
    }
    // compute phi_X(C), the sum of all the minimum distances
    double phi_c = 0;
    for(int i=0; i<X.size(); i++) phi_c += d2[i];
    // do the drawings: one independent draw per point
    for(int i=X.size()-1; i>=0; i--) { // iterate backwards so removal is safe
        double p_x = l * d2[i] / phi_c;
        if(p_x >= rnd.nextDouble()) {
            C.add(X.get(i));
            X.remove(i);
        }
    }
}
// in the end we have C with all the centroid candidates
return C;
Step 4:
A simple algorithm for that is to create a vector w of size equal to the number of elements in C, and initialize all its values with 0. Now iterate over X (the elements not selected as centroids), and for each x_i ∈ X find the index j of the closest centroid (element from C) and increment w[j] by 1. In the end you will have the vector w computed properly.
double[] w = new double[C.size()]; // by default all entries are zero
for(int i=0; i<X.size(); i++) {
    double min = norm2(X.get(i), C.get(0));
    int index = 0;
    for(int j=1; j<C.size(); j++) {
        if(min > norm2(X.get(i), C.get(j))) {
            min = norm2(X.get(i), C.get(j));
            index = j;
        }
    }
    // we found the index of the closest centroid, so we increment the corresponding weight
    w[index]++;
}
Step 5:
Considering the weights w computed at the previous step, you follow the kmeans++ algorithm to select only k points as the starting centroids. Thus, you will execute k for loops, at each loop selecting a single element, drawn randomly with the probability of each element being p(i) = w(i) / ∑_{j=1}^{m} w_j. At each step you select one element and remove it from the candidates, also removing its corresponding weight (a sketch of one such draw is given after the loop skeleton below).
for(int k=0; k<K; k++) {
// select one centroid from candidates, randomly,
// weighted by w
// see kmeans++ and your first idea (which is wrong for step 3)
...
}
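For illustration, here is a minimal sketch of one such weighted draw that could go inside the loop body above (assumed names for this sketch only, not from the paper: the weights from step 4 copied into a List<Double> called weights, parallel to the candidate list C; finalCentroids collects the k selected points; rnd is the Random instance from step 3):

// one weighted draw: pick candidate i with probability weights.get(i) / total,
// then remove it and its weight from the candidate pool
double total = 0;
for (double wi : weights) total += wi;
double r = rnd.nextDouble() * total;   // uniform in [0, total)
int chosen = weights.size() - 1;       // fallback for floating-point round-off
double acc = 0;
for (int i = 0; i < weights.size(); i++) {
    acc += weights.get(i);
    if (r < acc) { chosen = i; break; }
}
finalCentroids.add(C.get(chosen));
C.remove(chosen);
weights.remove(chosen);

Repeating this k times, each time on the reduced candidate set, gives exactly the kmeans++-style weighted selection described above.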
All the previous steps continue, as in the case of kmeans++, with the normal flow of the clustering algorithm.
I hope it is clearer now.
[Later, later edit]
I also found a presentation made by the authors, where you can clearly see that at each iteration multiple points might be selected. The presentation is here.
[Later edit @pera's issue]
It is obvious that log(ψ) depends on the data, and the issue you raised would be a real problem if the algorithm were executed on a single host/machine/computer. However, you have to note that this variant of kmeans clustering is dedicated to large problems and to running on distributed systems. Even more, the authors, in the paragraphs just above the algorithm description, state the following:
Notice that the size of C is significantly smaller than the input size; the reclustering can therefore be done quickly. For instance, in MapReduce, since the number of centers is small they can all be assigned to a single machine and any provable approximation algorithm (such as k-means++) can be used to cluster the points to obtain k centers.
A MapReduce implementation of Algorithm 2 is discussed in Section 3.5.
While our algorithm is very simple and lends itself to a natural parallel implementation (in log(ψ) rounds), the challenging part is to show that it has provable guarantees.
Another thing to note is the following remark on the same page, which states:
In practice, our experimental results in Section 5 show that only a few rounds are enough to reach a good solution.
Which means you could run the algorithm not log(ψ) times, but for a given constant number of rounds.