data points: (7,1), (3,4), (1,5), (5,8), (1,3), (7,8), (8,2), (5,9), (8,0)
l = 2 // oversampling factor
k = 3 // no. of desired clusters
Step 1:
Suppose the first centroid is C = {c1} = {(8,0)}. Then X = {x1, x2, x3, x4, x5, x6, x7, x8} = {(7,1), (3,4), (1,5), (5,8), (1,3), (7,8), (8,2), (5,9)}.
Step 2:
φ_X(C) is the sum of all the smallest 2-norm distances (Euclidean distance) from all the points of the set X to the points of C. In other words, for each point in X find the distance to the closest point in C, and in the end compute the sum of all these minimum distances, one for each point of X.
Denote with d²_C(x_i) the distance from x_i to the closest point in C. We then have ψ = ∑_{i=1}^{n} d²_C(x_i).
At step 2, C contains a single element (see step 1) and X is the set of all the elements. Thus, at this step, d²_C(x_i) is simply the distance between the point in C and x_i. Thus φ = ∑_{i=1}^{n} ||x_i − c||².
ψ = ∑_{i=1}^{n} d²(x_i, c1) = 1.41 + 6.4 + 8.6 + 8.54 + 7.61 + 8.06 + 2 + 9.4 = 52.128
log(ψ) = log(52.128) = 3.95 ≈ 4 (rounded)
Note, however, that in step 3 the general formula is applied, since C will then contain more than one point.
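To make these numbers easy to reproduce, here is a minimal self-contained sketch in plain Java (the class and helper names are made up for this illustration; they are not from the paper):

// minimal sketch reproducing the psi and log(psi) values above
public class PsiExample {
    // Euclidean (2-norm) distance between two 2D points
    static double norm2(double[] a, double[] b) {
        double dx = a[0] - b[0], dy = a[1] - b[1];
        return Math.sqrt(dx * dx + dy * dy);
    }
    public static void main(String[] args) {
        // X from step 1 (all points except the chosen centroid c1 = (8,0))
        double[][] X = {{7,1},{3,4},{1,5},{5,8},{1,3},{7,8},{8,2},{5,9}};
        double[] c1 = {8, 0};
        double psi = 0;
        for (double[] x : X) psi += norm2(x, c1); // only one centroid so far
        System.out.println(psi);           // ~52.128
        System.out.println(Math.log(psi)); // ~3.95 -> 4 rounds (rounded)
    }
}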
Step 3:
The for loop is executed for the previously computed log(ψ) (4 iterations in this example).
The draws are not as you understood them. The draws are independent, which means you will execute one draw for each point in X. So, for each point in X, denoted as x_i, compute a probability p_x = l · d²(x, C) / φ_X(C). Here l is a factor given as a parameter, d²(x, C) is the distance to the closest center, and φ_X(C) is the quantity explained in step 2.
The algorithm is simply:
- iterate over all points x_i in X
- for each x_i compute the probability p_{x_i}
- draw a uniform random number in [0, 1]; if it is smaller than p_{x_i}, select x_i to form C′
- add all the points selected in C′ to C
Note that each time step 3 is executed within an iteration (line 3 of the original algorithm), you expect to select l points from X (this is easily shown by writing out the formula for the expectation directly).
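Writing that expectation out explicitly (using the notation from step 2): since the draws are independent, the expected number of points selected in one round is E = ∑_{i=1}^{n} p_{x_i} = ∑_{i=1}^{n} l · d²(x_i, C) / φ_X(C) = l · φ_X(C) / φ_X(C) = l.

In code, step 3 then looks like this: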
Random rnd = new Random();
for(int round=0; round<4; round++) { // log(psi) = 4 rounds
    // compute d2, the distance from each x_i to its closest centroid in C
    double[] d2 = new double[X.size()];
    for(int i=0; i<X.size(); i++) {
        double min = Double.POSITIVE_INFINITY;
        for(int j=0; j<C.size(); j++) {
            if(min > norm2(X.get(i), C.get(j))) min = norm2(X.get(i), C.get(j));
        }
        d2[i] = min;
    }
    // compute phi_X(C), the sum of all the minimum distances
    double phi_c = 0;
    for(int i=0; i<X.size(); i++) phi_c += d2[i];
    // do the drawings: one independent draw per point
    for(int i=X.size()-1; i>=0; i--) { // iterate backwards so removal is safe
        double p_x = l * d2[i] / phi_c;
        if(p_x >= rnd.nextDouble()) {
            C.add(X.get(i));
            X.remove(i);
        }
    }
}
// in the end we have C with all the centroid candidates
return C;
Step 4:
A simple algorithm for that is to create a vector w of size equal to the number of elements in C, and initialize all its values with 0. Now iterate over X (the elements not selected as centroids), and for each x_i ∈ X find the index j of the closest centroid (element from C) and increment w[j] by 1. In the end you will have the vector w computed properly.
double[] w = new double[C.size()]; // by default all entries are zero
for(int i=0; i<X.size(); i++) {
    double min = norm2(X.get(i), C.get(0));
    int index = 0;
    for(int j=1; j<C.size(); j++) {
        if(min > norm2(X.get(i), C.get(j))) {
            min = norm2(X.get(i), C.get(j));
            index = j;
        }
    }
    // we found the index of the closest centroid, so we increment the corresponding weight
    w[index]++;
}
Step 5:
Considering the weights w computed at the previous step, you follow the kmeans++ algorithm to select only k points as the starting centroids. Thus, you will execute k for loops, at each loop selecting a single element, drawn randomly with the probability of each element being p(i) = w(i) / ∑_{j=1}^{m} w_j. At each step you select one element and remove it from the candidates, also removing its corresponding weight (a sketch of one such draw is given after the loop skeleton below).
for(int k=0; k<K; k++) {
// select one centroid from candidates, randomly,
// weighted by w
// see kmeans++ and your first idea (which is wrong for step 3)
...
}
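For illustration, here is a minimal sketch of one such weighted draw that could go inside the loop body above (assumed names for this sketch only, not from the paper: the weights from step 4 copied into a List<Double> called weights, parallel to the candidate list C; finalCentroids collects the k selected points; rnd is the Random instance from step 3):

// one weighted draw: pick candidate i with probability weights.get(i) / total,
// then remove it and its weight from the candidate pool
double total = 0;
for (double wi : weights) total += wi;
double r = rnd.nextDouble() * total;   // uniform in [0, total)
int chosen = weights.size() - 1;       // fallback for floating-point round-off
double acc = 0;
for (int i = 0; i < weights.size(); i++) {
    acc += weights.get(i);
    if (r < acc) { chosen = i; break; }
}
finalCentroids.add(C.get(chosen));
C.remove(chosen);
weights.remove(chosen);

Repeating this k times, each time on the reduced candidate set, gives exactly the kmeans++-style weighted selection described above.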
All the previous steps continue, as in the case of kmeans++, with the normal flow of the clustering algorithm.
I hope it is clearer now.
[Later, later edit]
I also found a presentation made by the authors, where you can clearly see that at each iteration multiple points might be selected. The presentation is here.
[Later edit @pera's issue]
It is obvious that log(ψ) depends on the data, and the issue you raised would be a real problem if the algorithm were executed on a single host/machine/computer. However, you have to note that this variant of kmeans clustering is dedicated to large problems and to running on distributed systems. Even more, the authors, in the paragraphs just above the algorithm description, state the following:
Notice that the size of C is significantly smaller than the input size; the reclustering can therefore be done quickly. For instance, in MapReduce, since the number of centers is small they can all be assigned to a single machine and any provable approximation algorithm (such as k-means++) can be used to cluster the points to obtain k centers.
A MapReduce implementation of Algorithm 2 is discussed in Section 3.5.
While our algorithm is very simple and lends itself to a natural parallel implementation (in log(ψ) rounds), the challenging part is to show that it has provable guarantees.
Another thing to note is the following remark on the same page, which states:
In practice, our experimental results in Section 5 show that only a few rounds are enough to reach a good solution.
Which means you could run the algorithm not log(ψ) times, but for a given constant number of rounds.