Dimensionality reduction (SVD or PCA) on a large, sparse matrix


/ edit: Additional follow-up: you can now use irlba::prcomp_irlba
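For example (a minimal sketch, assuming a reasonably current version of irlba and the sparse matrix M defined below):

library(irlba)
# Approximate first 5 principal components of the sparse matrix M,
# with centering handled internally so M is never densified.
pc <- prcomp_irlba(M, n = 5, center = TRUE, scale. = FALSE)$x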


/ edit: following up on my own post: irlba now has "center" and "scale" arguments, which let you use it to compute principal components, e.g.:

pc <- M %*% irlba(M, nv=5, nu=0, center=colMeans(M), right_only=TRUE)$v


I have a large, sparse Matrix of features that I would like to use in a machine learning algorithm:

library(Matrix)
set.seed(42)
rows <- 500000
cols <- 10000
i <- unlist(lapply(1:rows, function(i) rep(i, sample(1:5,1))))
j <- sample(1:cols, length(i), replace=TRUE)
M <- sparseMatrix(i, j)

Since this matrix has many columns, I would like to reduce its dimensionality to something more manageable. I can use the excellent irlba package to perform SVD and return the first n principal components (5 shown here; I'll probably use 100 or 500 on my real dataset):

library(irlba)
pc <- irlba(M, nu=5)$u

However, I've read that before performing PCA one should center the matrix (subtract the column mean from each column). This is very hard to do on my dataset and, moreover, it would destroy the sparsity of the matrix.
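For example, here's a toy sketch (with a much smaller, hypothetical matrix than mine) showing how centering fills in the zeros:

library(Matrix)
# Subtracting column means from a sparse matrix turns almost every
# stored zero into a nonzero value.
small <- rsparsematrix(1000, 100, density = 0.01)
centered <- sweep(as.matrix(small), 2, colMeans(small))  # must densify first
c(sparse = length(small@x), dense = sum(centered != 0))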

How "bad" is it to perform SVD on the unscaled data and feed it directly into a machine learning algorithm? Are there any efficient ways to scale this data while preserving the sparsity of the matrix?


/ edit: As B_Miner brought to my attention, the "PCs" should really be:

pc <- M %*% irlba(M, nv=5, nu=0)$v 

Furthermore, I think whuber's answer should be fairly easy to implement via the crossprod function, which is extremely fast on sparse matrices:

system.time(M_Mt <- crossprod(M)) # 0.463 seconds
system.time(means <- colMeans(M)) #0.003 seconds

Now I'm not quite sure what to do with the means vector before subtracting it from M_Mt, but I'll post as soon as I figure it out.


/ edit3: Here's a modified version of whuber's code, using sparse matrix operations for every step of the process. If you can store the entire sparse matrix in memory, it runs very quickly:

library('Matrix')
library('irlba')
set.seed(42)
m <- 500000
n <- 100
i <- unlist(lapply(1:m, function(i) rep(i, sample(25:50,1))))
j <- sample(1:n, length(i), replace=TRUE)
x <- sparseMatrix(i, j, x=runif(length(i)))

n_comp <- 50
system.time({
  xt.x <- crossprod(x)
  x.means <- colMeans(x)
  xt.x <- (xt.x - m * tcrossprod(x.means)) / (m-1)
  svd.0 <- irlba(xt.x, nu=0, nv=n_comp, tol=1e-10)
})
#user  system elapsed 
#0.148   0.030   2.923 

system.time(pca <- prcomp(x, center=TRUE))
#user  system elapsed 
#32.178   2.702  12.322

max(abs(pca$center - x.means))
max(abs(xt.x - cov(as.matrix(x))))
max(abs(abs(svd.0$v / pca$rotation[,1:n_comp]) - 1))

If you set the number of columns to 10,000 and the number of principal components to 25, the irlba-based PCA takes about 17 minutes to calculate 50 approximate principal components and consumes about 6 GB of RAM, which isn't too bad.
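If it helps, here's a rough sketch (untested on the full dataset) of how I think the centered principal component scores could be recovered from svd.0 and x.means above without densifying x:

# Scores of the centered data are (x - 1 m')V = xV - 1 (m'V), so the
# centering is just a rank-one correction applied after the sparse product.
scores <- as.matrix(x %*% svd.0$v)
scores <- sweep(scores, 2, drop(crossprod(x.means, svd.0$v)), "-")
dim(scores)  # m rows, n_comp columns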

Zach
Zach, curious if you ever figured this out.
B_Miner
@B_Miner: Basically, I've been doing SVD without bothering to center or scale first, because I've never found a good way to do this without converting my sparse matrix to a dense matrix. The original matrix %*% the svd's V component gives the "principal components." Sometimes, I get better results if I "fold in" the singular values, e.g. v %*% diag(d), where d is the vector of singular values from the SVD.
Zach
Do you treat v %*% diag(d) by itself or still multiplied by the original matrix X (i.e. X %*% v %*% diag(d))? It seems above you are using the u matrix as the principal component scores?
B_Miner
I use X %*% v %*% diag(d, ncol=length(d)). The v matrix in the svd is equivalent to the "rotation" element of a prcomp object, and X %*% v or X %*% v %*% diag(d, ncol=length(d)) represents the x element of a prcomp object. Take a look at stats:::prcomp.default.
Zach
Yes, X %*% v is the x element from prcomp. It looks like when you use the u matrix as in your question, you are actually using X %*% v %*% diag(1/d).
B_Miner
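A quick toy check of these relationships (a sketch with a small, hypothetical dense matrix):

# Verify: X %*% v equals u %*% diag(d), matches prcomp's scores when no
# centering is done, and u = X %*% v %*% diag(1/d).
set.seed(1)
X <- matrix(rnorm(200), 20, 10)
s <- svd(X)
p <- prcomp(X, center = FALSE)                 # plain SVD, no centering
max(abs(X %*% s$v - s$u %*% diag(s$d)))        # ~0
max(abs(abs(X %*% s$v) - abs(p$x)))            # ~0 (up to column signs)
max(abs(s$u - X %*% s$v %*% diag(1/s$d)))      # ~0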

Answers:


First of all, you really do want to center the data. If not, the geometric interpretation of PCA shows that the first principal component will be close to the vector of means and all subsequent PCs will be orthogonal to it, which will prevent them from approximating any PCs that happen to be close to that first vector. We can hope that most of the later PCs will be approximately correct, but the value of that is questionable when it's likely the first several PCs--the most important ones--will be quite wrong.
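Here is a quick illustration of that effect with a small, hypothetical matrix (not the poster's data):

# Without centering, the first right singular vector is nearly parallel
# to the vector of column means.
set.seed(1)
X <- matrix(rnorm(5000, mean = 3), 500, 10)    # columns sit far from the origin
v1 <- svd(X)$v[, 1]
m  <- colMeans(X)
abs(sum(v1 * m)) / sqrt(sum(v1^2) * sum(m^2))  # |cosine|; very close to 1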

So, what to do? PCA proceeds by means of a singular value decomposition of the matrix $X$. The essential information will be contained in $X'X$, which in this case is a 10000 by 10000 matrix: that may be manageable. Its computation involves about 50 million dot products of one column with another.

Consider, then, any two columns $Y$ and $Z$ (each of them a 500000-vector; let this dimension be $n$). Let their means be $m_Y$ and $m_Z$, respectively. What you want to compute is, writing $\mathbf{1}$ for the $n$-vector of 1's,

$$(Y - m_Y\mathbf{1})'(Z - m_Z\mathbf{1}) = Y'Z - m_Z\mathbf{1}'Y - m_Y\mathbf{1}'Z + m_Z m_Y\mathbf{1}'\mathbf{1} = Y'Z - n(m_Y m_Z),$$

because $m_Y = \mathbf{1}'Y/n$ and $m_Z = \mathbf{1}'Z/n$.
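A quick numerical sanity check of this identity (a toy sketch with two random vectors):

# Numerical check of the identity above with two random vectors.
set.seed(17)
n <- 1000
Y <- rnorm(n); Z <- rnorm(n)
lhs <- sum((Y - mean(Y)) * (Z - mean(Z)))
rhs <- sum(Y * Z) - n * mean(Y) * mean(Z)
all.equal(lhs, rhs)   # TRUE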

This allows you to use sparse matrix techniques to compute $X'X$, whose entries provide the values of $Y'Z$, and then adjust its coefficients based on the 10000 column means. The adjustment shouldn't hurt, because it seems unlikely that $X'X$ will be very sparse.


Example

The following R code demonstrates this approach. It uses a stub, get.col, which in practice might read one column of X at a time from an external data source, thereby reducing the amount of RAM required (at some cost in computation speed, of course). It computes PCA in two ways: via SVD applied to the preceding construction and directly using prcomp. It then compares the output of the two approaches. The computation time is about 50 seconds for 100 columns and scales approximately quadratically: be prepared to wait when performing SVD on a 10K by 10K matrix!

m <- 500000 # Will be 500,000
n <- 100    # will be 10,000
library("Matrix")
x <- as(matrix(pmax(0,rnorm(m*n, mean=-2)), nrow=m), "sparseMatrix")
#
# Compute centered version of x'x by having at most two columns
# of x in memory at any time.
#
get.col <- function(i) x[,i] # Emulates reading a column
system.time({
  xt.x <- matrix(numeric(), n, n)
  x.means <- rep(numeric(), n)
  for (i in 1:n) {
    i.col <- get.col(i)
    x.means[i] <- mean(i.col)
    xt.x[i,i] <- sum(i.col * i.col)
    if (i < n) {
      for (j in (i+1):n) {
        j.col <- get.col(j)
        xt.x[i,j] <- xt.x[j,i] <- sum(j.col * i.col)
      }    
    }
  }
  xt.x <- (xt.x - m * outer(x.means, x.means, `*`)) / (m-1)
  svd.0 <- svd(xt.x / m)
}
)
system.time(pca <- prcomp(x, center=TRUE))
#
# Checks: all should be essentially zero.
#
max(abs(pca$center - x.means))
max(abs(xt.x - cov(as.matrix(x))))
max(abs(abs(svd.0$v / pca$rotation) - 1)) # (This is an unstable calculation.)
whuber
Thank you for the detailed answer. One of the advantages of irlba is that you can specify nu to limit the algorithm to the first n principal components, which greatly increases its efficiency and (I think) bypasses the calculation of the $X'X$ matrix.
Zach
But what do you want to work with? A sparse 10000 by 500000 matrix with $5\times 10^9$ coefficients that does not represent the problem you need to solve, or a 10000 by 10000 matrix with $10^8$ coefficients that does represent the problem you want to solve? irlba can be applied to the latter to obtain just the first few principal components, anyway, so you get the best of both worlds.
whuber
I suppose the latter. =). So I need to calculate the dot product for each pair of columns in my sparse matrix, subtract the colMeans of the sparse matrix from the dot product matrix, then run irlba on the result?
Zach
Almost: notice you're subtracting products of column means, not the column means themselves. Your formulation of the algorithm otherwise is excellent, because although abstractly you're computing XX, you don't really want R to create X in order to do the matrix multiplication. Instead, if RAM is really limited, you can perform the column dot products in batches by reading in subsets of the columns at a time. It would be wise to experiment with much smaller matrices at first :-).
whuber
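For instance, the batching might look roughly like this (a sketch reusing get.col, m, and n from the answer; the block size is arbitrary):

# Rough sketch: accumulate x'x block by block so that only two small
# groups of columns are in memory at once.
block <- 10
blocks <- split(1:n, ceiling((1:n) / block))
xt.x <- matrix(0, n, n)
x.means <- numeric(n)
for (a in blocks) {
  A <- sapply(a, get.col)                     # read one block of columns
  x.means[a] <- colMeans(A)
  for (b in blocks) {
    if (min(b) < min(a)) next                 # fill the upper triangle only
    B <- if (identical(a, b)) A else sapply(b, get.col)
    xt.x[a, b] <- crossprod(A, B)
    xt.x[b, a] <- t(xt.x[a, b])
  }
}
xt.x <- (xt.x - m * outer(x.means, x.means)) / (m - 1)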
I added code to illustrate.
whuber