Calcular probabilidade logarítmica "à mão" para regressão não-linearizada dos mínimos quadrados generalizada (nlme)

12

Estou tentando calcular a probabilidade de log para uma regressão de mínimos quadrados não linear generalizada para a função f(x)=β1(1+xβ2)β3otimizado pelagnlsfunção no pacote Rnlme, usando a matriz de covariância de variância gerada pelas distâncias em uma árvore filogenética assumindo movimento browniano (corBrownian(phy=tree)daapeembalagem). O seguinte código R reprodutível se ajusta ao modelo gnls usando dados x, y e uma árvore aleatória com 9 táxons:

require(ape)
require(nlme)
require(expm)
tree <- rtree(9)
x <- c(0,14.51,32.9,44.41,86.18,136.28,178.21,262.3,521.94)
y <- c(100,93.69,82.09,62.24,32.71,48.4,35.98,15.73,9.71)
data <- data.frame(x,y,row.names=tree$tip.label)
model <- y~beta1/((1+(x/beta2))^beta3)
f=function(beta,x) beta[1]/((1+(x/beta[2]))^beta[3])
start <- c(beta1=103.651004,beta2=119.55067,beta3=1.370105)
correlation <- corBrownian(phy=tree)
fit <- gnls(model=model,data=data,start=start,correlation=correlation)
logLik(fit) 

Eu gostaria de calcular a probabilidade do log "à mão" (em R, mas sem o uso da logLikfunção) com base nos parâmetros estimados obtidos gnlspara que ele corresponda à saída de logLik(fit). NOTA: Não estou tentando estimar parâmetros; Eu só quero calcular a probabilidade de log dos parâmetros estimados pela gnlsfunção (embora se alguém tiver um exemplo reproduzível de como estimar parâmetros sem gnls, eu ficaria muito interessado em vê-lo!).

Não tenho muita certeza de como fazer isso em R. A notação de álgebra linear descrita em Modelos de efeitos mistos em S e S-Plus (Pinheiro e Bates) está muito acima da minha cabeça e nenhuma das minhas tentativas coincidiu logLik(fit). Aqui estão os detalhes descritos por Pinheiro e Bates:

O log-probabilidade para o não linear dos mínimos quadrados modelo generalizado onde φ i = A i β é calculado como se segue:yi=fi(ϕi,vi)+ϵiϕi=Aiβ

l(β,σ2,δ|y)=12{Nlog(2πσ2)+i=1M[||yifi(β)||2σ2+log|Λi|]}

onde é o número de observações e f i ( β ) = f i ( ϕ i , v i ) .Nfi(β)=fi(ϕi,vi)

é positivo-definido, y i = Λ - T / 2 i y i e f i ( ϕ i , v i ) = Λ - T / 2 i f i ( ϕ i , v i )Λiyi=ΛiT/2yifi(ϕi,vi)=ΛiT/2fi(ϕi,vi)

Para e λ fixos , o estimador de ML de σ 2 éβλσ2

σ^(β,λ)=i=1M||yifi(β)||2/N

e a probabilidade de log com perfil é

l(β,λ|y)=12{N[log(2π/N)+1]+log(i=1M||yifi(β)||2)+i=1Mlog|Λi|}

que é usado com um algoritmo de Gauss-Seidel para encontrar as estimativas de ML de e λ . Uma estimativa menos tendenciosa de σ 2 é usada:βλσ2

σ2=i=1M||Λ^iT/2[yifi(β^)]||2/(Np)

pβ

Eu compilei uma lista de perguntas específicas que estou enfrentando:

  1. Λibig_lambda <- vcv.phylo(tree)apeλ, or something else entirely?
  2. Would σ2 be fit$sigma^2, or the equation for the less biased estimate (the last equation in this post)?
  3. Is it necessary to use λ to calculate log-likelihood, or is that just an intermediate step for parameter estimation? Also, how is λ used? Is it a single value or a vector, and is it multiplied by all of Λi or just off-diagonal elements, etc.?
  4. What is ||yf(β)||? Would that be norm(y-f(fit$coefficients,x),"F") in the package Matrix? If so, I'm confused about how to calculate the sum i=1M||yifi(β)||2, because norm() returns a single value, not a vector.
  5. How does one calculate log|Λi|? Is it log(diag(abs(big_lambda))) where big_lambda is Λi, or is it logm(abs(big_lambda)) from the package expm? If it is logm(), how does one take the sum of a matrix (or is it implied that it is just the diagonal elements)?
  6. Just to confirm, is ΛiT/2 calculated like this: t(solve(sqrtm(big_lambda)))?
  7. How are yi and fi(β) calculated? Is it either of the following:

y_star <- t(solve(sqrtm(big_lambda))) %*% y

and

f_star <- t(solve(sqrtm(big_lambda))) %*% f(fit$coefficients,x)

or would it be

y_star <- t(solve(sqrtm(big_lambda))) * y

and

f_star <- t(solve(sqrtm(big_lambda))) * f(fit$coefficients,x) ?

If all of these questions are answered, in theory, I think the log-likelihood should be calculable to match the output from logLik(fit). Any help on any of these questions would be greatly appreciated. If anything needs clarification, please let me know. Thanks!

UPDATE: I have been experimenting with various possibilities for the calculation of the log-likelihood, and here is the best I have come up with so far. logLik_calc is consistently about 1 to 3 off from the value returned by logLik(fit). Either I'm close to the actual solution, or this is purely by coincidence. Any thoughts?

  C <- vcv.phylo(tree) # variance-covariance matrix
  tC <- t(solve(sqrtm(C))) # C^(-T/2)
  log_C <- log(diag(abs(C))) # log|C|
  N <- length(y)
  y_star <- tC%*%y 
  f_star <- tC%*%f(fit$coefficients,x)
  dif <- y_star-f_star  
  sigma_squared <-  sum(abs(y_star-f_star)^2)/N
  # using fit$sigma^2 also produces a slightly different answer than logLik(fit)
  logLik_calc <- -((N*log(2*pi*(sigma_squared)))+
       sum(((abs(dif)^2)/(sigma_squared))+log_C))/2
Eric
fonte
your definition of the function f(x) is missing an x on the right hand side.
Glen_b -Reinstate Monica

Respostas:

10

Let's start with the simpler case where there is no correlation structure for the residuals:

fit <- gnls(model=model,data=data,start=start)
logLik(fit)

The log likelihood can then be easily computed by hand with:

N <- fit$dims$N
p <- fit$dims$p
sigma <- fit$sigma * sqrt((N-p)/N)
sum(dnorm(y, mean=fitted(fit), sd=sigma, log=TRUE))

Since the residuals are independent, we can just use dnorm(..., log=TRUE) to get the individual log likelihood terms (and then sum them up). Alternatively, we could use:

sum(dnorm(resid(fit), mean=0, sd=sigma, log=TRUE))

Note that fit$sigma is not the "less biased estimate of σ2" -- so we need to make the correction manually first.

Now for the more complicated case where the residuals are correlated:

fit <- gnls(model=model,data=data,start=start,correlation=correlation)
logLik(fit)

Here, we need to use the multivariate normal distribution. I am sure there is a function for this somewhere, but let's just do this by hand:

N <- fit$dims$N
p <- fit$dims$p
yhat <- cbind(fitted(fit))
R <- vcv(tree, cor=TRUE)
sigma <- fit$sigma * sqrt((N-p)/N)
S <- diag(sigma, nrow=nrow(R)) %*% R %*% diag(sigma, nrow=nrow(R))
-1/2 * log(det(S)) - 1/2 * t(y - yhat) %*% solve(S) %*% (y - yhat) - N/2 * log(2*pi)
Wolfgang
fonte
The log-likelihood for the uncorrelated residuals worked perfectly, however I can't figure out the multivariate normal distribution. In this case, what is S? I tried S <- vcv.phylo(tree) and got approximately -700 for the log-likelihood, whereas logLik(fit) was approximately -33.
Eric
Sorry - I messed up when I copy-pasted the code. Now it is complete. S is the variance-covariance matrix of the residuals. You were on the right track (with the vcv function) -- but you need to get the correlation matrix and then use σ^2 to turn this into the var-cov matrix.
Wolfgang