Thursday, February 25, 2016

RSpectra::svds function now default method for sinkr::dineof

The summary blog post describing the RSpectra (formerly rARPACK) package made a convincing case for the improved decomposition speed of the "svds" function for partial SVD (Singular Value Decomposition) over several other R packages. Until now, the sinkr package has relied on irlba for it's dineof function (Data Interpolating Empirical Orthogonal Functions). Since the routine can be quite computationally intensive, I wanted to test the performance of svds as an alternative method.

In a simple example that performs SVD on a field of sea level pressure of the equatorial Pacific, svds outperforms irlba both in speed and correlation of singular vector (e.g. $u) to the output of the base svd function. One can see in the following graph, that trailing vectors break down in their correlation while svds maintains nearly perfect correlation. Interestingly, this artifact is removed by first centering the data field.


While the effect looks dramatic above, it should be noted that the trailing vectors usually carry only a small fraction of information, and thus contribute only marginally to errors in field reconstruction. Below is a figure showing the reconstruction error of svd, svds, and irlba with increasing levels of truncation.


Finally, both methods were compared in their performance within dineof. With the non-centered field both approaches arrive to a similar RMS, but svds converges with less iterations and EOFs than irlba. With the centered data both methods produce nearly identical results.


So, even though the differences are small, the rSpectra::svds method will now be the default method for both the eof and dineof functions within sinkr. For the moment, the irlba method is maintained for compatibility with previous versions.

Code to reproduce:


# packages ----------------------------------------------------------------
library(RSpectra)
library(irlba)
library(sinkr)
library(plyr)
 
 
# data --------------------------------------------------------------------
# full field
data(slp)
F <- slp$field
Fs <- scale(F, center = TRUE, scale = FALSE)
 
# gappy field
gaps <- sample(length(F), length(F)*0.5)
Fg <- replace(F, gaps, NaN)
Fsg <- replace(Fs, gaps, NaN)
 
k <- 20
df <- expand.grid(nu=seq(k), method=c("svd", "svds", "irlba"))
df$cor_w_svd <- NaN
df$recon_rms <- NaN
head(df)
 
k <- 30
system.time(F.svd <- svd(F))
system.time(F.svds <- svds(F, k=k))
system.time(F.irlba <- irlba(F, nv=k, nu=k))
system.time(Fs.svd <- svd(Fs))
system.time(Fs.svds <- svds(Fs, k=k))
system.time(Fs.irlba <- irlba(Fs, nv=k, nu=k))
 
 
 
png("corSVD.png", width=7, height=3.5, units="in", type="cairo", res=400, family="Arial")
op <- par(mfrow=c(1,2), mar=c(3,3,2,1), mgp=c(2,0.5,0), tcl=-0.25, ps=10)
# Un-scaled
plot(abs(diag(cor(F.svd$u[,seq(k)], F.svds$u[,seq(k)]))),
  t="l", ylim=c(0,1),
  ylab="Correlation to svd [abs(R)]", xlab="Singular vectors [u]",
  main="Un-centered field [F]"
)
lines(abs(diag(cor(F.svd$u[,seq(k)], F.irlba$u[,seq(k)]))),
  t="l", col=2, lty=2
)
# Scaled
plot(abs(diag(cor(Fs.svd$u[,seq(k)], Fs.svds$u[,seq(k)]))),
     t="l", ylim=c(0,1),
     ylab="Correlation to svd [abs(R)]", xlab="Singular vectors [u]",
     main="Centered field [Fs]"
)
lines(abs(diag(cor(Fs.svd$u[,seq(k)], Fs.irlba$u[,seq(k)]))),
      t="l", col=2, lty=2
)
# legend
legend("bottomright", legend=c("svds", "irlba"), col=1:2, lty=1:2, bty="n")
par(op)
dev.off()
 
 
 
obj.svd<- c("F.svd", "F.svds", "F.irlba")
ref.svd <- "F.svd"
ref.field <- "F"
df <- expand.grid(nu=seq(k), obj=obj.svd, stringsAsFactors = FALSE)
df$method <- substr(df$obj, 3, 10)
df$svd_cor <- NaN
df$recon_rms <- NaN
# df <- data.frame(nu=seq(k), svd.cor=NaN, svds.cor=NaN, irlba.cor=NaN, svd.rms=NaN, svds.rms=NaN, irlba.rms=NaN)
for(i in seq(nrow(df))){
  OBJ <- get(df$obj[i])
  NU <- df$nu[i]
  df$svd_cor[i] <- abs(cor(get(ref.svd)$u[,df$nu[i]], OBJ$u[,NU]))
  RECON <- (OBJ$u[,seq(NU)]) %*% diag(OBJ$d[seq(NU)],NU,NU) %*% t(OBJ$v[,seq(NU)])
  df$recon_rms[i] <- sqrt(mean((get(ref.field) - RECON)^2, na.rm=TRUE))
}
df
 
 
png("summary_rms.png", width=3.5, height=3.5, units="in", type="cairo", res=400, family="Arial")
op <- par(mar=c(3,3,3,1), mgp=c(2,0.5,0), tcl=-0.25, ps=10)
# RMS with F
df$method <- factor(df$method)
levs <- levels(df$method)
col <- addAlpha(jetPal(length(levs)),0.5)
lty <- seq(length(levs))
lwd <- rep(2,length(levs))
for(i in seq(length(levs))){
  if(i == 1){
    plot(recon_rms~nu, df, t="n", log="y", 
     ylab="RMSE", xlab="Cumulative singular vectors [u]",
     main="Reconstruction error\nof un-centered field [F]")
  }
  lines(recon_rms~nu, df, subset=c(method==levs[i]),
    col=col[i], lty=lty[i], lwd=lwd[i]
  )
  legend("topright", legend=levs, col=col, lty=lty, lwd=lwd, bty="n")
}
par(op)
dev.off()
 
 
delta.rms <- 1e-3
set.seed(1); system.time(Fg.din.irlba <- dineof(Fg, method = "irlba", delta.rms = delta.rms))
set.seed(1); system.time(Fg.din.svds <- dineof(Fg, method = "svds", delta.rms = delta.rms))
set.seed(1); system.time(Fsg.din.irlba <- dineof(Fsg, method = "irlba", delta.rms = delta.rms))
set.seed(1); system.time(Fsg.din.svds <- dineof(Fsg, method = "svds", delta.rms = delta.rms))
 
png("dineofRMS.png", width=7, height=3.5, units="in", type="cairo", res=400, family="Arial")
op <- par(mfrow=c(1,2), mar=c(3,3,3,3), mgp=c(2,0.5,0), tcl=-0.25, ps=10)
# Un-scaled
plot(Fg.din.svds$RMS, log="y", t="l",
     xlim=range(seq(Fg.din.irlba$RMS), seq(Fg.din.svds$RMS)),
     ylim=range(Fg.din.irlba$RMS, Fg.din.svds$RMS),
     ylab="RMSE", xlab="Iteration (n)",
     main="DINEOF performance on\nun-centered gappy field [Fg]"
)
lines(Fg.din.irlba$RMS,
      t="l", col=2, lty=2
)
par(new=TRUE)
plot(Fg.din.svds$NEOF, log="", t="l",
     xlim=range(seq(Fg.din.irlba$NEOF), seq(Fg.din.svds$NEOF)),
     ylim=range(Fg.din.irlba$NEOF, Fg.din.svds$NEOF),
     axes=FALSE, xlab="", ylab=""
)
lines(Fg.din.irlba$NEOF,
      t="l", col=2, lty=2
)
axis(4); mtext("EOFs [n]", side=4, line=2)
legend("top", legend=c("svds", "irlba"), lty=1:2, col=1:2, bty="n")
 
# Scaled
plot(Fsg.din.svds$RMS, log="y", t="l",
     xlim=range(seq(Fsg.din.irlba$RMS), seq(Fsg.din.svds$RMS)),
     ylim=range(Fsg.din.irlba$RMS, Fsg.din.svds$RMS),
     ylab="RMSE", xlab="Iteration (n)",
     main="DINEOF performance on\ncentered gappy field [Fsg]"
)
lines(Fsg.din.irlba$RMS,
      t="l", col=2, lty=2
)
par(new=TRUE)
plot(Fsg.din.svds$NEOF, log="", t="l",
     xlim=range(seq(Fsg.din.irlba$NEOF), seq(Fsg.din.svds$NEOF)),
     ylim=range(Fsg.din.irlba$NEOF, Fsg.din.svds$NEOF),
     axes=FALSE, xlab="", ylab=""
)
lines(Fsg.din.irlba$NEOF,
      t="l", col=2, lty=2
)
axis(4); mtext("EOFs [n]", side=4, line=2)
legend("top", legend=c("svds", "irlba"), lty=1:2, col=1:2, bty="n")
 
par(op)
dev.off()
Created by Pretty R at inside-R.org


3 comments:

  1. Hi Marc,
    Just to let you know that I submitted a new package RSpectra (https://cran.r-project.org/web/packages/RSpectra/index.html) to CRAN,
    which is essentially the same as rARPACK but with the new name to avoid confusion (since rARPACK no longer relies on ARPACK). Would you please also change that in your package?

    Thank you so much!


    Best,
    Yixuan

    ReplyDelete