Thursday, June 30, 2011

Clarke and Ainsworth's BIOENV and BVSTEP (and BIO-BIO etc...)

Nonmetric Multidimensional Scaling (NMDS) plot of vegetation sample dissimilarities with best correlating environmental variables (left) and species (right) plotted as vectors (datasets "varespec" and "varechem" from the package vegan)

The R package "vegan" contains a version of Clarke and Ainsworth's (1993) BIOENV analysis, which allows for the comparison of distance/similarity matrices between two sets of data that share either samples or variables. The typical setup is the exploration of which environmental variables best correlate with the sample similarities of the biological community (e.g. species biomass or abundance). In this case, the similarity matrix of the community is fixed, while subsets of the environmental variables are used to calculate the environmental similarity matrix. A correlation coefficient (typically Spearman's rank correlation) is then calculated between the two matrices; the best subset of environmental variables can then be identified and further subjected to a permutation test to determine significance.
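For orientation, the core of the calculation for a single candidate subset boils down to a few lines (a minimal sketch; the subset of "Al" and "Mn" is chosen arbitrarily for illustration):

###
require(vegan)
data(varespec)
data(varechem)

#fixed matrix: Bray-Curtis dissimilarities of the (standardized) community data
fix.dist <- vegdist(wisconsin(varespec), method="bray")

#variable matrix: Euclidean distances of one normalized environmental subset
var.dist <- vegdist(scale(varechem[, c("Al", "Mn")]), method="euclidean")

#Spearman rank correlation between the two distance matrices
cor(fix.dist, var.dist, method="spearman")
###

BIOENV simply repeats this calculation over all possible subsets of the environmental variables and reports the best-correlated ones.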

This can be a very helpful analysis in the exploration of the often high-dimensional space of community samples. The method is also widely accepted by the scientific community, due to its applicability to a wide variety of data and its completely non-parametric nature - Clarke and Ainsworth's (1993) paper describing the method has 674 citations on Google Scholar at the time of this posting.

The vegan package incorporates this routine in the function bioenv(). An example of a BIOENV exploration between vegetation community data (dataset "varespec" in the vegan package) and environmental data (dataset "varechem" in the vegan package, here with nitrogen log-transformed):


###
require(vegan)
data(varespec)
data(varechem)
alt.varechem <- varechem #copy of the environmental data
alt.varechem$N <- log(alt.varechem$N) #log-transform nitrogen
res <- bioenv(wisconsin(varespec), alt.varechem)
> res

Call:
bioenv(comm = wisconsin(varespec), env = alt.varechem) 

Subset of environmental variables with best correlation to community data.

Correlations:      spearman 
Dissimilarities:   bray 

Best model has 7 parameters (max. 14 allowed):
P Mg Al Fe Mn Mo Humdepth
with correlation  0.5031087 

> summary(res)
                                                  size correlation
P                                                    1      0.2513
P Al                                                 2      0.4004
P Al Mn                                              3      0.4573
P Fe Mn Mo                                           4      0.4798
P Mg Al Mn Baresoil                                  5      0.4977
P Mg Fe Mn Mo Humdepth                               6      0.5004
P Mg Al Fe Mn Mo Humdepth                            7      0.5031
P Mg Al Fe Mn Mo Baresoil Humdepth                   8      0.5024
P Ca Mg Al Fe Mn Mo Baresoil Humdepth                9      0.4998
N P Ca Mg Al Fe Mn Mo Baresoil Humdepth             10      0.4928
N P Ca Mg Al Fe Mn Zn Mo Baresoil Humdepth          11      0.4798
N P Ca Mg Al Fe Mn Zn Mo Baresoil Humdepth pH       12      0.4566
N P Ca Mg S Al Fe Mn Zn Mo Baresoil Humdepth pH     13      0.4380
N P K Ca Mg S Al Fe Mn Zn Mo Baresoil Humdepth pH   14      0.4169
###
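The best subset can then be subjected to the permutation test mentioned above, e.g. via vegan's mantel() function (a sketch; note that because the subset was selected to maximize the correlation, such a post-hoc test tends to be optimistically biased):

###
fix.dist <- vegdist(wisconsin(varespec), method="bray")
best.sub <- c("P", "Mg", "Al", "Fe", "Mn", "Mo", "Humdepth")
var.dist <- vegdist(scale(alt.varechem[, best.sub]), method="euclidean")
mantel(fix.dist, var.dist, method="spearman", permutations=999)
###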

A limitation of the bioenv() function is that one has little control over how the variable similarity matrix is calculated (derived from the environmental subsets in the above example): the method assumes the subset data to be environmental and bases the resulting similarity matrix on normalized "euclidean" distances. This makes sense for environmental data, where normalization removes the effect of differing units between parameters, yet in cases where the variable matrix is biological, one might want more flexibility (a Bray-Curtis measure of similarity is common given its non-parametric nature). The vegan function vegdist() offers many other indices that could be applied ("manhattan", "euclidean", "canberra", "bray", "kulczynski", "jaccard", "gower", "altGower", "morisita", "horn", "mountford", "raup", "binomial" and "chao"). Furthermore, beyond the typical biological-to-environmental comparison (the BIOENV setup), the routine can be used to explore other types of relationships, e.g.:

ENVBIO: subset of biological variables that best correlate to the overall environmental pattern
BIOBIO: subset of biological variables that best correlate to the overall biological pattern
ENVENV: subset of environmental variables that best correlate to the overall environmental pattern

In the latter two examples, one can identify a smaller subset of variables that best captures the overall sample similarities to a defined level of correlation. For the ENVBIO and BIOBIO cases, a more flexible setup is desired. I have written a version of the routine called bio.env() (below) that allows this flexibility. The output of the results is similar to that of the bioenv() function, yet also includes a dataframe of the top correlated variable combinations. Performing the same analysis as above, the bio.env() output is as follows:

###
res <- bio.env(wisconsin(varespec), alt.varechem, 
fix.dist.method="bray", var.dist.method="euclidean",
scale.fix=FALSE, scale.var=TRUE)

> res 
$order.by.best
                var.incl n.var       rho
1        2,5,7,8,9,11,13     7 0.5031087
2     2,5,7,8,9,11,12,13     8 0.5024283
3          2,5,8,9,11,13     6 0.5003935
4   2,4,5,7,8,9,11,12,13     9 0.4998057
5           2,5,7,8,9,13     6 0.4983690
6             2,5,7,9,12     5 0.4976881
7  2,4,7,8,9,10,11,12,13     9 0.4969415
8      2,4,5,7,8,9,12,13     8 0.4963091
9      2,4,5,7,8,9,11,12     8 0.4962737
10      2,5,8,9,11,12,13     7 0.4961008

$order.by.i.comb
                           var.incl n.var       rho
1                                 2     1 0.2514019
2                               2,7     2 0.4003778
3                             2,7,9     3 0.4572635
4                          2,8,9,11     4 0.4798328
5                        2,5,7,9,12     5 0.4976881
6                     2,5,8,9,11,13     6 0.5003935
7                   2,5,7,8,9,11,13     7 0.5031087
8                2,5,7,8,9,11,12,13     8 0.5024283
9              2,4,5,7,8,9,11,12,13     9 0.4998057
10           1,2,4,5,7,8,9,11,12,13    10 0.4927966
11        1,2,4,5,7,8,9,10,11,12,13    11 0.4797848
12     1,2,4,5,7,8,9,10,11,12,13,14    12 0.4566453
13   1,2,4,5,6,7,8,9,10,11,12,13,14    13 0.4379532
14 1,2,3,4,5,6,7,8,9,10,11,12,13,14    14 0.4169410

$best.model.vars
[1] "P,Mg,Al,Fe,Mn,Mo,Humdepth"

$best.model.rho
[1] 0.5031087
###
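The same function covers the other setups by swapping the input matrices and distance measures; e.g. an ENVENV exploration passes the environmental data as both the fixed and the variable matrix (a sketch):

###
res.envenv <- bio.env(alt.varechem, alt.varechem,
fix.dist.method="euclidean", var.dist.method="euclidean",
scale.fix=TRUE, scale.var=TRUE)
###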

It is important to mention here that one of the reasons why a variable biological similarity matrix is less often explored with the routine is that the number of possible subset combinations becomes computationally overwhelming when the number of species/groups is large - the total number of combinations equals 2^n - 1, where n is the total number of variables. For this reason, Clarke and Warwick (1998) presented a stepwise routine (BVSTEP) for a faster exploration of the subset combinations. They specifically looked at the BIOBIO type of exploration and addressed the concept of structural redundancy in community composition through the identification of "response units", i.e. taxonomic/functional groupings of species that changed in abundance in the same way over time.
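To put this growth in perspective, a quick calculation of the number of subsets to be tested:

###
n <- c(14, 24, 44)
data.frame(n.var=n, n.combinations=2^n - 1)
#14 variables give 16383 subsets; 44 variables give ca. 1.8e13
###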

I have created another function called bv.step() (below) to do this stepwise exploration, following the description of the "forward selection/backward elimination" algorithm of Clarke and Warwick (1998). As an example of the function, a BIOBIO type of exploration of the varespec data was conducted to identify a subset of species that best describes the overall community pattern. This was done in two steps: I first allowed the algorithm to restart 50 times, randomly selecting 30% of the 44 possible species for inclusion in each stepwise exploration. From this initial information I identified some of the most important species that should always be included in the routine (i.e. inclusion of species 23, 26, and 29 correlated at rho = 0.68 with the full set), and then performed a second run of bv.step() using the option var.always.include=c(23,26,29). In this way, I was able to arrive at a good representation of the sample similarities using a much-reduced number of species. By the way, this exploration would have taken agonizingly long in bio.env(), as there are 2^44 - 1 (1.759219e+13) possible combinations to test.

###
#first round
res.bv.step.biobio <- bv.step(wisconsin(varespec), wisconsin(varespec), 
fix.dist.method="bray", var.dist.method="bray",
scale.fix=FALSE, scale.var=FALSE, 
max.rho=0.95, min.delta.rho=0.001,
random.selection=TRUE,
prop.selected.var=0.3,
num.restarts=50,
output.best=10,
var.always.include=NULL)
> res.bv.step.biobio
$order.by.best
                              var.incl n.var       rho
1       4,7,13,22,23,24,25,27,29,38,39    11 0.8299710
2     1,4,7,13,22,23,24,25,27,29,38,39    12 0.8278529
3          4,7,13,22,23,24,27,29,38,39    10 0.8266143
4             4,7,22,23,24,27,29,38,39     9 0.8211150
5             4,7,13,22,23,24,29,38,39     9 0.8179912
6          4,5,22,23,26,29,39,40,43,44    10 0.8167332
7             4,5,22,23,26,29,39,43,44     9 0.8141779
8        4,5,9,22,23,26,29,39,40,43,44    11 0.8130295
9  13,15,17,23,24,26,28,35,38,39,43,44    12 0.8127926
10    13,15,23,24,26,28,35,38,39,43,44    11 0.8127812

$order.by.i.comb
                           var.incl n.var       rho
1                                26     1 0.5186498
2                             23,26     2 0.6100455
3                          23,26,29     3 0.6843669
4                        4,23,26,29     4 0.7343000
5                      4,5,23,26,29     5 0.7729262
6                   4,5,22,23,26,29     6 0.7855065
7                4,5,22,23,26,29,43     7 0.8039606
8             4,5,22,23,26,29,39,43     8 0.8105621
9          4,7,22,23,24,27,29,38,39     9 0.8211150
10      4,7,13,22,23,24,27,29,38,39    10 0.8266143
11   4,7,13,22,23,24,25,27,29,38,39    11 0.8299710
12 1,4,7,13,22,23,24,25,27,29,38,39    12 0.8278529

$best.model.vars
[1] "Vac.myr,Des.fle,Dic.pol,Cla.arb,Cla.ran,Cla.ste,Cla.unc,Cla.cor,Cla.fim,Nep.arc,Ste.sp"

$best.model.rho
[1] 0.829971

$var.always.include
NULL

$var.exclude
[1]  8 20 21 32


#second round
res.bv.step.biobio2  <- bv.step(wisconsin(varespec), wisconsin(varespec), 
fix.dist.method="bray", var.dist.method="bray",
scale.fix=FALSE, scale.var=FALSE, 
max.rho=0.95, min.delta.rho=0.001,
random.selection=TRUE,
prop.selected.var=0.2,
num.restarts=50,
output.best=10,
var.always.include=c(23,26,29))
> res.bv.step.biobio2 
$order.by.best
                       var.incl n.var       rho
1     2,15,19,23,24,26,29,36,43     9 0.8526819
2  2,15,19,23,24,26,29,34,36,43    10 0.8509291
3        2,15,19,23,24,26,29,43     8 0.8437539
4          15,19,23,24,26,29,43     7 0.8338602
5     5,11,14,15,19,23,26,29,30     9 0.8295133
6        5,11,15,19,23,26,29,30     8 0.8277205
7       15,23,26,29,37,38,41,43     8 0.8272279
8  5,11,14,15,19,23,25,26,29,30    10 0.8240510
9     2,15,23,26,29,37,38,41,43     9 0.8228758
10          5,11,15,19,23,26,29     7 0.8206538

$order.by.i.comb
                           var.incl n.var       rho
1                                26     1 0.5186498
2                             23,26     2 0.6100455
3                          15,26,29     3 0.6873777
4                       15,23,26,29     4 0.7468244
5                    15,23,26,29,42     5 0.7968556
6                 15,19,23,24,26,29     6 0.8182366
7              15,19,23,24,26,29,43     7 0.8338602
8            2,15,19,23,24,26,29,43     8 0.8437539
9         2,15,19,23,24,26,29,36,43     9 0.8526819
10     2,15,19,23,24,26,29,34,36,43    10 0.8509291
11 11,12,14,23,26,29,30,36,37,41,42    11 0.8116066

$best.model.vars
[1] "Emp.nig,Ple.sch,Poh.nut,Cla.ran,Cla.ste,Cla.coc,Cla.fim,Cet.isl,Cla.def"

$best.model.rho
[1] 0.8526819

$var.always.include
[1] 23 26 29

$var.exclude
[1]  8 20 21 32
###


I have compared the output of my bv.step() function to that of the BVSTEP routine included in Clarke and Warwick's PRIMER software and find the results to be similar. PRIMER's BVSTEP seems to be a bit more efficient in finding the optimal correlation without needing this two-step exploration. For example, the BVSTEP routine may still arrive at a best-correlated subset containing more variables than were originally defined by the "% of variables" setting. In other words, BVSTEP allows variables not initially chosen by the randomization to enter the stepwise exploration at later iterations - how this is accomplished I'm not sure...
Another issue is the initial exclusion of parameters. The authors state that the BVSTEP algorithm contains an "initial elimination phase, removing several of the starting set, followed by a steady forward selection phase (with only occasional backward steps), adding species until the rho = 0.95 threshold is attained". My interpretation of this was a simple, single backward-removal phase in which I check whether the removal of any single variable from the full set makes a substantial change in correlation (the threshold defined by the "min.delta.rho" setting). In the above example, variables 8, 20, 21, and 32 were excluded, as their removal had little effect on the similarity distances. It is also possible that this initial elimination phase could be optimized - any suggestions on how to improve the routine are very welcome!
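For reference, the logic of this elimination phase, condensed from the bv.step() code below, is essentially a leave-one-out comparison against the full-set correlation:

###
fix.dist <- vegdist(wisconsin(varespec), method="bray")
var.mat <- wisconsin(varespec)
min.delta.rho <- 0.001

full.cor <- cor(fix.dist, vegdist(var.mat, method="bray"), method="spearman")
delta.rho <- sapply(1:ncol(var.mat), function(i){
 cor(fix.dist, vegdist(var.mat[,-i], method="bray"), method="spearman") - full.cor
})
#variables whose single removal barely changes the correlation
which(abs(delta.rho) < min.delta.rho)
###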

My whole reason for getting into this theme was that someone had asked me how one can identify the species that are most responsible for the similarity distances between samples. I'm still familiarizing myself with the excellent vegan package, so there are possibly other (more appropriate?) methods for addressing this issue (e.g. adonis - "Permutational Multivariate Analysis of Variance Using Distance Matrices"). However, given a high number of variables, bv.step() seems to be an appropriate method for reducing this dimensionality. As a final example of this, the figure at the beginning of the post was created by overlaying vectors of the best correlated environmental and biological variables as identified through the bio.env() and bv.step() routines. Vectors are fitted to the existing NMDS plot of sample similarities using the function envfit() (package vegan).

###
MDS_res <- metaMDS(varespec, distance = "bray", k = 2, trymax = 50)

bio.keep <- as.numeric(unlist(strsplit(res.bv.step.biobio$order.by.best$var.incl[1], ","))) #indices of the best species subset from the first bv.step run
bio.fit <- envfit(MDS_res, varespec[,bio.keep], perm = 999)
> bio.fit

***VECTORS

            NMDS1     NMDS2     r2 Pr(>r)    
Vac.myr  0.997347 -0.072790 0.5018  0.001 ***
Des.fle  0.933940  0.357429 0.3249  0.001 ***
Dic.pol  0.627491 -0.778624 0.4049  0.001 ***
Cla.arb -0.673355  0.739320 0.3806  0.009 ** 
Cla.ran -0.960250  0.279143 0.5179  0.001 ***
Cla.ste -0.525775 -0.850624 0.4812  0.002 ** 
Cla.unc  0.305834  0.952085 0.0075  0.922    
Cla.cor  0.523069  0.852290 0.0367  0.635    
Cla.fim  0.092031  0.995756 0.0658  0.479    
Nep.arc  0.271896  0.962327 0.2113  0.090 .  
Ste.sp  -0.702757  0.711430 0.2412  0.033 *  
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
P values based on 999 permutations.


env.keep <- c(2,5,7,8,9,11,13) #the best bio.env subset: P, Mg, Al, Fe, Mn, Mo, Humdepth
env.fit <- envfit(MDS_res, alt.varechem[,env.keep], perm = 999)
> env.fit

***VECTORS

            NMDS1    NMDS2     r2 Pr(>r)    
P         0.61991 -0.78468 0.1938  0.105    
Mg        0.63267 -0.77442 0.4270  0.003 ** 
Al       -0.87155 -0.49030 0.5269  0.001 ***
Fe       -0.93595 -0.35212 0.4450  0.002 ** 
Mn        0.79867  0.60176 0.5231  0.001 ***
Mo       -0.90311 -0.42942 0.0609  0.531    
Humdepth  0.93277  0.36048 0.5201  0.002 ** 
---
Signif. codes:  0 ‘***’ 0.001 ‘**’ 0.01 ‘*’ 0.05 ‘.’ 0.1 ‘ ’ 1 
P values based on 999 permutations.

png(filename="MDS_w_bioenv+biobio.png", width=6, height=3, units="in", res=200)
par(mfcol=c(1,2), omi=c(0.1, 0.1, 0.1, 0.1), mar=c(4, 4, 1, 1), ps=8)
#first plot: samples with environmental vectors
plot(MDS_res$points, t="n", xlab="NMDS1", ylab="NMDS2")
plot(env.fit, col="gray50", cex=0.8, font=4)
text(MDS_res$points, as.character(1:nrow(MDS_res$points)), cex=0.7)
text(min(MDS_res$points[,1]), max(MDS_res$points[,2]), paste("Stress =", round(MDS_res$stress/100, 2)), pos=4, font=3, col="gray30") #this vegan version reports stress in percent
#second plot: samples with species vectors
plot(MDS_res$points, t="n", xlab="NMDS1", ylab="NMDS2")
plot(bio.fit, col="gray50", cex=0.8, font=4)
text(MDS_res$points, as.character(1:nrow(MDS_res$points)), cex=0.7)
text(min(MDS_res$points[,1]), max(MDS_res$points[,2]), paste("Stress =", round(MDS_res$stress/100, 2)), pos=4, font=3, col="gray30")
dev.off()
###

The fact that some of the plotted vectors are not significant is probably due to the relatively high stress of the MDS configuration (stress = 0.18), which is close to the recommended maximum of 0.20, above which misinterpretation of the plotted distances becomes likely. Since envfit() scales these vectors by their correlation coefficient, the resulting plot still allows one to quickly identify the most important variable gradients represented by the MDS plot.
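A Shepard diagram offers a quick visual check of how faithfully the 2-dimensional configuration preserves the original dissimilarities (a one-liner using vegan's stressplot()):

###
stressplot(MDS_res) #observed dissimilarities vs. ordination distances
###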



further references:
~ Clarke, K. R. & Ainsworth, M., 1993. A method of linking multivariate community structure to environmental variables. Marine Ecology Progress Series, 92, 205-219.
~ Clarke, K. R. & Gorley, R. N., 2001. PRIMER v5: User Manual/Tutorial. PRIMER-E, Plymouth, UK.
~ Clarke, K. R. & Warwick, R. M., 2001. Changes in Marine Communities: An Approach to Statistical Analysis and Interpretation, 2nd edition. PRIMER-E Ltd, Plymouth, UK.
~ Clarke, K. R. & Warwick, R. M., 1998. Quantifying structural redundancy in ecological communities. Oecologia, 113, 278-289.
~ Oksanen, J., Blanchet, F. G., Kindt, R., Legendre, P., O'Hara, R. B., Simpson, G. L., Solymos, P., Stevens, M. H. H. & Wagner, H., 2010. vegan: Community Ecology Package. R package version 1.17-4. http://CRAN.R-project.org/package=vegan


function bio.env()
bio.env <- function(fix.mat, var.mat, 
fix.dist.method="bray", var.dist.method="euclidean",
scale.fix=FALSE, scale.var=TRUE,
output.best=10,
var.max=ncol(var.mat)
){
 if(dim(fix.mat)[1] != dim(var.mat)[1]){stop("fixed and variable matrices must have the same number of rows")}
 if(var.max > dim(var.mat)[2]){stop("var.max cannot be larger than the number of variables (columns) in var.mat")}
 
 require(vegan)
 
 #total number of subset combinations to evaluate, i.e. the sum of choose(n, k) for k = 1..var.max
 combn.sum <- sum(factorial(ncol(var.mat))/(factorial(1:var.max)*factorial(ncol(var.mat)-1:var.max)))
 
 if(scale.fix){fix.mat <- scale(fix.mat)}
 if(scale.var){var.mat <- scale(var.mat)}
 fix.dist <- vegdist(fix.mat, method=fix.dist.method)
 RES_TOT <- c()
 best.i.comb <- c()
 iter <- 0
 for(i in 1:var.max){
  var.comb <- combn(1:ncol(var.mat), i, simplify=FALSE)
  RES <- data.frame(var.incl=rep(NA, length(var.comb)), n.var=i, rho=0)
  for(f in 1:length(var.comb)){
   iter <- iter+1
   var.dist <- vegdist(as.matrix(var.mat[,var.comb[[f]]]), method=var.dist.method)
   temp <- suppressWarnings(cor.test(fix.dist, var.dist, method="spearman"))
   RES$var.incl[f] <- paste(var.comb[[f]], collapse=",")
   RES$rho[f] <- temp$estimate
   if(iter %% 100 == 0){print(paste(round(iter/combn.sum*100, 3), "% finished"))}
  }
 
  order.rho <- order(RES$rho, decreasing=TRUE)
  best.i.comb <- c(best.i.comb, RES$var.incl[order.rho[1]])
  if(length(order.rho) > output.best){
   RES_TOT <- rbind(RES_TOT, RES[order.rho[1:output.best],])
  } else {
   RES_TOT <- rbind(RES_TOT, RES)
  }
 }
 rownames(RES_TOT)<-NULL
 
 if(dim(RES_TOT)[1] > output.best){
  order.by.best <- order(RES_TOT$rho, decreasing=TRUE)[1:output.best]
 } else {
  order.by.best <- order(RES_TOT$rho, decreasing=TRUE)
 }
 OBB <- RES_TOT[order.by.best,]
 rownames(OBB) <- NULL
 
 order.by.i.comb <- match(best.i.comb, RES_TOT$var.incl)
 OBC <- RES_TOT[order.by.i.comb,]
 rownames(OBC) <- NULL
 
 out <- list(
  order.by.best=OBB,
  order.by.i.comb=OBC,
  best.model.vars=paste(colnames(var.mat)[as.numeric(unlist(strsplit(OBB$var.incl[1], ",")))], collapse=",") ,
  best.model.rho=OBB$rho[1]
 )
 out
}

function bv.step()
bv.step <- function(fix.mat, var.mat, 
fix.dist.method="bray", var.dist.method="euclidean",
scale.fix=FALSE, scale.var=TRUE,
max.rho=0.95,
min.delta.rho=0.001,
random.selection=TRUE,
prop.selected.var=0.2,
num.restarts=10,
var.always.include=NULL,
var.exclude=NULL,
output.best=10
){
 
 if(dim(fix.mat)[1] != dim(var.mat)[1]){stop("fixed and variable matrices must have the same number of rows")}
 if(sum(var.always.include %in% var.exclude) > 0){stop("var.always.include and var.exclude share a variable")}
 require(vegan)
 
 if(scale.fix){fix.mat <- scale(fix.mat)}
 if(scale.var){var.mat <- scale(var.mat)}
 
 fix.dist <- vegdist(as.matrix(fix.mat), method=fix.dist.method)
 
 #an initial elimination phase: drop variables whose single removal changes rho by less than min.delta.rho
 var.dist.full <- vegdist(as.matrix(var.mat), method=var.dist.method)
 full.cor <- suppressWarnings(cor.test(fix.dist, var.dist.full, method="spearman"))$estimate
 var.comb <- combn(1:ncol(var.mat), ncol(var.mat)-1)
 RES <- data.frame(var.excl=rep(NA,ncol(var.comb)), n.var=ncol(var.mat)-1, rho=NA)
 for(i in 1:dim(var.comb)[2]){
  var.dist <- vegdist(as.matrix(var.mat[,var.comb[,i]]), method=var.dist.method)
  temp <- suppressWarnings(cor.test(fix.dist, var.dist, method="spearman"))
  RES$var.excl[i] <- c(1:ncol(var.mat))[-var.comb[,i]]
  RES$rho[i] <- temp$estimate
 }
 delta.rho <- RES$rho - full.cor
 exclude <- sort(unique(c(RES$var.excl[which(abs(delta.rho) < min.delta.rho)], var.exclude)))
 
 if(random.selection){
  prob <- rep(1, ncol(var.mat))
  if(prop.selected.var < 1){
   prob[exclude] <- 0 #excluded variables get zero sampling probability
  }
 } else {
  num.restarts <- 1
  prop.selected.var <- 1
  prob <- rep(1, ncol(var.mat))
 }
 n.selected.var <- min(sum(prob), prop.selected.var*dim(var.mat)[2])
 
 RES_TOT <- c()
 for(i in 1:num.restarts){
  step=1
  RES <- data.frame(step=step, step.dir="F", var.incl=NA, n.var=0, rho=0)
  attr(RES$step.dir, "levels") <- c("F","B")
  best.comb <- which.max(RES$rho)
  best.rho <- RES$rho[best.comb]
  delta.rho <- Inf
  selected.var <- sort(unique(c(sample(1:dim(var.mat)[2], n.selected.var, prob=prob), var.always.include)))
  while(best.rho < max.rho & delta.rho > min.delta.rho & RES$n.var[best.comb] < length(selected.var)){
   #forward step
   step.dir="F"
   step=step+1
   var.comb <- combn(selected.var, RES$n.var[best.comb]+1, simplify=FALSE)
   if(RES$n.var[best.comb] == 0){
    var.comb.incl<-1:length(var.comb)
   } else {
    var.keep <- as.numeric(unlist(strsplit(RES$var.incl[best.comb], ",")))
    temp <- NA*1:length(var.comb)
    for(j in 1:length(temp)){
     temp[j] <- all(var.keep %in% var.comb[[j]]) 
    }
    var.comb.incl <- which(temp==1)
   }
 
   RES.f <- data.frame(step=rep(step, length(var.comb.incl)), step.dir=step.dir, var.incl=NA, n.var=RES$n.var[best.comb]+1, rho=NA)
   for(f in 1:length(var.comb.incl)){
    var.incl <- var.comb[[var.comb.incl[f]]]
    var.incl <- var.incl[order(var.incl)]
    var.dist <- vegdist(as.matrix(var.mat[,var.incl]), method=var.dist.method)
    temp <- suppressWarnings(cor.test(fix.dist, var.dist, method="spearman"))
    RES.f$var.incl[f] <- paste(var.incl, collapse=",")
    RES.f$rho[f] <- temp$estimate
   }
 
   last.F <- max(which(RES$step.dir=="F"))
   RES <- rbind(RES, RES.f[which.max(RES.f$rho),])
   best.comb <- which.max(RES$rho)
   delta.rho <- RES$rho[best.comb] - best.rho 
   best.rho <- RES$rho[best.comb]
 
   if(best.comb == step){
    while(best.comb == step & RES$n.var[best.comb] > 1){
     #backward step
     step.dir="B"
     step <- step+1
     var.keep <- as.numeric(unlist(strsplit(RES$var.incl[best.comb], ",")))
     var.comb <- combn(var.keep, RES$n.var[best.comb]-1, simplify=FALSE)
     RES.b <- data.frame(step=rep(step, length(var.comb)), step.dir=step.dir, var.incl=NA, n.var=RES$n.var[best.comb]-1, rho=NA)
     for(b in 1:length(var.comb)){
      var.incl <- var.comb[[b]]
      var.incl <- var.incl[order(var.incl)]
      var.dist <- vegdist(as.matrix(var.mat[,var.incl]), method=var.dist.method)
      temp <- suppressWarnings(cor.test(fix.dist, var.dist, method="spearman"))
      RES.b$var.incl[b] <- paste(var.incl, collapse=",")
      RES.b$rho[b] <- temp$estimate
     }
     RES <- rbind(RES, RES.b[which.max(RES.b$rho),])
     best.comb <- which.max(RES$rho)
     best.rho<- RES$rho[best.comb]
    }
   } else {
    break
   }
 
  }
 
  RES_TOT <- rbind(RES_TOT, RES[2:dim(RES)[1],])
  print(paste(round((i/num.restarts)*100,3), "% finished"))
 }
 
 RES_TOT <- unique(RES_TOT[,3:5])
 
 
 if(dim(RES_TOT)[1] > output.best){
  order.by.best <- RES_TOT[order(RES_TOT$rho, decreasing=TRUE)[1:output.best],]
 } else {
  order.by.best <-  RES_TOT[order(RES_TOT$rho, decreasing=TRUE), ]
 }
 rownames(order.by.best)<-NULL
 
 order.by.i.comb <- c()
 for(i in 1:length(selected.var)){
  f1 <- which(RES_TOT$n.var==i)
  f2 <- which.max(RES_TOT$rho[f1])
  order.by.i.comb <- rbind(order.by.i.comb, RES_TOT[f1[f2],])
 }
 rownames(order.by.i.comb)<-NULL
 
 if(length(exclude)<1){var.exclude=NULL} else {var.exclude=exclude}
 out <- list(
  order.by.best=order.by.best,
  order.by.i.comb=order.by.i.comb,
  best.model.vars=paste(colnames(var.mat)[as.numeric(unlist(strsplit(order.by.best$var.incl[1], ",")))], collapse=","),
  best.model.rho=order.by.best$rho[1],
  var.always.include=var.always.include,
  var.exclude=var.exclude
  )
 out
 
}

19 comments:

  1. I would like to change the input to your bio.env script. However, I'm not very good at finding what I should change. I would like to make it possible to put in a fix.dist matrix that has already been created using another algorithm. For instance, I have a genetic dissimilarity matrix and I would like to use that as the input fix.dist. Would there be a quick way to make the code do this?

    Robby M.

  2. You could change the input parameters of the function to accept a "fix.dist" object instead of a "fix.mat" object:

    bio.env <- function(fix.dist, var.mat,
    fix.dist.method="bray", var.dist.method="euclidean",
    scale.fix=FALSE, scale.var=TRUE,
    output.best=10,
    var.max=ncol(var.mat)
    ){

    Then, delete the line:

    fix.dist <- vegdist(fix.mat, method=fix.dist.method)

    I'm not sure that it matters, but the vegdist function from the package vegan returns a lower-triangle "dist" object, so you will have to make sure that the calculation of the Spearman coefficient works correctly with your dissimilarity matrix.
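    In condensed form, the modified function might begin like this (a sketch; 'bio.env.d' is just a placeholder name, and the rest of the body stays as in bio.env()):

    bio.env.d <- function(fix.dist, var.mat,
    var.dist.method="euclidean",
    scale.var=TRUE,
    output.best=10,
    var.max=ncol(var.mat)
    ){
     if(attr(fix.dist, "Size") != nrow(var.mat)){stop("fix.dist and var.mat must describe the same samples")}
     #...rest of the body as in bio.env(), minus the vegdist() call on fix.mat...
    }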

  3. Any idea why the bioenv function in vegan is quicker than your bio.env script?

  4. No, unfortunately I don't know why it's faster. Is there a large difference? The authors of vegan are obviously much better programmers than myself, so it doesn't really surprise me.

  5. Hi! I would need to use bioenv, but I have genetic and environmental distances as the response and predictor variables, and I don't know how to input the matrices as ready-made distance matrices rather than as dataframes. I would thus need to change not only fix.mat into fix.dist but also var.mat into var.dist.
    How shall I do it?
    Thank you!

  6. hello,
    I would like to use your bv.step function on 16s data, how can i use your script?

    Replies
    1. hi - Not sure what you mean by "16s" data - it seems to be genetic information. I can't really comment on that specifically, but in principle, it should be fine. The more relevant question might be how to treat your data and what kind of distance matrix to use for such data.

  7. hi,

    I want to use the bvstep with a genetic relatedness matrix as the fixed distance matrix. I'm just struggling a bit with what to change in the code. Do you have any advice? Btw, very good job implementing bvstep - I didn't find any other source doing that.

    Best,
    Martin

    Replies
    1. Hi Martin,
      The 'fix.mat' argument should contain your original genetic data, not the relatedness matrix. You then specify the distance measure with 'fix.dist.method'. At the moment, this calls the 'vegdist' function from vegan, so you would have to use one of the distance measures implemented therein. If you want to use your own distance measure, then you would need to adapt the function to accept your 'fix.dist'.
      Hope that helps. I'm also surprised that no one has tried to implement this (or improve on it).
      Cheers, Marc

  8. Hi Marc. Firstly, thanks for producing these functions; being able to do these analyses on a species matrix x species matrix basis was one of the last things I needed Primer for. However, when I run your function using my species x site matrix, the function runs OK but produces one of these pairs of warnings per (I assume) species.

    1: In vegdist(as.matrix(var.mat[, var.incl]), method = var.dist.method) : you have empty rows: their dissimilarities may be meaningless in method “bray”
    2: In vegdist(as.matrix(var.mat[, var.incl]), method = var.dist.method) : missing values in results

    I get these even when using the varespec data from vegan, and my species x site matrix is set up as per varespec (sites in rows, species in columns).

    Any suggestions as to what these warnings mean? The function does run OK otherwise and provides meaningful results. All sites have multiple species and all species occurred in at least two sites, so it's not a zero row/column sum problem, as I've seen suggested elsewhere as a cause of this issue when running vegdist.

    I am running exactly your code as above:

    res.bv.step.biobio <- bv.step(wisconsin(varespec), wisconsin(varespec),
    fix.dist.method="bray", var.dist.method="bray",
    scale.fix=FALSE, scale.var=FALSE,
    max.rho=0.95, min.delta.rho=0.001,
    random.selection=TRUE,
    prop.selected.var=0.3,
    num.restarts=50,
    output.best=10,
    var.always.include=NULL)

    Cheers,
    Adrian.

    Replies
    1. Hello Adrian,

      Glad you find the function useful. I should have mentioned these warnings in the post. This is nothing to worry about, actually. The full community matrix has no rows (samples) that are empty, i.e. where all values are zero (it wouldn't make sense to include a sample where nothing was found). But when we sub-sample the community matrix in bv.step to include only a subset of species, there will be cases where the "var.mat" has empty rows, especially when only a few species are selected, such as at the beginning of a new restart.

      If these warnings bother you, you could adapt the two lines of code where "var.dist" is calculated to include "suppressWarnings()":

      var.dist <- suppressWarnings(vegan::vegdist(as.matrix(var.mat[,var.incl]), method = var.dist.method))

      Hope that helps.

      Cheers,
      Marc

  9. Hi, and thank you for your post, it has been very useful to me. I have a question: I've used the capscale function to generate a PCoA plot with Bray Curtis distances instead of an NMDS plot with metaMDS and it works fine, but I would also like to use binary Jaccard distances to generate the PCoA plot and vectors, how could I do this using your code? I've tried writing 'jaccard' wherever I see a distance argument in your code and adding 'binary=TRUE', but this crude attempt has been unsuccessful.

    Regards,
    Melina.

  10. Thanks for your time and effort to produce this code, it has been super useful!
    I just have a question about BIO-ENV - I'm unsure which similarity matrix I should use - my environmental data is mixed and includes both continuous and binary data.
    Many thanks for any help.
    Best wishes,
    Kate

    Replies
    1. Hi Kate. I'm not an expert in similarity indices, but I believe that you are right about the Gower index being suitable for matrices containing a mixture of continuous and factor variables. I know that this index has been used for analyses of functional diversity, which contains both. Hope that helps, Marc
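      For mixed continuous/factor data, the daisy() function from the 'cluster' package (shipped with R) is one option. A small sketch with made-up data (note that vegdist's "gower" expects numeric input):

      require(cluster)
      env.mixed <- data.frame(temp=c(4.2, 6.1, 5.3), substrate=factor(c("mud", "sand", "mud")))
      gower.dist <- daisy(env.mixed, metric="gower")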

  11. Hi there, this seems like a really useful function. Is it suitable for comparing 2 sets of species data (e.g. plants and invertebrates) for a number of sites to see which species co-occur?
    Thanks.

    Replies
    1. I imagine you could do that, but I'm not aware of examples. The fixed matrix should be the spp group that you think is structured by the other (variable) matrix. So it's not exactly returning species that co-occur; rather, you are finding the subset of spp from the variable matrix that best correlates with the fixed one. For co-occurrence, you may want to look into the concept of "structural redundancy" (Clarke and Warwick 1998).

      Clarke, K. R., and R. M. Warwick. "Quantifying structural redundancy in ecological communities." Oecologia 113.2 (1998): 278-289.

  12. How can we cite these functions?

    Replies
    1. The 'sinkr' package contains both functions with slightly different names: 'bioEnv' and 'bvStep'. You can find the package here: https://github.com/marchtaylor/sinkr. After installing, you can get citation information by running citation("sinkr") in R.
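      For example (assuming the 'remotes' package is installed):

      remotes::install_github("marchtaylor/sinkr")
      library(sinkr)
      citation("sinkr")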

  13. Hello, thank you for your great post on this approach. Does one need to be concerned with collinearity among the environmental variables? I have a number of closely related environmental variables, but given this non-parametric fitting approach, my temptation is to throw them all into the bioenv procedure to see the full suite of variables influencing communities.
    Thank you for any thoughts.
    paul
