Tuesday, September 30, 2014

Additional tips for structuring an individual-based model in R

I had a reader ask me recently to help understand how to modify the code of an individual-based model (IBM) that I posted a while back. It was my first attempt at an IBM in R, and I realized that I have made some significant changes to the way that I code such models nowadays. Most of the changes are structural, but seem to help a lot in clearly documenting the model and its underlying processes.
Basically, I follow a structure that a friend taught me (a very experienced modeller, specializing in IBMs). Granted R isn't the best language for such models but, depending on your computational needs, it can also be quite easy to implement if you already have experience in basic R programming. The idea is to code important processes of the IBM as functions, which accept an object of individuals as the main argument. The functions can be as simple or elaborate as needed, but the key is that when you finally set up your simulation, you only need to call these functions again. This results in easily legible code, where you are less likely to get lost in the model and can concentrate more on the processes to be performed during each iteration (e.g. growth, reproduction, mortality):

inds <- grow.inds(inds)
inds <- reproduce.inds(inds)
inds <- kill.inds(inds)


Since the previous post, I have gone towards using data.frame objects to store individuals and their attributes since I find it easier to apply functions for extracting summary statistics (e.g. histogram of age distribution in the final population):


The example shown here is a modification of the genetic drift example that I showed before - only this time I have included 4 color phenotypes. The model runs until either one phenotype dominates the population or a maximum number of iterations is reached. Reproduction allows for a individuals to have more than one offspring per time step, but skewed towards zero. Death is modeled by a constant instantaneous mortality rate. I have used such a setup with more complicated models of fish genetics and found the performance to be quite fast, even with population sizes of 300,000 individuals. The key is to maintain as much vectorization in the functions as possible.

To reproduce the example:

Wednesday, September 17, 2014

Maximal Information Coefficient (Part II)



A while back, I wrote a post simply announcing a recent paper that described a new statistic called the "Maximal Information Coefficient" (MIC), which is able to describe the correlation between paired variables regardless of linear or nonlinear relationship. This turned out to be quite a popular post, and included a lively discussion as to the merits of the work and difficulties in using the software provided by the authors. Regarding the latter, I also had difficulties running the software on R and thus did not include an example. Checking back on this topic, I was pleased to see that an R package had subsequently been developed: minerva: Maximal Information-Based Nonparametric Exploration R package for Variable Analysis (Albanese et al. 2013). Further documentation of the package can be found here: http://minepy.sourceforge.net/

I tried out the package on the baseball data set used in the original paper by Reshef et al. (2011), where a suite of variables are correlated against a baseball player's salary. The author's state in their paper:
"In the MLB data set (131 variables), MIC and ρ both identified many linear relationships, but interesting differences emerged. On the basis of p, the strongest three correlates with player salary are walks, intentional walks, and runs batted in. By contrast, the strongest three associations according to MIC are hits, total bases, and a popular aggregate offensive statistic called Replacement Level Marginal Lineup Value (27, 34) (fig. S12 and table S12). We leave it to baseball enthusiasts to decide which of these statistics are (or should be!) more strongly tied to salary."
Here is a summary from the results computed with the function mine() of the minerva package (top 10 ranking MIC coefficients), which reproduces the same results as are shown in the Supplementary table S12 of the original paper:


For a visual representation of these results, the top figure plots MIC vs. Pearson and MIC Rank vs. Pearson Rank. Thanks to minerva author and maintainer M. Filosi for helping in reproducing the example.


References:

Albanese, D., Filosi, M., Visintainer, R., Riccadonna, S., Jurman, G., & Furlanello, C. (2013). minerva and minepy: a C engine for the MINE suite and its R, Python and MATLAB wrappers. Bioinformatics, 29(3), 407-408. [link]

Reshef, D. N., Reshef, Y. A., Finucane, H. K., Grossman, S. R., McVean, G., Turnbaugh, P. J., ... & Sabeti, P. C. (2011). Detecting novel associations in large data sets. science, 334(6062), 1518-1524. [link]


Code to reproduce the example:

Monday, September 15, 2014

PCA / EOF for data with missing values - a comparison of accuracy

Not all Principal Component Analysis (PCA) (also called Empirical Orthogonal Function analysis, EOF) approaches are equal when it comes to dealing with a data field that contain missing values (i.e. "gappy"). The following post compares several methods by assessing the accuracy of the derived PCs to reconstruct the "true" data set, as was similarly conducted by Taylor et al. (2013).

The gappy EOF methods to be compared are:
  1. LSEOF - "Least-Squares Empirical Orthogonal Functions" - The traditional approach, which modifies the covariance matrix used for the EOF decomposition by the number of paired observations, and further scales the projected PCs by these same weightings (see Björnsson and Venegas 1997, von Storch and Zweiers 1999 for details).
  2. RSEOF - "Recursively Subtracted Empirical Orthogonal Functions" - This approach modifies the LSEOF approach by recursively solving for the leading EOF, whose reconstructed field is then subtracted from the original field. This recursive subtraction is done until a given stopping point (i.e. number of EOFs, % remaining variance, etc.) (see Taylor et al. 2013 for details)
  3. DINEOF - "Data Interpolating Empirical Orthogonal Functions" - This approach gradually solves for EOFs by means of an iterative algorothm to fit EOFS to a given number of non-missing value reference points (small percentage of observations) via RMSE minimization (see Beckers and Rixen 2003 for details).
I have introduced both the LSEOF [link] and DINEOF [link] methods in the past, but have never directly compared them for the blog. The purpose of this post is to make this comparison and to also introduce a more general EOF function that is capable of conducting RSEOF. All analyses can be reproduced following installation of the "sinkr" package: https://github.com/marchtaylor/sinkr

The basic problem comes down to the difficulties of decomposing a matrix that is not "positive-definite", i.e. the estimated covariance matrix from a gappy data set. DINEOF entirely avoids this issue by first interpolating the values to create a full data field, while LSEOF and RSEOF rely on decomposing this estimation. A known problem is that the trailing EOFs derived from such a matrix are amplified in their singular values, which can consequently amplify errors in field reconstructions when included. The RSEOF approach thus attempts to remedy these issues by recursively solving for only leading EOFs. In the following examples, I show the performance of the three approaches in terms of reconstructing the data field (including the "true" values).

Example 1 - Synthetic data set:
The first example uses a synthetic data set used by Beckers and Rixen (2003) in their introduction of the DINEOF approach. The accuracy of the reconstruction is dependent on the number of  EOFs used. In a non-gappy example, a perfect reconstruction should be possible using this full set of EOFs - In fact it only takes 9 EOFs when using the non-noisy true field, since it is a composite of 9 signals. In the case of the noisy, gappy data sets, reconstructions with trailing EOFs may increase errors. This can be seen in the figure at the top of the post showing RMSE vs the number of EOFs used in the reconstruction.

The figure shows the DINEOF approach to be the most accurate. The LSEOF approach has a clear RSME minimum with 4 EOFs, while the RSEOF approach was largely able to remedy the amplification of error when using trailing EOFs. The problem of error amplification is even more dramatic when viewed visually, as in the following where the full set of EOFs have been used:

Tuesday, September 2, 2014

"sinkr" - a collection of functions featured on "me nugget"

The R package sinkr (version 1.0) has now been released:  https://github.com/marchtaylor/sinkr

I have finally gotten around to learning how to create an R package and decided to start by bundling functions that I have featured on the blog. Thanks to the R Studio team for making this so easy (in combination with the R packages roxygen2 and devtools). In addition to the great tips on the R Studio website,  I found the following Youtube videos helpful along the way:
Being new to the world of R packages (and due to the eclectic nature of the functions in sinkr), I'm not confident to upload this to CRAN. But, one can easily install the package using devtools and the following code (see https://github.com/hadley/devtools for OS-specific package updating tips):

library(devtools)

install_github("marchtaylor/sinkr")

For those using functions that were featured on this site in the past, the present versions may differ slightly, especially in the case of function names. For example, function names like plot.stacked were getting associated with the general plot function during the package build and thus "." characters have been removed from all function names (e.g. plotStacked).

sinkr functions include: