Wednesday, April 30, 2014

Decision making trees and machine learning resources for R

I have recently come across Ricky Ho's blog "Pragmatic Programming Techniques", which seems to be an excellent resource on many aspects of data exploration and predictive modelling. The post "Six steps in data science" provides a nice overview of some of the topics covered in the blog. For some reason, this blog does not seem to be listed on R-Bloggers (attn: Tal Galili!).

I was drawn to the page by my interest in understanding Classification and Regression Tree (CART) models, and was quickly impressed by the blog's nice review of some of the available methods. I was specifically looking for an example that uses Edgar Anderson's iris data set, as I find it to be a very understandable example for this type of problem - i.e. can we develop a model to predict the iris species based on its morphological characteristics?

Below is a sample of just two of the methods that were presented, using the rpart and randomForest packages. rpart is referred to as a "Decision Tree" method, while randomForest is an example of a "Tree Ensemble" method. The blog explains many of the pros and cons of each method, and a further post shows more examples of predictive analytics, including Neural Network, Support Vector Machine, Naive Bayes and Nearest Neighbor approaches (all using the iris data set). I would love to know a bit more about the comparative predictive power of each of these methods. In the meantime, the example below shows a cross-validation comparison of prediction accuracy for the rpart and randomForest methods using 100 permutations. In each permutation, a random half of the data set is used as the training set and the other half is used as the validation set.
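As a rough illustration of this kind of comparison, here is a minimal sketch in R. It is not necessarily the code behind the figures in this post: the seed, the object names (acc_rpart, acc_rf) and the use of default model settings are assumptions made for the example, so the exact numbers may differ.

```r
# Minimal sketch: repeated 50/50 split validation of rpart vs. randomForest
# on the iris data. Seed, object names and default settings are assumptions.
library(rpart)
library(randomForest)

set.seed(1)      # assumed seed, only for reproducibility
n_perm <- 100    # number of random permutations (splits)
species <- levels(iris$Species)

# One row per permutation, one column per species
acc_rpart <- matrix(NA_real_, nrow = n_perm, ncol = length(species),
                    dimnames = list(NULL, species))
acc_rf <- acc_rpart

for (i in seq_len(n_perm)) {
  # Random half for training, the remaining half for validation
  train_idx <- sample(nrow(iris), nrow(iris) / 2)
  train <- iris[train_idx, ]
  valid <- iris[-train_idx, ]

  # Fit both models with their default settings
  fit_rpart <- rpart(Species ~ ., data = train, method = "class")
  fit_rf <- randomForest(Species ~ ., data = train)

  # Predict the species of the validation samples
  pred_rpart <- predict(fit_rpart, newdata = valid, type = "class")
  pred_rf <- predict(fit_rf, newdata = valid)

  # Proportion of correct predictions within each true species
  acc_rpart[i, ] <- tapply(pred_rpart == valid$Species, valid$Species, mean)
  acc_rf[i, ] <- tapply(pred_rf == valid$Species, valid$Species, mean)
}

# Mean per-species accuracy over all permutations
acc_summary <- rbind(rpart = colMeans(acc_rpart, na.rm = TRUE),
                     randomForest = colMeans(acc_rf, na.rm = TRUE))
round(acc_summary, 3)
```

Keeping the per-species accuracies in a matrix makes it easy to pass them on to boxplot() to compare the spread of the two methods across permutations.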



The results show a slight improvement in accuracy for the randomForest method, especially for the species versicolor and virginica - which are more similar in morphology. This can be seen in the degree of overlap in the plot of the first 2 principal components (explaining ~98% of the variance):
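A plot of this kind could be produced along the following lines. This is only a sketch: it assumes the principal components are computed with prcomp() on the four unscaled measurements, and the colours and legend position are arbitrary choices.

```r
# Sketch: PCA of the four iris measurements (unscaled, an assumption)
pca <- prcomp(iris[, 1:4])

# Proportion of the total variance explained by each component
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum2 <- sum(var_explained[1:2])  # share explained by the first 2 components

# Scatter plot of the first 2 components, coloured by species
plot(pca$x[, 1], pca$x[, 2],
     col = as.integer(iris$Species), pch = 19,
     xlab = paste0("PC1 (", round(100 * var_explained[1], 1), "%)"),
     ylab = paste0("PC2 (", round(100 * var_explained[2], 1), "%)"))
legend("topright", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 19)
```

In such a plot, setosa forms its own cluster, while the versicolor and virginica points sit next to each other with some overlap.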




The setosa species is different enough that there is perfect (100%) accuracy in its prediction. I'm looking forward to continuing with this comparison for the other methods as well.

Example code: