Reinventing the wheel for ordination biplots with ggplot2

I’ll be the first to admit that the topic of plotting ordination results using ggplot2 has been visited many times over. As is my typical fashion, I started creating a package for this purpose without completely searching for existing solutions. Specifically, the ggbiplot and factoextra packages already provide almost complete coverage of plotting results from multivariate and ordination analyses in R. Being the stubborn individual, I couldn’t give up on my own package so I started exploring ways to improve some of the functionality of biplot methods in these existing packages. For example, ggbiplot and factoextra work almost exclusively with results from principal components analysis, whereas numerous other multivariate analyses can be visualized using the biplot approach. I started to write methods to create biplots for some of the more common ordination techniques, in addition to all of the functions I could find in R that conduct PCA. This exercise became very boring very quickly so I stopped adding methods after the first eight or so. That being said, I present this blog as a sinking ship that was doomed from the beginning, but I’m also hopeful that these functions can be built on by others more ambitious than myself.

The process of adding methods to a default biplot function in ggplot was pretty simple and not the least bit interesting. The default ggpord biplot function (see here) is very similar to the default biplot function from the stats base package. Only two inputs are used, the first being a two column matrix of the observation scores for each axis in the biplot and the second being a two column matrix of the variable scores for each axis. Adding S3 methods to the generic function required extracting the relevant elements from each model object and then passing them to the default function. Easy as pie but boring as hell.

I’ll repeat myself again. This package adds nothing new to the functionality already provided by ggbiplot and factoextra. However, I like to think that I contributed at least a little bit by adding more methods to the biplot function. On top of that, I’m also naively hopeful that others will be inspired to fork my package and add methods. Here you can view the raw code for the ggord default function and all methods added to that function. Adding more methods is straightforward, but I personally don’t have any interest in doing this myself. So who wants to help??

Visit the package repo here or install the package as follows.

library(devtools)
install_github('fawda123/ggord')
library(ggord)

Available methods and examples for each are shown below. These plots can also be reproduced from the examples in the ggord help file.

##  [1] ggord.acm      ggord.ca       ggord.coa      ggord.default 
##  [5] ggord.lda      ggord.mca      ggord.MCA      ggord.metaMDS 
##  [9] ggord.pca      ggord.PCA      ggord.prcomp   ggord.princomp
# principal components analysis with the iris data set
# prcomp
ord <- prcomp(iris[, 1:4])

p <- ggord(ord, iris$Species)
p

p + scale_colour_manual('Species', values = c('purple', 'orange', 'blue'))

p + theme_classic()

p + theme(legend.position = 'top')

p + scale_x_continuous(limits = c(-2, 2))

# principal components analysis with the iris dataset
# princomp
ord <- princomp(iris[, 1:4])

ggord(ord, iris$Species)

# principal components analysis with the iris dataset
# PCA
library(FactoMineR)

ord <- PCA(iris[, 1:4], graph = FALSE)

ggord(ord, iris$Species)

# principal components analysis with the iris dataset
# dudi.pca
library(ade4)

ord <- dudi.pca(iris[, 1:4], scannf = FALSE, nf = 4)

ggord(ord, iris$Species)

# multiple correspondence analysis with the tea dataset
# MCA
data(tea)
tea <- tea[, c('Tea', 'sugar', 'price', 'age_Q', 'sex')]

ord <- MCA(tea[, -1], graph = FALSE)

ggord(ord, tea$Tea)

# multiple correspondence analysis with the tea dataset
# mca
library(MASS)

ord <- mca(tea[, -1])

ggord(ord, tea$Tea)

# multiple correspondence analysis with the tea dataset
# acm
ord <- dudi.acm(tea[, -1], scannf = FALSE)

ggord(ord, tea$Tea)

# nonmetric multidimensional scaling with the iris dataset
# metaMDS
library(vegan)
ord <- metaMDS(iris[, 1:4])

ggord(ord, iris$Species)

# linear discriminant analysis
# example from lda in MASS package
ord <- lda(Species ~ ., iris, prior = rep(1, 3)/3)

ggord(ord, iris$Species)

# correspondence analysis
# dudi.coa
ord <- dudi.coa(iris[, 1:4], scannf = FALSE, nf = 4)

ggord(ord, iris$Species)

# correspondence analysis
# ca
library(ca)
ord <- ca(iris[, 1:4])

ggord(ord, iris$Species)

Cheers,

Marcus

Advertisements

Some love for ggplot2

With all the recent buzz about ggvis (this, this, and this) it’s often easy to forget all that ggplot2 offers as a graphics package. True, ggplot is a static approach to graphing unlike ggvis but it has fundamentally changed the way we think about plots in R. I recently spent some time thinking about some of the more useful features of ggplot2 to answer the question ‘what is offered by ggplot2 that one can’t do with the base graphics functions?’ First-time users of ggplot2 are often confused by the syntax, yet it is precisely this syntax built on the philosophy of the grammar of graphics that makes ggplot2 so powerful. Adding content layers to mapped objects are central to this idea, which allows linking of map aesthetics through a logical framework. Additionally, several packages have been developed around this philosophy to extend the functionality of ggplot2 in alternative applications (e.g., ggmap, GGally, ggthemes).

I recently gave a presentation to describe some of my favorite features of ggplot2 and other packages building on its core concepts. I describe the use of facets for multi-panel plots, default and custom themes, ggmap for spatial mapping with ggplot2, and GGally for generalized pairs plots. Although this is certainly a subjective and incomplete list, my workflows have become much more efficient (and enjoyable) by using these tools. Below is a link to the presentation. Note that this will not load using internet explorer and you may have to reload if using Chrome to get the complete slide deck. This is my first time hosting a Slidify presentation on RPubs, so please bear with me. The presentation materials are also available at Github.

ggplot2 redux.

What are some of your favorite features of ggplot2??

Cheers,

Marcus

A nifty area plot (or a bootleg of a ggplot geom)

The ideas for most of my blogs usually come from half-baked attempts to create some neat or useful feature that hasn’t been implemented in R. These ideas might come from some analysis I’ve used in my own research or from some other creation meant to save time. More often than not, my blogs are motivated by data visualization techniques meant to provide useful ways of thinking about analysis results or comparing pieces of information. I was particularly excited (anxious?) after I came across this really neat graphic that ambitiously portrays the progression of civilization since ca. BC 2000. The original image was created John B. Sparks and was first printed by Rand McNally in 1931.1 The ‘Histomap’ illustrates the rise and fall of various empires and civilizations through an increasing time series up to present day. Color-coded chunks at each time step indicate the relative importance or dominance of each group. Although the methodology by which ‘relative dominance’ was determined is unclear, the map provides an impressive synopsis of human history.

histomap_ex
Fig: Snippet of the histomap. See the footnote for full link.1


Historical significance aside, I couldn’t help but wonder if a similar figure could be reproduced in R (I am not a historian). I started work on a function to graphically display the relative importance of categorical data across a finite time series. After a few hours of working, I soon realized that plot functions in the ggplot2 package can already accomplish this task. Specifically, the geom_area ‘geom’ provides a ‘continuous analog of a stacked bar chart, and can be used to show how composition of the whole varies over the range of x’.2 I wasn’t the least bit surprised that this functionality was already available in ggplot2. Rather than scrapping my function entirely, I decided to stay the course in the naive hope that a stand-alone function that was independent of ggplot2 might be useful. Consider this blog my attempt at ripping-off some ideas from Hadley Wickham.

The first question that came to mind when I started the function was the type of data to illustrate. The data should illustrate changes in relative values (abundance?) for different categories or groups across time. The only assumption is that the relative values are all positive. Here’s how I created some sample data:

#create data
set.seed(3)

#time steps
t.step<-seq(0,20)

#group names
grps<-letters[1:10]

#random data for group values across time
grp.dat<-runif(length(t.step)*length(grps),5,15)

#create data frame for use with plot
grp.dat<-matrix(grp.dat,nrow=length(t.step),ncol=length(grps))
grp.dat<-data.frame(grp.dat,row.names=t.step)
names(grp.dat)<-grps

The code creates random data from a uniform distribution for ten groups of variables across twenty time steps. The approach is similar to the method used in my blog about a nifty line plot. The data defined here can also be used with the line plot as an alternative approach for visualization. The only difference between this data format and the latter is that the time steps in this example are the row names, rather than a separate column. The difference is trivial but in hindsight I should have kept them the same for compatibility between plotting functions. The data have the following form for the first four groups and first three time steps.

a b c d
0 6.7 5.2 6.7 6.0
1 13.1 6.3 10.7 12.7
2 8.8 5.9 9.2 8.0
3 8.3 7.4 7.7 12.7

The plotting function, named plot.area, can be imported and implemented as follows (or just copy from the link):

source("https://gist.github.com/fawda123/6589541/raw/8de8b1f26c7904ad5b32d56ce0902e1d93b89420/plot_area.r")

plot.area(grp.dat)
Fig: The plot.area function in action.


The function indicates the relative values of each group as a proportion from 0-1 by default. The function arguments are as follows:

x data frame with input data, row names are time steps
col character string of at least one color vector for the groups, passed to colorRampPalette, default is ‘lightblue’ to ‘green’
horiz logical indicating if time steps are arranged horizontally, default F
prop logical indicating if group values are displayed as relative proportions from 0–1, default T
stp.ln logical indicating if parallel lines are plotted indicating the time steps, defualt T
grp.ln logical indicating if lines are plotted connecting values of individual groups, defualt T
axs.cex font size for axis values, default 1
axs.lab logical indicating if labels are plotted on each axis, default T
lab.cex font size for axis labels, default 1
names character string of length three indicating axis names, default ‘Group’, ‘Step’, ‘Value’
... additional arguments to passed to par

I arbitrarily chose the color scheme as a ramp from light blue to green. Any combination of color values as input to colorRampPalette can be used. Individual colors for each group will be used if the number of input colors is equal to the number of groups. Here’s an example of another color ramp.

plot.area(grp.dat,col=c('red','lightgreen','purple'))
Fig: The plot.area function using a color ramp from red, light green, to purple.


The function also allows customization of the lines that connect time steps and groups.

plot.area(grp.dat,col=c('red','lightgreen','purple'),grp.ln=F)

plot.area(grp.dat,col=c('red','lightgreen','purple'),stp.ln=F)
plt-ar3plt-ar4
Fig: The plot.area function with different line customizations.

Finally, the prop argument can be used to retain the original values of the data and the horiz argument can be used to plot groups horizontally.

plot.area(grp.dat,prop=F)

plot.area(grp.dat,prop=F,horiz=T)
plt-ar52plt-ar62
Fig: The plot.area function with changing arguments for prop and horiz.

Now is a useful time to illustrate how these graphs can be replicated in ggplot2. After some quick reshaping of our data, we can create a similar plot as above (full code here):

require(ggplot2)
require(reshape)

#reshape the data
p.dat<-data.frame(step=row.names(grp.dat),grp.dat,stringsAsFactors=F)
p.dat<-melt(p.dat,id='step')
p.dat$step<-as.numeric(p.dat$step)

#create plots
p<-ggplot(p.dat,aes(x=step,y=value)) + theme(legend.position="none")
p + geom_area(aes(fill=variable)) 
p + geom_area(aes(fill=variable),position='fill')
Fig: The ggplot2 approach to plotting the data.


Same plots, right? My function is practically useless given the tools in ggplot2. However, the plot.area function is completely independent, allowing for easy manipulation of the source code. I’ll leave it up to you to decide which approach is most useful.

-Marcus

1http://www.slate.com/blogs/the_vault/2013/08/12/the_1931_histomap_the_entire_history_of_the_world_distilled_into_a_single.html
2http://docs.ggplot2.org/current/geom_area.html