A nifty line plot to visualize multivariate time series

A few days ago a colleague came to me for advice on the interpretation of some data. The dataset was large and included measurements for twenty-six species at several site-year-plot combinations. A substantial amount of effort had clearly been made to ensure every species at every site over several years was documented. I don’t pretend to hide my excitement of handling large, messy datasets, so I offered to lend a hand. We briefly discussed a few options and came to the conclusion that simple was better, at least in an exploratory sense. The challenge was to develop a plot that was informative, but also maintained the integrity of the data. Specifically, we weren’t interested in condensing information using synthetic axes created from a multivariate soup. We wanted some way to quickly identify trends over time for multiple species.

I was directed towards the excellent work by the folks at Gallup and Healthways to create a human ‘well-being index’ or WBI. This index is a composite of six different sub-indices (e.g., physical health, work environment, etc.) that provides a detailed and real-time view of American health. The annual report provides many examples of elegant and informative graphs, which we used as motivation for our current problem. In particular, page 6 of the report has a figure on the right-hand side that shows the changes in state WBI rankings from 2011 to 2012. States are ranked by well-being in descending order with lines connecting states between the two years. One could obtain the same conclusions by examining a table but the figure provides a visually pleasing and entertaining way of evaluating several pieces of information.

I’ll start by explaining the format of the data we’re using. After preparation of the raw data with plyr and reshape2 by my colleague, a dataset was created with multiple rows for each time step and multiple columns for species, indexed by site. After splitting the data frame by site (using split), the data contained only the year and species data. The rows contained species frequency occurrence values for each time step. Here’s an example for one site (using random data):

step sp1 sp2 sp3 sp4
2003 1.3 2.6 7.7 3.9
2004 3.9 4.2 2.5 1.6
2005 0.4 2.6 3.3 11.0
2006 6.9 10.9 10.5 8.4

The actual data contained a few dozen species, not to mention multiple tables for each site. Sites were also designated as treatment or control. Visual examination of each table to identify trends related to treatment was not an option given the abundance of the data for each year and site.

We’ll start by creating a synthetic dataset that we’ll want to visualize. We’ll pretend that the data are for one site, since the plot function described below handles sites individually.

#create random data
set.seed(5)

#time steps
step<-as.character(seq(2007,2012))

#species names
sp<-paste0('sp',seq(1,25))

#random data for species frequency occurrence
sp.dat<-runif(length(step)*length(sp),0,15)

#create data frame for use with plot
sp.dat<-matrix(sp.dat,nrow=length(step),ncol=length(sp))
sp.dat<-data.frame(step,sp.dat)
names(sp.dat)<-c('step',sp)

The resulting data frame contains six years of data (by row) with randomized data on frequency occurrence for 25 species (every column except the first). In order to make the plot interesting, we can induce a correlation of some of our variables with the time steps. Otherwise, we’ll be looking at random data which wouldn’t show the full potential of the plots. Let’s randomly pick four of the variables and replace their values. Two variables will decrease with time and two will increase.

#reassign values of four variables

#pick random species names
vars<-sp[sp %in% sample(sp,4)]

#function for getting value at each time step
time.fun<-function(strt.val,steps,mean.val,lim.val){
	step<-1
	x.out<-strt.val
	while(step<steps){
		x.new<-x.out[step] + rnorm(1,mean=mean.val)
		x.out<-c(x.out,x.new)
		step<-step+1
		}
	if(lim.val<=0) return(pmax(lim.val,x.out))
	else return(pmin(lim.val,x.out))
	}

#use function to reassign variable values
sp.dat[,vars[1]]<-time.fun(14.5,6,-3,0) #start at 14.5, decrease rapidly
sp.dat[,vars[2]]<-time.fun(13.5,6,-1,0) #start at 13.5, decrease less rapidly
sp.dat[,vars[3]]<-time.fun(0.5,6,3,15) #start at 0.5, increase rapidly
sp.dat[,vars[4]]<-time.fun(1.5,6,1,15) #start at 1.5, increase less rapidly

The code uses the sample function to pick the species in the data frame. Next, I’ve written a function that simulates random variables that either decrease or increase for a given number of time steps. The arguments for the function are strt.val for the initial starting value at the first time step, steps are the total number of time steps to return, mean.val determines whether the values increase or decrease with the time steps, and lim.val is an upper or lower limit for the values. Basically, the function returns values at each time step that increase or decrease by a random value drawn from a normal distribution with mean as mean.val. Obviously we could enter the data by hand, but this way is more fun. Here’s what the data look like.

Now we can use the plot to visualize changes in species frequency occurrence for the different time steps. We start by importing the function, named plot.qual, into our workspace.

require(devtools)
source_gist('5281518')

par(mar=numeric(4),family='serif')
plot.qual(sp.dat,rs.ln=c(3,15))

The figure shows species frequency occurrence from 2007 to 2012. Species are ranked in order of decreasing frequency occurrence for each year, with year labels above each column. The lines connect the species between the years. Line color is assigned based on the ranked frequency occurrence values for species in the first time step to allow better identification of a species across time. Line width is also in proportion to the starting frequency occurrence value for each species at each time step. The legend on the bottom indicates the frequency occurrence values for different line widths.

We can see how the line colors and widths help us follow species trends. For example, we randomly chose species two to decrease with time. In 2007, we see that species two is ranked as the highest frequency occurrence among all species. We see a steady decline for each time step if we follow the species through the years. Finally, species two was ranked the lowest in 2012. The line widths also decrease for species two at each time step, illustrating the continuous decrease in frequency occurrence. Similarly, we randomly chose species 15 to increase with each time step. We see that it’s second to lowest in 2007 and then increases to third highest by 2012.

We can use the sp.names argument to isolate species of interest. We can clearly see the changes we’ve defined for our four random species.

plot.qual(sp.dat,sp.names=vars)

In addition to requiring the RColorBrewer and scales packages, the plot.qual function has several arguments that affect plotting:

x data frame or matrix with input data, first column is time step
x.locs minimum and maximum x coordinates for plotting region, from 0–1
y.locs minimum and maximum y coordinates for plotting region, from 0–1
steps character string of time steps to include in plot, default all
sp.names character string of species connections to display, default all
dt.tx logical value indicating if time steps are indicated in the plot
rsc logical value indicating if line widths are scaled proportionally to their value
ln.st numeric value for distance of lines from text labels, distance is determined automatically if not provided
rs.ln two-element numeric vector for rescaling line widths, defaults to one value if one element is supplied, default 3 to 15
ln.cl character string indicating color of lines, can use multiple colors or input to brewer.pal, default ‘RdYlGn’
alpha.val numeric value (0–1) indicating transparency of connections, default 0.7
leg logical value for plotting legend, default T, values are for the original data frame, no change for species or step subsets
... additional arguments to plot

I’ve attempted to include some functionality in the arguments, such as the ability to include date names and a legend, control over line widths, line color and transparency, and which species or years are shown. A useful aspect of the line coloring is the ability to incorporate colors from RColorBrewer or multiple colors as input to colorRampPalette. The following plots show some of these features.

par(mar=c(0,0,1,0),family='serif')
plot.qual(sp.dat,rs.ln=6,ln.cl='black',alpha=0.5,dt.tx=F,
	main='No color, no legend, no line rescaling')
#legend is removed if no line rescaling

par(mfrow=c(1,2),mar=c(0,0,1,0),family='serif')
plot.qual(sp.dat,steps=c('2007','2012'),x.locs=c(0.1,0.9),leg=F)
plot.qual(sp.dat,steps=c('2007','2012'),sp.names=vars,x.locs=c(0.1,0.9),leg=F)
title('Plot first and last time step for different species',outer=T,line=-1)

par(mar=c(0,0,1,0),family='serif')
plot.qual(sp.dat,y.locs=c(0.05,1),ln.cl=c('lightblue','purple','green'),
	main='Custom color ramp')

These plots are actually nothing more than glorified line plots, but I do hope the functionality and display helps better visualize trends over time. However, an important distinction is that the plots show ranked data rather than numerical changes in frequency occurrence as in a line plot. For example, species twenty shows a decrease in rank from 2011 to 2012 (see the second plot) yet the actual data in the table shows a continuous increase across all time steps. This does not represent an error in the plot, rather it shows a discrepancy between viewing data qualitatively or quantitatively. In other words, always be aware of what a figure actually shows you.

Please let me know if any bugs are encountered or if there are any additional features that could improve the function. Feel free to grab the code here.

About these ads

14 thoughts on “A nifty line plot to visualize multivariate time series

  1. Hi,

    First of all, great job. I am a PhD student and i am noticing that my interest lie more and more in the data analysis/visualization/presentation. I find this plot to be a very potent way of showing the type of data i am working with.

    I wanted to ask, as i do not fully follow your function, a few clarifying questions.
    1. The thickness of the line is the value at a given time step or the absolute change between two time steps (=columns)
    2. Is there a way to sort the column according to the value of each species?

    What i am planning on doing is having some species abundance data for 5-6 depths. I would like to show, that certain species abundance decreases with depth (the depth being my columns = steps, and the abundance the values). For that i would like to have each depth sorter from the highest to lowest abundance and the line thickness to represent the absolute change in the abundance.

    I would appreciate any

    • Hi Piotr,

      Thanks for your interest. To answer your questions:

      1. The line thickness values are based on a rescaling of all species values for all time steps relative to each other. If you look at line 24 ( https://gist.github.com/fawda123/5281518) you can see what I mean: x[,sp.col]<-rescale(x[,sp.col],rs.ln). The values between the time steps in the plots are the rescaled values for each time time step relative to all time steps. For example, the line thickness between 2007 and 2008 are the rescaled species values for 2007 relative to all other years. So, the plot actually doesn’t display data for the final time step in the lines, just the final rankings in the column.

      2. The columns are sorted based on descending values for species abundances at each time step, so this may be what you need. Perhaps I’m misunderstanding you, though. Feel free to elaborate.

      -Marcus

  2. Hey, Great job !
    a nifty visualization indeed !

    I ran your exact code from this article and i cant make it work. i’m getting the tables right but no lines are drawn. i installed the required packages.
    I’ve uploaded a picture for you to see (http://imgur.com/kYPRfZB).
    Thanks.

    • Hi there,

      Sorry it didn’t work! I can’t be of much help without more info… what version of R are you using, what operating system, etc? Perhaps try saving the plot as a pdf.

      -Marcus

      • I’m running R 2.15.2 (2012-10-26) with Windows 7 Home Premium (64 bit). Also using RStudio 0.97.316. Have you tried saving to pdf?

      • It’s working now and it’s looking great !
        You’r example worked when i exported to PDF.
        When i ran my example i used reshape.cast to prepare the data. so after a few attempts i realized i had to explicitly cast the output to data.frame and it worked. after playing around for a bit i also made it work from the console as well.

        I got delivery data. i used this graph to visualize the number of customers for each route in each day of the week. one thing i would really think could be killer here is if i could use the thickness of the lines to represent the amount of products delivered. this way i could visualize two important pieces of data on the same graph !

        if you have any intention of enabling that in the future i would be great full ! ( i would try and do it my self but i’m still a beginner in R).

        Thanks, Haki.

      • Awesome, glad you got it working. I thought I had tested the function with a matrix, so not sure why it didn’t work.

        Regards to your question… you’re saying you still want the route rankings to indicate number of customers (i.e., where the lines go, and order of columns), but line thickness should indicate number of shipments per route? I’d have to think about how this could be implemented in the function… might be tough, but not impossible.

    • I like the code. I got it working. I have many “species” in my data. How can I make it only plot specific species? Something like sp.names, but forces the plot to only plot a subset of the species.

      In your example, it would only plot sp. 2,11,15, and 20.

  3. A nifty area plot (or a bootleg of a ggplot geom) – R is my friend

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s