A nifty line plot to visualize multivariate time series

A few days ago a colleague came to me for advice on the interpretation of some data. The dataset was large and included measurements for twenty-six species at several site-year-plot combinations. A substantial amount of effort had clearly been made to ensure every species at every site over several years was documented. I don’t pretend to hide my excitement of handling large, messy datasets, so I offered to lend a hand. We briefly discussed a few options and came to the conclusion that simple was better, at least in an exploratory sense. The challenge was to develop a plot that was informative, but also maintained the integrity of the data. Specifically, we weren’t interested in condensing information using synthetic axes created from a multivariate soup. We wanted some way to quickly identify trends over time for multiple species.

I was directed towards the excellent work by the folks at Gallup and Healthways to create a human ‘well-being index’ or WBI. This index is a composite of six different sub-indices (e.g., physical health, work environment, etc.) that provides a detailed and real-time view of American health. The annual report provides many examples of elegant and informative graphs, which we used as motivation for our current problem. In particular, page 6 of the report has a figure on the right-hand side that shows the changes in state WBI rankings from 2011 to 2012. States are ranked by well-being in descending order with lines connecting states between the two years. One could obtain the same conclusions by examining a table but the figure provides a visually pleasing and entertaining way of evaluating several pieces of information.

I’ll start by explaining the format of the data we’re using. After preparation of the raw data with plyr and reshape2 by my colleague, a dataset was created with multiple rows for each time step and multiple columns for species, indexed by site. After splitting the data frame by site (using split), the data contained only the year and species data. The rows contained species frequency occurrence values for each time step. Here’s an example for one site (using random data):

step	sp1	sp2	sp3	sp4
2003	1.3	2.6	7.7	3.9
2004	3.9	4.2	2.5	1.6
2005	0.4	2.6	3.3	11.0
2006	6.9	10.9	10.5	8.4

The actual data contained a few dozen species, not to mention multiple tables for each site. Sites were also designated as treatment or control. Visual examination of each table to identify trends related to treatment was not an option given the abundance of the data for each year and site.

We’ll start by creating a synthetic dataset that we’ll want to visualize. We’ll pretend that the data are for one site, since the plot function described below handles sites individually.

#create random data
set.seed(5)

#time steps
step<-as.character(seq(2007,2012))

#species names
sp<-paste0('sp',seq(1,25))

#random data for species frequency occurrence
sp.dat<-runif(length(step)*length(sp),0,15)

#create data frame for use with plot
sp.dat<-matrix(sp.dat,nrow=length(step),ncol=length(sp))
sp.dat<-data.frame(step,sp.dat)
names(sp.dat)<-c('step',sp)

The resulting data frame contains six years of data (by row) with randomized data on frequency occurrence for 25 species (every column except the first). In order to make the plot interesting, we can induce a correlation of some of our variables with the time steps. Otherwise, we’ll be looking at random data which wouldn’t show the full potential of the plots. Let’s randomly pick four of the variables and replace their values. Two variables will decrease with time and two will increase.

#reassign values of four variables

#pick random species names
vars<-sp[sp %in% sample(sp,4)]

#function for getting value at each time step
time.fun<-function(strt.val,steps,mean.val,lim.val){
	step<-1
	x.out<-strt.val
	while(step<steps){
		x.new<-x.out[step] + rnorm(1,mean=mean.val)
		x.out<-c(x.out,x.new)
		step<-step+1
		}
	if(lim.val<=0) return(pmax(lim.val,x.out))
	else return(pmin(lim.val,x.out))
	}

#use function to reassign variable values
sp.dat[,vars[1]]<-time.fun(14.5,6,-3,0) #start at 14.5, decrease rapidly
sp.dat[,vars[2]]<-time.fun(13.5,6,-1,0) #start at 13.5, decrease less rapidly
sp.dat[,vars[3]]<-time.fun(0.5,6,3,15) #start at 0.5, increase rapidly
sp.dat[,vars[4]]<-time.fun(1.5,6,1,15) #start at 1.5, increase less rapidly

The code uses the sample function to pick the species in the data frame. Next, I’ve written a function that simulates random variables that either decrease or increase for a given number of time steps. The arguments for the function are strt.val for the initial starting value at the first time step, steps are the total number of time steps to return, mean.val determines whether the values increase or decrease with the time steps, and lim.val is an upper or lower limit for the values. Basically, the function returns values at each time step that increase or decrease by a random value drawn from a normal distribution with mean as mean.val. Obviously we could enter the data by hand, but this way is more fun. Here’s what the data look like.

Now we can use the plot to visualize changes in species frequency occurrence for the different time steps. We start by importing the function, named plot.qual, into our workspace.

require(devtools)
source_gist('5281518')

par(mar=numeric(4),family='serif')
plot.qual(sp.dat,rs.ln=c(3,15))

The figure shows species frequency occurrence from 2007 to 2012. Species are ranked in order of decreasing frequency occurrence for each year, with year labels above each column. The lines connect the species between the years. Line color is assigned based on the ranked frequency occurrence values for species in the first time step to allow better identification of a species across time. Line width is also in proportion to the starting frequency occurrence value for each species at each time step. The legend on the bottom indicates the frequency occurrence values for different line widths.

We can see how the line colors and widths help us follow species trends. For example, we randomly chose species two to decrease with time. In 2007, we see that species two is ranked as the highest frequency occurrence among all species. We see a steady decline for each time step if we follow the species through the years. Finally, species two was ranked the lowest in 2012. The line widths also decrease for species two at each time step, illustrating the continuous decrease in frequency occurrence. Similarly, we randomly chose species 15 to increase with each time step. We see that it’s second to lowest in 2007 and then increases to third highest by 2012.

We can use the sp.names argument to isolate species of interest. We can clearly see the changes we’ve defined for our four random species.

plot.qual(sp.dat,sp.names=vars)

In addition to requiring the RColorBrewer and scales packages, the plot.qual function has several arguments that affect plotting:

`x`	data frame or matrix with input data, first column is time step
`x.locs`	minimum and maximum x coordinates for plotting region, from 0–1
`y.locs`	minimum and maximum y coordinates for plotting region, from 0–1
`steps`	character string of time steps to include in plot, default all
`sp.names`	character string of species connections to display, default all
`dt.tx`	logical value indicating if time steps are indicated in the plot
`rsc`	logical value indicating if line widths are scaled proportionally to their value
`ln.st`	numeric value for distance of lines from text labels, distance is determined automatically if not provided
`rs.ln`	two-element numeric vector for rescaling line widths, defaults to one value if one element is supplied, default 3 to 15
`ln.cl`	character string indicating color of lines, can use multiple colors or input to `brewer.pal`, default ‘RdYlGn’
`alpha.val`	numeric value (0–1) indicating transparency of connections, default 0.7
`leg`	logical value for plotting legend, default T, values are for the original data frame, no change for species or step subsets
`rnks`	logical value for showing only the ranks in the text labels
`...`	additional arguments to plot

I’ve attempted to include some functionality in the arguments, such as the ability to include date names and a legend, control over line widths, line color and transparency, and which species or years are shown. A useful aspect of the line coloring is the ability to incorporate colors from RColorBrewer or multiple colors as input to colorRampPalette. The following plots show some of these features.

par(mar=c(0,0,1,0),family='serif')
plot.qual(sp.dat,rs.ln=6,ln.cl='black',alpha=0.5,dt.tx=F,
	main='No color, no legend, no line rescaling')
#legend is removed if no line rescaling

par(mfrow=c(1,2),mar=c(0,0,1,0),family='serif')
plot.qual(sp.dat,steps=c('2007','2012'),x.locs=c(0.1,0.9),leg=F)
plot.qual(sp.dat,steps=c('2007','2012'),sp.names=vars,x.locs=c(0.1,0.9),leg=F)
title('Plot first and last time step for different species',outer=T,line=-1)

par(mar=c(0,0,1,0),family='serif')
plot.qual(sp.dat,y.locs=c(0.05,1),ln.cl=c('lightblue','purple','green'),
	main='Custom color ramp')

These plots are actually nothing more than glorified line plots, but I do hope the functionality and display helps better visualize trends over time. However, an important distinction is that the plots show ranked data rather than numerical changes in frequency occurrence as in a line plot. For example, species twenty shows a decrease in rank from 2011 to 2012 (see the second plot) yet the actual data in the table shows a continuous increase across all time steps. This does not represent an error in the plot, rather it shows a discrepancy between viewing data qualitatively or quantitatively. In other words, always be aware of what a figure actually shows you.

Please let me know if any bugs are encountered or if there are any additional features that could improve the function. Feel free to grab the code here.

22 thoughts on “A nifty line plot to visualize multivariate time series”

Piotr Starnawski (@Starnas88) says:

April 3, 2013 at 4:07 am

Hi,

First of all, great job. I am a PhD student and i am noticing that my interest lie more and more in the data analysis/visualization/presentation. I find this plot to be a very potent way of showing the type of data i am working with.

I wanted to ask, as i do not fully follow your function, a few clarifying questions.
1. The thickness of the line is the value at a given time step or the absolute change between two time steps (=columns)
2. Is there a way to sort the column according to the value of each species?

What i am planning on doing is having some species abundance data for 5-6 depths. I would like to show, that certain species abundance decreases with depth (the depth being my columns = steps, and the abundance the values). For that i would like to have each depth sorter from the highest to lowest abundance and the line thickness to represent the absolute change in the abundance.

I would appreciate any

Reply
- beckmw says:
  
  April 3, 2013 at 7:24 am
  
  Hi Piotr,
  
  Thanks for your interest. To answer your questions:
  
  1. The line thickness values are based on a rescaling of all species values for all time steps relative to each other. If you look at line 24 ( https://gist.github.com/fawda123/5281518) you can see what I mean: x[,sp.col]<-rescale(x[,sp.col],rs.ln). The values between the time steps in the plots are the rescaled values for each time time step relative to all time steps. For example, the line thickness between 2007 and 2008 are the rescaled species values for 2007 relative to all other years. So, the plot actually doesn’t display data for the final time step in the lines, just the final rankings in the column.
  
  2. The columns are sorted based on descending values for species abundances at each time step, so this may be what you need. Perhaps I’m misunderstanding you, though. Feel free to elaborate.
  
  -Marcus
  
  Reply
  - Piotr Starnawski (@Starnas88) says:
    
    April 8, 2013 at 3:49 am
    
    Hi Marcus,
    
    No, you got my point and the explanation was also clear. Thank you for that
    
    -Peter
haki says:

April 5, 2013 at 2:23 am

Hey, Great job !
a nifty visualization indeed !

I ran your exact code from this article and i cant make it work. i’m getting the tables right but no lines are drawn. i installed the required packages.
I’ve uploaded a picture for you to see (http://imgur.com/kYPRfZB).
Thanks.

Reply
- beckmw says:
  
  April 5, 2013 at 7:51 am
  
  Hi there,
  
  Sorry it didn’t work! I can’t be of much help without more info… what version of R are you using, what operating system, etc? Perhaps try saving the plot as a pdf.
  
  -Marcus
  
  Reply
  - haki says:
    
    April 5, 2013 at 8:02 am
    
    here you go :
    R version 2.15.3 (2013-03-01)
    Platform: i386-w64-mingw32/i386 (32-bit)
  - beckmw says:
    
    April 5, 2013 at 8:19 am
    
    I’m running R 2.15.2 (2012-10-26) with Windows 7 Home Premium (64 bit). Also using RStudio 0.97.316. Have you tried saving to pdf?
  - haki says:
    
    April 5, 2013 at 10:41 am
    
    It’s working now and it’s looking great !
    You’r example worked when i exported to PDF.
    When i ran my example i used reshape.cast to prepare the data. so after a few attempts i realized i had to explicitly cast the output to data.frame and it worked. after playing around for a bit i also made it work from the console as well.
    
    I got delivery data. i used this graph to visualize the number of customers for each route in each day of the week. one thing i would really think could be killer here is if i could use the thickness of the lines to represent the amount of products delivered. this way i could visualize two important pieces of data on the same graph !
    
    if you have any intention of enabling that in the future i would be great full ! ( i would try and do it my self but i’m still a beginner in R).
    
    Thanks, Haki.
  - beckmw says:
    
    April 5, 2013 at 12:04 pm
    
    Awesome, glad you got it working. I thought I had tested the function with a matrix, so not sure why it didn’t work.
    
    Regards to your question… you’re saying you still want the route rankings to indicate number of customers (i.e., where the lines go, and order of columns), but line thickness should indicate number of shipments per route? I’d have to think about how this could be implemented in the function… might be tough, but not impossible.
the armadillo says:

April 25, 2013 at 5:32 pm

I’d second Haki’s request–would be awesome to have that option.

Reply
- JT says:
  
  June 4, 2014 at 4:58 pm
  
  I like the code. I got it working. I have many “species” in my data. How can I make it only plot specific species? Something like sp.names, but forces the plot to only plot a subset of the species.
  
  In your example, it would only plot sp. 2,11,15, and 20.
  
  Reply
  - beckmw says:
    
    June 5, 2014 at 8:02 am
    Hi JT, could you just subset the original data frame before plotting?
    
    sub_cols <- c('step', 'sp2', 'sp11', 'sp15', 'sp20') plot.qual(sp.dat[, sub_cols], rs.ln=c(3,15))
  - JT says:
    
    June 6, 2014 at 7:41 am
    
    That worked.
A nifty area plot (or a bootleg of a ggplot geom) – R is my friend
ECT says:

February 11, 2015 at 12:27 pm

Hello,
I am enjoying these figures for plotting microbial community composition over time. I was wondering if there was any way to incorporate the final frequencies (or any values) for each species in the final column? The line thicknesses are informative for abundance throughout, but the last and final step is missing this visual. Thanks!

Reply
- beckmw says:
  
  February 16, 2015 at 9:46 am
  Hi ECT,
  
  This is a really quick way to do this w/o modifying the plot code… I may modify it in the future to replace the species values with arbitrary values. For now, you could maybe try this?
```
# get a sorted vector of the final values
library(dplyr) 
finals <- sp.dat[6, ] %>% 
	select(-step) %>% 
	c %>% 
	unlist %>% 
	sort %>% 
	round(1) %>% 
	c(., 'Final')

# plot as before but add an extra column at the end
par(mar=numeric(4),family='serif')
plot.qual(sp.dat,rs.ln=c(3,15), x.locs = c(0.1, 0.8))
x <- rep(0.9, times = ncol(sp.dat))
y <- seq(0.05, 1, length = ncol(sp.dat))
text(x, y, finals)
```
  Cheers,
  
  Marcus
  Reply
  - ECT says:
    
    February 16, 2015 at 11:51 am
    
    Thank you Marcus,
    In running this new code, I’m getting the error: could not find function “%>%” . Do you have any suggestions why?
    Thank you again,
    ECT
ECT says:

February 16, 2015 at 11:57 am

I’ve solved it. Thanks again. This is very helpful!

Reply
Sandra Alvarez says:

February 9, 2016 at 6:28 am

Hi,

Thanks so much for these code! It looks great, unfortunately I am not able to reproduce it myself. I am using R Studio Version 0.99.486 – © 2009-2015 on a Mac with OS X El Capitan version 10.11.1

I am able to follow all the code exactly as in your example, but when trying to plot I get this error:

Error in FUN(X[[i]], …) :
only defined on a data frame with all numeric variables

I tried both with my own data and with your the code from your example and get the same error, do you know what might be happening?

I have also tried it on R version 3.2.2 (2015-08-14) and get the same error.

Thanks a lot
Sandra

Reply
- beckmw says:
  
  February 9, 2016 at 8:59 am
  
  Hi Sandra,
  
  Try it again now. There was an issue with a newer version of the scales package.
  
  -Marcus
  
  Reply

He Marcus, I really like this function! Do you know how can I get rid of the species names completely and only retain the ranks?

Thank you.
Anni

Hi Anni,

I added a new argument that removes the species names. the ‘rnks’ argument set to TRUE will only show the ranks.

plot.qual(sp.dat, rnks = T)

	plot.qual<-function(x,x.locs=c(0.01,0.99),y.locs=c(0,1),steps=NULL,sp.names=NULL,dt.tx=T,rsc=T,
	ln.st=NULL,rs.ln=c(3,15),ln.cl='RdYlGn',alpha=0.7,leg=T,rnks=FALSE,…){

	require(RColorBrewer)
	require(scales)

	if(length(x.locs) != 2 \| length(y.locs) != 2)
	stop('x and y dimensions must be two-element vectors')

	if(x.locs[1]<0 \| x.locs[2]>1 \| y.locs[1]<0 \| y.locs[2]>1)
	stop('x and y dimensions must in range of 0–1')

	dim.x<-c(0,1) #plot dims x
	dim.y<-c(0,1) #plot dims y
	wrn.val<-F

	x[,1]<-as.character(x[,1])
	tot.sp<-ncol(x)-1
	sp.col<-2:ncol(x)

	#rescale if T, sort legend for later
	sp.orig<-x[,sp.col]
	if(length(rs.ln)==1) rsc<-F
	if(rsc) x[,sp.col]<-rescale(as.matrix(x[,sp.col]),rs.ln)
	if(rsc==F & leg) leg<-F #no legend if line widths aren't rescaled

	#reorder species columns, add rank as integer
	first.ord<-order(x[1,sp.col],decreasing=T)
	x[,sp.col]<-x[,sp.col][,first.ord]
	names(x)[sp.col]<-names(x)[sp.col][first.ord]
	names(x)[sp.col]<-paste(1:tot.sp,names(x)[sp.col],sep='. ')

	#list of spp by date, arranged in decreasing order for each date
	dt.dat.srt<-vector('list',nrow(x))
	names(dt.dat.srt)<-x[,1]
	for(val in 1:nrow(x)){
	tmp<-t(x[val,sp.col])
	tmp<-tmp[order(tmp,decreasing=T),,drop=F]
	dt.dat.srt[[val]]<-tmp
	}

	#initiate plot object
	plot(dim.x,dim.y,type='n',axes=F,xlab='',ylab='',…)

	#subset for steps, if provided
	if(!is.null(steps)) dt.dat.srt<-dt.dat.srt[names(dt.dat.srt) %in% steps]

	#plot legend
	if(leg){
	y.locs[1]<-0.05*diff(y.locs)+y.locs[1]
	leg.txt<-format(round(seq(min(sp.orig),max(sp.orig),length=5),2),nsmall=2,digits=2)
	leg.wds<-seq(rs.ln[1],rs.ln[2],length=5)
	legend('bottom',(y.locs[1]-y.olds)/2,col=alpha('black',alpha),lwd=leg.wds,legend=leg.txt,bty='n',
	horiz=T)
	}

	#x locations
	x.vals<-rep(seq(x.locs[1],x.locs[2],length=length(dt.dat.srt)),each=tot.sp)
	x.vals<-split(x.vals,x.vals)

	#y locations, rearranged in loop, exception if dates are plotted
	if(dt.tx) y.vals<-rev(seq(y.locs[1],y.locs[2],length=tot.sp+1))[-1]
	else y.vals<-rev(seq(y.locs[1],y.locs[2],length=tot.sp))

	#get line colors
	if(length(ln.cl)==1)
	if(ln.cl %in% row.names(brewer.pal.info)){
	pal.num<-brewer.pal.info[row.names(brewer.pal.info) == ln.cl,1]
	ln.cl<-brewer.pal(pal.num,ln.cl)
	}
	line.cols<-alpha(colorRampPalette(ln.cl)(tot.sp),alpha)

	#define distance of lines from labels
	if(is.null(ln.st)){
	str.max<-max(strwidth(row.names(dt.dat.srt[[1]])))
	if(diff(x.locs)-length(dt.dat.srt)*str.max < 0){
	warning('not enough space for lines between columns')
	wrn.val<-T
	}
	else
	ln.st<-0.2*str.max + str.max/2
	}

	for(val in 1:(length(dt.dat.srt)-1)){

	#temp data to plot
	plt.tmp<-dt.dat.srt[c(val,val+1)]
	x.tmp<-x.vals[c(val,val+1)]

	#plot temp text for column, remove spp if rnks
	rowtxt <- row.names(plt.tmp[[1]])
	if(rnks)
	rowtxt <- gsub('([1-9])\\..', '\\1', rowtxt)
	text(x.tmp[[1]],y.vals,rowtxt)

	if(val == length(dt.dat.srt)-1){
	rowtxt <- row.names(plt.tmp[[2]])
	if(rnks)
	rowtxt <- gsub('([1-9])\\..', '\\1', rowtxt)
	text(x.tmp[[2]],y.vals,rowtxt)
	if(dt.tx){
	dt.txt<-substitute(italic(x),list(x=names(plt.tmp)[2]))
	text(unique(x.tmp[[2]]),y.locs[2],dt.txt)
	}
	}

	if(dt.tx){
	dt.txt<-substitute(italic(x),list(x=names(plt.tmp)[1]))
	text(unique(x.tmp[[1]]),y.locs[2],dt.txt)
	}

	srt.ln.y<-match(row.names(plt.tmp[[1]]),row.names(plt.tmp[[2]]))

	#if no line rescale, use first element of rs.ln
	if(rsc) lwd.val<-plt.tmp[[1]][,1]
	else lwd.val<-rep(rs.ln[1],tot.sp)

	#vector for species selection of line segments
	if(is.null(sp.names)) sel.sp<-rep(T,tot.sp)
	else{
	sel.names<-unlist(lapply(strsplit(row.names(plt.tmp[[1]]),' '),function(x) x[2]))
	sel.sp<-(sel.names %in% sp.names)
	}

	#for lines
	if(!wrn.val)
	segments(
	x.tmp[[1]][sel.sp]+ln.st,
	y.vals[sel.sp],
	x.tmp[[2]][sel.sp]-ln.st,
	y.vals[srt.ln.y][sel.sp],
	col=line.cols[sel.sp],
	lwd=lwd.val[sel.sp]
	)

	#resort color vector for next colummn
	srt.cl.y<-match(row.names(plt.tmp[[2]]),row.names(plt.tmp[[1]]))
	line.cols<-line.cols[srt.cl.y]

	}

	}

view raw

plot_qual.r

hosted with ❤ by GitHub

-Marcus

R is my friend

'R challenge is like the Rubik's cube of our people' – Sarah Thompson

A nifty line plot to visualize multivariate time series

22 thoughts on “A nifty line plot to visualize multivariate time series”

Leave a comment Cancel reply

Share this:

Related

22 thoughts on “A nifty line plot to visualize multivariate time series”

Leave a comment Cancel reply