Animating neural networks from the nnet package

My research has allowed me to implement techniques for visualizing multivariate models in R and I wanted to share some additional techniques I’ve developed, in addition to my previous post. For example, I think a primary obstacle towards developing a useful neural network model is an under-appreciation of the effects model parameters have on model performance. Visualization techniques enable us to more fully understand how these models are developed, which can lead to more careful determination of model parameters.

Neural networks characterize relationships between explanatory and response variables through a series of training iterations. Training of the model is initiated using a random set of weighted connections between the layers. The predicted values are compared to the observed values in the training dataset. Based on the errors or differences between observed and predicted, the weights are adjusted to minimize prediction error using a ‘back-propagation algorithm’1. The process is repeated with additional training iterations until some stopping criteria is met, such as a maximum number of training iterations set by the user or convergence towards a minimum error value. This process is similar in theory to the fitting of least-squares regression lines, although neural networks incorporate many more parameters. Both methods seek to minimize an error criterion.

The development of neural networks for prediction also requires he use of test datasets for evaluating the predictive performance of the trained models. Neural networks are so effective at relating variables that they commonly characterize relationships in the training data that have no actual relevance. In other words, these models identify and characterize noise in the data rather than actual (i.e., biological, ecological, physical, etc.) relationships among variables. This concept is similar to creating a loess function with a decreasingly restrictive smoothing parameter (span argument) to model a polynomial relationship between two variables (code here):

The loess function from top left to bottom right shows how a model can be overfit. The function becomes increasingly overfit to the observed data such that the modeled relationship between x and y characterizes peculiarities of the data that have no actual relevance. The function with a decreasingly restrictive smoothing parameter might better approximate the observed points (lower root mean square error) but the ability to generalize is compromised (i.e., bias-variance tradeoff).

A neural network model that is not overfit to the training data should predict observations in a test dataset that are close to the observed. Textbooks often indicate the importance of evaluating model performance on a test dataset2, yet methods are not available in R for visualizing model performance as a function of training iterations. This blog will present an approach for visualizing the training process of neural network models using the plot.nnet function in my previous blog. I’ll also be creating animations using the animation package created by Yihui Xie.

Let’s start by creating our simulated dataset (same as my previous posts). We create eight random variables with an arbitrary correlation structure and then create a response variable as a linear combination of the eight random variables. The parameters describing the true relationship between the response and explanatory variables are known and were drawn randomly from a uniform distribution.

require(clusterGeneration)
require(nnet)

set.seed(2)
num.vars<-8
num.obs<-1000

#arbitrary correlation matrix and random variables
cov.mat<-genPositiveDefMat(num.vars,covMethod=c("unifcorrmat"))$Sigma
rand.vars<-mvrnorm(num.obs,rep(0,num.vars),Sigma=cov.mat)
parms<-runif(num.vars,-10,10)

#response variable as linear combination of random variables and random error term
y<-rand.vars %*% matrix(parms) + rnorm(num.obs,sd=20)

#prepare data for use with nnet and plot.nnet
rand.vars<-data.frame(rand.vars)
y<-data.frame((y-min(y))/(max(y)-min(y)))
names(y)<-'y'

We’ve rescaled our response variable from 0-1, as in the previous blog, so we can make use of a logistic transfer function. Here we use a linear transfer function as before, so rescaling is not necessary but this does allow for comparability should we choose to use a logistic transfer function.

Now we import the plot function from Github (thanks to Harald’s tip in my previous post).

require(devtools)
source_gist('5086859') #loaded as 'plot.nnet' in the workspace

Now we’re going to train a neural network model from one to 100 training iterations to see how error changes for a training and test dataset. The nnet function returns training errors but only after each of ten training iterations (trace argument set to true). We could plot these errors but the resolution is rather poor. We also can’t track error for the test dataset. Instead, we can use a loop to train the model up to a given set of iterations. After each iteration, we can save the error for each dataset in an object. The loop will start again with the maximum number of training iterations as n + 1, where n was the maximum number of training iterations in the previous loop. This is not the most efficient way to track errors but it works for our purposes.

First, we’ll have to define a vector that identifies our training and test dataset so we can evaluate the ability of our model to generalize. The nnet function returns the final error as the sums of squares for the model residuals. We’ll use that information to calculate the root mean square error (RMSE). Additionally, we’ll have to get model predictions to determine RMSE for the test dataset.

#get vector of T/F that is used to subset data for training
set.seed(3) 
train.dat<-rep(F,num.obs)
train.dat[sample(num.obs,600)]<-T

num.it<-100 #number of training iterations

err.dat<-matrix(ncol=2,nrow=num.it) #object for appending errors

for(i in 1:num.it){
	
	#to monitor progress
	cat(i,'\n') 
	flush.console()
	
	#to ensure same set of random starting weights are used each time
	set.seed(5)
	
	#temporary nnet model
	mod.tmp<-nnet(rand.vars[train.dat,],y[train.dat,],size=10,linout=T,maxit=i,trace=F)
	
	#training and test errors
	train.err<-sqrt(mod.tmp$value/sum(train.dat))
	test.pred<-predict(mod.tmp,newdata=rand.vars[!train.dat,])
	test.err<-sqrt(sum((test.pred-y[!train.dat,])^2)/sum(!train.dat))
	
	#append error after each iteration
	err.dat[i,]<-c(train.err,test.err)
		
	}

The code fills a matrix of error values with rows for one to 100 training iterations. The first column is the prediction error for the training dataset and the second column is the prediction error for the test dataset.

We can use a similar approach to create our animation by making calls to plot.nnet as training progresses. Similarly, we can plot the error data from our error matrix to see how error changes after each training iteration. We’ll create our animation using the saveHTML function from the animation package. One challenge with the animation package is that it requires ImageMagick. The functions will not work unless this software is installed on your machine. ImageMagick is open source and easy to install, the only trick is to make sure the path where the software is installed is added to your system’s list of environmental variables. This blog offers a fantastic introduction to using ImageMagick with R.

This code can be used to create an HTML animation once ImageMagick is installed.

#make animation
require(animation)

num.it<-100

#this vector is needed for proper scaling of weights in the plot 
max.wt<-0
for(i in 1:num.it){
  set.seed(5)
  mod.tmp<-nnet(rand.vars[train.dat,],y[train.dat,,drop=F],size=10,linout=T,maxit=i-1,trace=F)
  if(max(abs(mod.tmp$wts))>max.wt) max.wt<-max(abs(mod.tmp$wts))
  }

ani.options(outdir = 'C:/Users/Marcus/Desktop/') #your local path
saveHTML(
  {
  for(i in 1:num.it){
  
  	#plot.nnet
    par(mfrow=c(1,2),mar=c(0,0,1,0),family='serif')
    set.seed(5)
    mod.tmp<-nnet(rand.vars[train.dat,],y[train.dat,,drop=F],size=10,linout=T,maxit=i-1,trace=F)
    rel.rsc.tmp<-1+6*max(abs(mod.tmp$wts))/max(max.wt)    #this vector sets scale for line thickness
  	plot.nnet(mod.tmp,nid=T,rel.rsc=rel.rsc.tmp,alpha.val=0.8,
  		pos.col='darkgreen',neg.col='darkblue',circle.col='brown',xlim=c(0,100))

  	#plot error
    plot(seq(0,num.it)[c(1:i)],err.dat[,2][c(1:i)],type='l',col='darkgrey',bty='n',ylim=c(-0.1,0.5),
    	xlim=c(-15,num.it+15),axes=F,ylab='RMSE',xlab='Iteration',lwd=2,cex.lab=1.3)
    axis(side=1,at=seq(0,num.it,length=5),line=-4)
    mtext(side=1,'Iteration',line=-2)
    axis(side=2,at=c(seq(0,0.4,length=5)),line=-3)
    mtext(side=2,'RMSE',line=-1)
    lines(seq(0,num.it)[c(1:i)],err.dat[c(1:i),1],type='l',col='red',lwd=2)
    lines(seq(0,num.it)[c(1:i)],err.dat[c(1:i),2],type='l',col='green',lwd=2)
  
  	title(paste('Iteration',i,'of',num.it),outer=T,line=-2)
  	}
  },
  
  interval = 0.15, ani.width = 700, ani.height = 350, ani.opts='controls'
  
  )

The first argument to saveHTML is an expression where we can include anything we want to plot. We’ve created an animation of 100 training iterations by successively increasing the total number of training iterations using a loop. The code prior to saveHTML is not necessary for plotting but ensures that the line widths for the neural interpretation diagram are consistent from the first to last training iteration. In other words, scaling of the line widths is always in proportion to the maximum weight that was found in all of the training iterations. This creates a plot that allows for more accurate visualization of weight changes. The key arguments in the call to saveHTML are interval (frame rate), ani.width (width of animation), and ani.height (height of animation). You can view the animation by clicking on the image below.

The animation on the left shows the weight changes with additional training iterations using a neural interpretation diagram. Line color and width is proportional to sign and magnitude of the weight. Positive weights are dark green and negative weights are dark blue. The animation on the right shows the prediction errors (RMSE) for the training (red) and test (green) datasets. Note the rapid decline in error with the first few training iterations, then marginal improvement with additional training. Also note the increase in error for the test dataset.

We can also look at the predicted values compared to the observed values as a function of model training. The values should be very similar if our neural network model has sufficiently characterized the relationships between our variables. We can use the saveHTML function again to create the animation for a series of loops through the training iterations. We’ll plot the predicted values against the observed values for each dataset with some help from ggplot2.

#make animations, pred vs obs
require(ggplot2)

#create color scheme for group points, overrides default for ggplot
myColors<-c('red','green')
names(myColors)<-c('training','test')
colScale<-scale_colour_manual(name = "grps",values = myColors)

#function for getting rmse of prediction, used to plot text
rmse.fun<-function(dat.in,sub.in)
	format(sqrt(sum((dat.in[sub.in,]-y[sub.in,])^2)/sum(sub.in)),nsmall=4,digits=4)

options(outdir = 'C:/Users/Marcus/Desktop/') #your local path
saveHTML(
  {
  for(i in 1:num.it){
  
     #temporary nnet mod for the loop
     mod.tmp<-nnet(rand.vars[train.dat,],y[train.dat,,drop=F],size=10,linout=T,maxit=i,trace=F)
    
     #predicted values and factor variable defining training and test datasets
     pred.vals<-predict(mod.tmp,new=rand.vars)	
     sub.fac<-rep('test',length=num.obs)
     sub.fac[train.dat]<-'training'
     sub.fac<-factor(sub.fac,c('training','test'))
    
     #data frame to plot
     plot.tmp<-data.frame(cbind(sub.fac,pred.vals,y))
     names(plot.tmp)<-c('Group','Predicted','Observed')
  	
     #plot title
     title.val<-paste('Iteration',i,'of',num.it)  
  
     #RMSE of prediction, used for plot text, called using geom_text in ggplot
     rmse.val<-data.frame(
        Group=factor(levels(sub.fac),c('training','test')),
        rmse=paste('rmse =',c(rmse.fun(pred.vals,train.dat),rmse.fun(pred.vals,!train.dat))),
        x.val=rep(0.9,2),
        y.val=rep(0.1,2)
        )
  	
     #ggplot object
     p.tmp<-ggplot(data=plot.tmp, aes(x=Observed,y=Predicted,group=Group,colour=Group)) + 
        geom_abline() +
        geom_point(size=6,alpha=0.5) +
        scale_x_continuous(limits=c(0,1)) +
        scale_y_continuous(limits=c(0,1)) + 
        facet_grid(.~Group) +
        theme_bw() +
        colScale +
        geom_text(data=rmse.val, aes(x=x.val,y=y.val,label=rmse,group=Group),colour="black",size=4) +
        ggtitle(title.val) +
        theme(
           plot.title=element_text(vjust=2,size=18),
           legend.position='none'
           )
  	
     print(p.tmp)
  	
     }
  	
  },
  	
  interval = 0.15, ani.width = 700, ani.height = 350, ani.opts='controls'
  	
  )

Click the image below to view the animation.

The animation on the left shows the values for the training dataset and the animation on the right shows the values for the test dataset. Both datasets are poorly predicted in the first ten or so training iterations. Predictions stabilize and approximate the observed values with additional training. The model seems to have good abilities to generalize on the test dataset once training is completed. However, notice the RMSE values for each dataset and how they change with training. RMSE values are always decreasing for the training dataset, whereas RMSE values decrease initially for the test dataset but slowly begin to increase with additional training. This increase in RMSE for the test dataset is an early indication that overfitting may occur. Our first animation showed similar results. What would happen if we allowed training to progress up to 2000 training iterations? Over-training will improve predictive performance on the training dataset at the expense of performance on the test dataset. The model will be overfit and unable to generalize meaningful relationships among variables. The point is, always be cautious of models that have not been evaluated with an independent dataset.

1Rumelhart, D.E., Hinton, G.E., Williams, R.J. 1986. Learning representations by back-propagating errors. Nature. 323(6088):533-536.
2Venables, W.N., Ripley, B.D. 2002. Modern Applied Statistics with S, 4th ed. Springer, New York.

Visualizing neural networks from the nnet package

Note: Please see the update to this post!

Neural networks have received a lot of attention for their abilities to ‘learn’ relationships among variables. They represent an innovative technique for model fitting that doesn’t rely on conventional assumptions necessary for standard models and they can also quite effectively handle multivariate response data. A neural network model is very similar to a non-linear regression model, with the exception that the former can handle an incredibly large amount of model parameters. For this reason, neural network models are said to have the ability to approximate any continuous function.

I’ve been dabbling with neural network models for my ‘research’ over the last few months. I’ll admit that I was drawn to the approach given the incredible amount of hype and statistical voodoo that is attributed to these models. After working with neural networks, I’ve found that they are often no better, and in some cases much worse, than standard statistical techniques for predicting relationships among variables. Regardless, the foundational theory of neural networks is pretty interesting, especially when you consider how computer science and informatics has improved our ability to create useful models.

R has a few packages for creating neural network models (neuralnet, nnet, RSNNS). I have worked extensively with the nnet package created by Brian Ripley. The functions in this package allow you to develop and validate the most common type of neural network model, i.e, the feed-forward multi-layer perceptron. The functions have enough flexibility to allow the user to develop the best or most optimal models by varying parameters during the training process. One major disadvantage is an inability to visualize the models. In fact, neural networks are commonly criticized as ‘black-boxes’ that offer little insight into causative relationships among variables. Recent research, primarily in the ecological literature, has addressed this criticism and several approaches have since been developed to ‘illuminate the black-box’.1

As far as I know, none of the recent techniques for evaluating neural network models are available in R. Özesmi and Özemi (1999)2 describe a neural interpretation diagram (NID) to visualize parameters in a trained neural network model. These diagrams allow the modeler to qualitatively examine the importance of explanatory variables given their relative influence on response variables, using model weights as inference into causation. More specifically, the diagrams illustrate connections between layers with width and color in proportion to magnitude and direction of each weight. More influential variables would have lots of thick connections between the layers.

In this blog I present a function for plotting neural networks from the nnet package. This function allows the user to plot the network as a neural interpretation diagram, with the option to plot without color-coding or shading of weights. The neuralnet package also offers a plot method for neural network objects and I encourage interested readers to check it out. I have created the function for the nnet package given my own preferences for aesthetics, so its up to the reader to choose which function to use. I plan on preparing this function as a contributed package to CRAN, but thought I’d present it in my blog first to gauge interest.

Let’s start by creating an artificial dataset to build the model. This is a similar approach that I used in my previous blog about collinearity. We create eight random variables with an arbitrary correlation struction and then create a response variable as a linear combination of the eight random variables.

require(clusterGeneration)

set.seed(2)
num.vars<-8
num.obs<-1000

#arbitrary correlation matrix and random variables
cov.mat<-genPositiveDefMat(num.vars,covMethod=c("unifcorrmat"))$Sigma
rand.vars<-mvrnorm(num.obs,rep(0,num.vars),Sigma=cov.mat)
parms<-runif(num.vars,-10,10)

#response variable as linear combination of random variables and random error term
y<-rand.vars %*% matrix(parms) + rnorm(num.obs,sd=20)

Now we can create a neural network model using our synthetic dataset. The nnet function can input a formula or two separate arguments for the response and explanatory variables (we use the latter here). We also have to convert the response variable to a 0-1 continuous scale in order to use the nnet function properly (via the linout argument, see the documentation).

require(nnet)

rand.vars<-data.frame(rand.vars)
y<-data.frame((y-min(y))/(max(y)-min(y)))
names(y)<-'y'

mod1<-nnet(rand.vars,y,size=10,linout=T)

We’ve created a neural network model with ten nodes in the hidden layer and a linear transfer function for the response variable. All other arguments are as default. The tricky part of developing an optimal neural network model is identifying a combination of parameters that produces model predictions closest to observed. Keeping all other arguments as default is not a wise choice but is a trivial matter for this blog. Next, we use the plot function now that we have a neural network object.

First we import the function from my Github account (aside: does anyone know a better way to do this?).

#import function from Github
require(RCurl)

root.url<-'https://gist.githubusercontent.com/fawda123'
raw.fun<-paste(
  root.url,
  '5086859/raw/cc1544804d5027d82b70e74b83b3941cd2184354/nnet_plot_fun.r',
  sep='/'
  )
script<-getURL(raw.fun, ssl.verifypeer = FALSE)
eval(parse(text = script))
rm('script','raw.fun')

The function is now loaded in our workspace as plot.nnet. We can use the function just by calling plot since it recognizes the neural network object as input.

par(mar=numeric(4),mfrow=c(1,2),family='serif')
plot(mod1,nid=F)
plot(mod1)

The image on the left is a standard illustration of a neural network model and the image on the right is the same model illustrated as a neural interpretation diagram (default plot). The black lines are positive weights and the grey lines are negative weights. Line thickness is in proportion to magnitude of the weight relative to all others. Each of the eight random variables are shown in the first layer (labelled as X1-X8) and the response variable is shown in the far right layer (labelled as ‘y’). The hidden layer is labelled as H1 through H10, which was specified using the size argument in the nnet function. B1 and B2 are bias layers that apply constant values to the nodes, similar to intercept terms in a regression model.

The function has several arguments that affect the plotting method:

mod.in model object for input created from nnet function
nid logical value indicating if neural interpretation diagram is plotted, default T
all.out logical value indicating if all connections from each response variable are plotted, default T
all.in logical value indicating if all connections to each input variable are plotted, default T
wts.only logical value indicating if connections weights are returned rather than a plot, default F
rel.rsc numeric value indicating maximum width of connection lines, default 5
circle.cex numeric value indicating size of nodes, passed to cex argument, default 5
node.labs logical value indicating if text labels are plotted, default T
line.stag numeric value that specifies distance of connection weights from nodes
cex.val numeric value indicating size of text labels, default 1
alpha.val numeric value (0-1) indicating transparency of connections, default 1
circle.col text value indicating color of nodes, default ‘lightgrey’
pos.col text value indicating color of positive connection weights, default ‘black’
neg.col text value indicating color of negative connection weights, default ‘grey’
... additional arguments passed to generic plot function

Most of the arguments can be tweaked for aesthetics. We’ll illustrate using a neural network model created in the example code for the nnet function:

#example data and code from nnet function examples
ir<-rbind(iris3[,,1],iris3[,,2],iris3[,,3])
targets<-class.ind( c(rep("s", 50), rep("c", 50), rep("v", 50)) )
samp<-c(sample(1:50,25), sample(51:100,25), sample(101:150,25))
ir1<-nnet(ir[samp,], targets[samp,], size = 2, rang = 0.1,decay = 5e-4, maxit = 200)

#plot the model with different default values for the arguments
par(mar=numeric(4),family='serif')
plot.nnet(ir1,pos.col='darkgreen',neg.col='darkblue',alpha.val=0.7,rel.rsc=15,
circle.cex=10,cex=1.4,
	circle.col='brown')

The neural network plotted above shows how we can tweak the arguments based on our preferences. This figure also shows that the function can plot neural networks with multiple response variables (‘c’, ‘s’, and ‘v’ in the iris dataset).

Another useful feature of the function is the ability to get the connection weights from the original nnet object. Admittedly, the weights are an attribute of the original function but they are not nicely arranged. We can get the weight values directly with the plot.nnet function using the wts.only argument.

plot.nnet(ir1,wts.only=T)

$`hidden 1`
[1]  0.2440625  0.5161636  1.9179850 -2.8496175
[5] -1.4606777

$`hidden 2`
[1]   9.222086   6.350143   7.896035 -11.666666
[5]  -8.531172

$`out 1`
[1]  -5.868639 -10.334504  11.879805

$`out 2`
[1] -4.6083813  8.8040909  0.1754799

$`out 3`
[1]   6.2251557  -0.3604812 -12.7215625

The function returns a list with all connections to the hidden layer (hidden 1 through hidden 2) and all connections to the output layer (out1 through out3). The first value in each element of the list is the weight from the bias layer.

The last features of the plot.nnet function we’ll look at are the all.in and all.out arguments. We can use these arguments to plot connections for specific variables of interest. For example, what if we want to examine the weights that are relevant only for sepal width (‘Sepal W.’) and virginica sp. (‘v’)?

plot.nnet(ir1,pos.col='darkgreen',neg.col='darkblue',alpha.val=0.7,rel.rsc=15,
	circle.cex=10,cex=1.4,circle.col='brown',all.in='Sepal W.',all.out='v')

This exercise is relatively trivial for a small neural network model but can be quite useful for a larger model.

Feel free to grab the function from Github (linked above). I encourage suggestions on ways to improve its functionality. I will likely present more quantitative methods of evaluating neural networks in a future blog, so stay tuned!

1Olden, J.D., Jackson, D.A. 2002. Illuminating the ‘black-box’: a randomization approach for understanding variable contributions in artificial neural networks. Ecological Modelling. 154:135-150.
2Özesmi, S.L., Özesmi, U. 1999. An artificial neural network approach to spatial habitat modeling with interspecific interaction. Ecological Modelling. 116:15-31.