Process and observation uncertainty explained with R

Once upon a time I had grand ambitions of writing blog posts outlining all of the examples in the Ecological Detective.1 A few years ago I participated in a graduate seminar series where we went through many of the examples in this book. I am not a population biologist by trade but many of the concepts were useful for not only helping me better understand core concepts of statistical modelling, but also for developing an appreciation of the limits of your data. Part of this appreciation stems from understanding sources and causes of uncertainty in estimates. Perhaps in the future I will focus more blog topics on other examples from The Ecological Detective, but for now I’d like to discuss an example that has recently been of interest in my own research.

Over the past few months I have been working with some colleagues to evaluate the statistical power of biological indicators. These analyses are meant to describe the certainty with which a given level of change in an indicator can be detected over a period of time. For example, what is the likelihood of detecting a 50% decline over twenty years considering that our estimates of the indicator are influenced by uncertainty? We need reliable estimates of the uncertainty to answer these types of questions, and it is often useful to categorize sources of variation. Hilborn and Mangel describe process and observation uncertainty as two primary categories of noise in a data measurement. Process uncertainty describes noise related to actual or real variation in a measurement that a model does not describe. For example, a model might describe the response of an indicator to changing pollutant loads but not capture the seasonal variation that occurs naturally over time. Observation uncertainty, often called sampling uncertainty, describes our ability to obtain a precise data measurement. This is a common source of uncertainty in ecological data, where the precision of repeated surveys may be affected by several factors, such as the skill level of the field crew, the precision of sampling devices, and the location of survey points. The effects of process and observation uncertainty on data measurements are additive, such that the magnitude of each can be estimated separately.

The example I’ll focus on is described on pages 59–61 (the theory) and 90–92 (an example with pseudocode) in The Ecological Detective. This example describes an approach for conceptualizing the effects of uncertainty on model estimates, as opposed to methods for quantifying uncertainty from actual data. For the most part, this blog is an exact replica of the example, although I have tried to include some additional explanation where I had difficulty understanding some of the concepts. Of course, I’ll also include R code since that’s the primary motivation for my blog.

We start with a basic population model that describes population change over time. This is a theoretical model that, in practice, should describe some actual population, but is very simple for the purpose of learning about sources of uncertainty. From this basic model, we simulate sources of uncertainty to get an idea of their exact influence on our data measurements. The basic model without imposing uncertainty is as follows:

\displaystyle N_{t+1}=sN_t + b_t

where the population at time t + 1 is equal to the population at time t multiplied by the survival probability s plus the number of births at time t. We call this the process model because it’s meant to describe an actual population process, i.e., population growth over time given birth and survival. We can easily create a function to model this process over a time series. As in the book example, we’ll use a starting population of fifty individuals, add 20 individuals from births at each time step, and use an 80% survival rate.

# simple pop model
proc_mod <- function(N_0 = 50, b = 20, s = 0.8, t = 50){
	
	N_out <- numeric(length = t + 1)
	N_out[1] <- N_0
	
	for(step in 1:t) 
		N_out[step + 1] <- s*N_out[step] + b
	
	out <- data.frame(steps = 1:t, Pop = N_out[-1])
	
	return(out)
	
}

est <- proc_mod()

The model is pretty straightforward. A for loop is used to estimate the population for time steps one to fifty with a starting population size of fifty at time zero. Each time step multiplies the population estimate from the previous time step by the survival rate and adds twenty new individuals. You may notice that the function could easily be vectorized, but I’ve used a for loop to account for sources of uncertainty that are dependent on previous values in the time series. This will be explained below; for now the model only describes the actual process.

The results are assigned to the est object and then plotted.

library(ggplot2)
ggplot(est, aes(steps, Pop)) + 
	geom_point() + 
	theme_bw() + 
	ggtitle('N_0 = 50, s = 0.8, b = 20\n')
Fig: Population over time using a simplified process model.

In a world with absolute certainty, an actual population would follow this trend if our model accurately described the birth and survivorship rates. Suppose our model provided an incomplete description of the population. Hilborn and Mangel (p. 59) suggest that birth rates, for example, may fluctuate from year to year. This fluctuation is not captured by our model and represents a source of process uncertainty, or uncertainty caused by an incomplete description of the process. We can assume that the effect of this process uncertainty is additive to the population estimate at each time step:

\displaystyle N_{t+1}=sN_t + b_t + W_t

where the model remains the same but we’ve included an additional term, W_t, to account for uncertainty. This uncertainty is random in the sense that we don’t know exactly how it will influence our estimate, but we can describe it as a random variable from a known distribution. Suppose we expect random variation in birth rates for each time step to be normally distributed with mean zero and a given standard deviation. Population size at t + 1 is the surviving portion of the population at time t plus the births, accounting for random variation. An important point is that the random variation is additive throughout the time series. That is, if more births were observed in a given year due to random chance, the population would be larger the next year, such that additional random variation at t + 1 is added to the larger population. This is why a for loop is used: we can’t simulate this uncertainty by adding a random vector to the population estimates all at once.

The original model is modified to include process uncertainty.

# simple pop model with process uncertainty 
proc_mod2 <- function(N_0 = 50, b = 20, s = 0.8, t = 50, 
        sig_w = 5){
	
	N_out <- numeric(length = t + 1)
	N_out[1] <- N_0
	
	sig_w <- rnorm(t, 0, sig_w)
	
	for(step in 1:t) 
		N_out[step + 1] <- s*N_out[step] + b + sig_w[step]
	
	out <- data.frame(steps = 1:t, Pop = N_out[-1])
	
	return(out)
	
}

set.seed(2)
est2 <- proc_mod2()

# plot the estimates
ggt <- paste0('N_0 = 50, s = 0.8, b = 20, sig_w = ', formals(proc_mod2)$sig_w, '\n')
ggplot(est2, aes(steps, Pop)) + 
	geom_point() + 
	theme_bw() + 
	ggtitle(ggt)
Fig: Population over time using a simplified process model that includes process uncertainty.

We see considerable variation from the original model now that we’ve included process uncertainty. Note that the process uncertainty in each estimate is dependent on the estimate prior, as described above. This creates uncertainty that, although random, follows a pattern throughout the time series. We can look at an auto-correlation plot of the new estimates minus the actual population values to get an idea of this pattern. Observations that are closer to one another in the time series are correlated, as expected.
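The auto-correlation plot below can be reproduced from the residuals of the two runs; a minimal sketch, assuming est (deterministic model) and est2 (process uncertainty) from above:

# auto-correlation of residuals: estimates with process uncertainty minus deterministic estimates
acf(est2$Pop - est$Pop)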

Fig: Auto-correlation between observations with process uncertainty at different time lags.


Adding observation uncertainty is simpler in that the effect is not propagated throughout the time steps. Rather, the uncertainty is added after the time series is generated. This makes intuitive sense because observation uncertainty describes sampling error. For example, if an instrument malfunction creates an unreliable estimate one year, we can fix the instrument to get a more accurate reading the next year. However, suppose a new field crew the following year contributes to uncertainty (e.g., wrong species identification). This uncertainty is not related to the year prior. Computationally, the model is as follows:

\displaystyle N_{t+1}=sN_t + b_t

\displaystyle N^{*} = N + V

where the model is identical to the deterministic model with the addition of observation uncertainty V after the time series is calculated for fifty time steps. N is the population estimate for the whole time series and N^{*} is the estimate including observation uncertainty. We can simulate observation uncertainty using a random normal variable with assumed standard deviation as we did with process uncertainty, e.g., V has length fifty with mean zero and standard deviation equal to five.

# model with observation uncertainty
proc_mod3 <- function(N_0 = 50, b = 20, s = 0.8, t = 50, sig_v = 5){
	
	N_out <- numeric(length = t + 1)
	N_out[1] <- N_0
	
	sig_v <- rnorm(t, 0, sig_v)
	
	for(step in 1:t) 
		N_out[step + 1] <- s*N_out[step] + b
	
	N_out <- N_out + c(NA,sig_v)
	
	out <- data.frame(steps = 1:t, Pop = N_out[-1])
	
	return(out)
	
}

# get estimates
set.seed(3)
est3 <- proc_mod3()

# plot
ggt <- paste0('N_0 = 50, s = 0.8, b = 20, sig_v = ',
        formals(proc_mod3)$sig_v,'\n')
ggplot(est3, aes(steps, Pop)) + 
	geom_point() + 
	theme_bw() + 
	ggtitle(ggt)
Fig: Population over time using a simplified process model that includes observation uncertainty.



We can confirm that the observations are not correlated between the time steps, unlike the model with process uncertainty.
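The corresponding plot can be produced the same way; a quick sketch using est3 and the deterministic estimates in est:

# auto-correlation of residuals: estimates with observation uncertainty minus deterministic estimates
acf(est3$Pop - est$Pop)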

Fig: Auto-correlation between observations with observation uncertainty at different time lags.


Now we can create a model that includes both process and observation uncertainty by combining the above functions. The function is slightly tweaked to return a data frame with all estimates: process model only, process model with process uncertainty, process model with observation uncertainty, and process model with both process and observation uncertainty.

# combined function
proc_mod_all <- function(N_0 = 50, b = 20, s = 0.8, t = 50, 
        sig_w = 5, sig_v = 5){
	
	N_out <- matrix(NA, ncol = 4, nrow = t + 1)
	N_out[1,] <- N_0
	
	sig_w <- rnorm(t, 0, sig_w)
	sig_v <- rnorm(t, 0, sig_v)
	
	for(step in 1:t){
		
		# process model only
		N_out[step + 1, 1] <- s*N_out[step, 1] + b
		
		# process model with process uncertainty propagated through time
		N_out[step + 1, 2] <- s*N_out[step, 2] + b + sig_w[step]
		
		}
	
	# observation uncertainty added after the time series is created
	N_out[1:t + 1, 3]  <- N_out[1:t + 1, 1] + sig_v
		
	N_out[1:t + 1, 4]  <- N_out[1:t + 1, 2] + sig_v
	
	out <- data.frame(1:t,N_out[-1,])
	names(out) <- c('steps', 'mod_act', 'mod_proc', 'mod_obs', 'mod_all')
	
	return(out)
	
}

# create data
set.seed(2)
est_all <- proc_mod_all()

# plot the data
library(reshape2)
to_plo <- melt(est_all, id.var = 'steps')

# re-assign factor labels for plotting
to_plo$variable <- factor(to_plo$variable, levels = levels(to_plo$variable),
	labels = c('Actual','Pro','Obs','Pro + Obs'))

ggplot(to_plo, aes(steps, value)) + 
	geom_point() + 
	facet_wrap(~variable) + 
	ylab('Pop. estimate') + 
	theme_bw()
Fig: Population over time using a simplified process model that includes no uncertainty (actual), process uncertainty (Pro), observation uncertainty (Obs), and both (Pro + Obs).


On the surface, the separate effects of process and observation uncertainty on the estimates are similar, whereas adding both maximizes the overall uncertainty. We can quantify the extent to which the sources of uncertainty influence the estimates by comparing observations at time t to observations at t - 1. In other words, we can quantify the variance for each model by regressing observations separated by one time lag. We would expect the model that includes both sources of uncertainty to have the highest variance.

# comparison of mods
# create vectors for pop estimates at time t (t_1) and t - 1 (t_0)
t_1 <- est_all[2:nrow(est_all),-1]
t_1 <- melt(t_1, value.name = 'val_1')
t_0 <- est_all[1:(nrow(est_all)-1),-1]
t_0 <- melt(t_0, value.name = 'val_0')

#combine for plotting
to_plo2 <- cbind(t_0,t_1[,!names(t_1) %in% 'variable',drop = F])
head(to_plo2)
##   variable   val_0    val_1
## 1  mod_act 60.0000 68.00000
## 2  mod_act 68.0000 74.40000
## 3  mod_act 74.4000 79.52000
## 4  mod_act 79.5200 83.61600
## 5  mod_act 83.6160 86.89280
## 6  mod_act 86.8928 89.51424

# re-assign factor labels for plotting
to_plo2$variable <- factor(to_plo2$variable, levels = levels(to_plo2$variable),
	labels = c('Actual','Pro','Obs','Pro + Obs'))

# we don't want to plot the first process model
sub_dat <- to_plo2$variable == 'Actual'
ggplot(to_plo2[!sub_dat,], aes(val_0, val_1)) + 
	geom_point() + 
	facet_wrap(~variable) + 
	theme_bw() + 
	scale_y_continuous('Population size at time t') + 
	scale_x_continuous('Population size at time t - 1') +
	geom_abline(slope = 0.8, intercept = 20)
Fig: Evaluation of uncertainty in population estimates affected by process uncertainty (Pro), observation uncertainty (Obs), and both (Pro + Obs). The line indicates data from the actual process model without uncertainty.

A tabular comparison of the regressions for each plot provides a quantitative measure of the effect of uncertainty on the model estimates.

library(stargazer)
mods <- lapply(
	split(to_plo2,to_plo2$variable), 
	function(x) lm(val_1~val_0, data = x)
	)
stargazer(mods, omit.stat = 'f', title = 'Regression of population estimates at time $t$ against time $t - 1$ for each process model.  Each model except the first simulates different sources of uncertainty.', column.labels = c('Actual','Pro','Obs','Pro + Obs'), model.numbers = F)

The table tells us exactly what we would expect. Based on the r-squared values, adding more uncertainty decreases the explained variance of the models. Also note the changes in the parameter estimates. The actual model provides slope and intercept estimates identical to those we specified in the beginning (s = 0.8 and b = 20). Adding more uncertainty to each model contributes to uncertainty in the parameter estimates such that survivorship is under-estimated and birth contributions are over-estimated.
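If stargazer is not available, the same quantities can be pulled directly from the fitted models in the mods list created above; a minimal sketch:

# r-squared values and parameter estimates for each model
sapply(mods, function(x) summary(x)$r.squared)
t(sapply(mods, coef))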

It’s nice to use an arbitrary model where we can simulate effects of uncertainty, unlike situations with actual data where sources of uncertainty are not readily apparent. This example from The Ecological Detective is useful for appreciating the effects of uncertainty on parameter estimates in simple process models. I refer the reader to the actual text for more discussion regarding the implications of these analyses. Also, check out Ben Bolker’s text2 (chapter 11) for more discussion with R examples.

Cheers,

Marcus

1Hilborn R, Mangel M. 1997. The Ecological Detective: Confronting Models With Data. Monographs in Population Biology 28. Princeton University Press. Princeton, New Jersey. 315 pages.
2Bolker B. 2007. Ecological Models and Data in R. Princeton University Press. Princeton, New Jersey. 508 pages.

Collinearity and stepwise VIF selection

Collinearity, or excessive correlation among explanatory variables, can complicate or prevent the identification of an optimal set of explanatory variables for a statistical model. For example, forward or backward selection of variables could produce inconsistent results, variance partitioning analyses may be unable to identify unique sources of variation, or parameter estimates may include substantial amounts of uncertainty. The temptation to build an ecological model using all available information (i.e., all variables) is hard to resist. Lots of time and money are spent gathering data and supporting information. We also hope to identify every significant variable to more accurately characterize relationships with biological relevance. Analytical limitations related to collinearity require us to think carefully about the variables we choose to model, rather than adopting a naive approach where we blindly use all information to understand complexity. The purpose of this blog is to illustrate the use of some techniques to reduce collinearity among explanatory variables, using a simulated dataset with a known correlation structure.

A simple approach to identify collinearity among explanatory variables is the use of variance inflation factors (VIF). VIF calculations are straightforward and easily comprehensible; the higher the value, the higher the collinearity. A VIF for a single explanatory variable is obtained using the r-squared value of the regression of that variable against all other explanatory variables:

\displaystyle VIF_j=\frac{1}{1-R_j^2}

where the VIF for variable j is the reciprocal of one minus the R^2 from that regression. A VIF is calculated for each explanatory variable, and those with high values are removed. The definition of ‘high’ is somewhat arbitrary, but values in the range of 5-10 are commonly used.
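As a concrete illustration, the VIF for a single variable can be computed by hand from that regression; a minimal sketch, where dat and X1 are hypothetical placeholders for a data frame of explanatory variables and one of its columns:

# VIF for X1 by hand: regress X1 against all other explanatory variables
r2_j <- summary(lm(X1 ~ ., data = dat))$r.squared  # dat and X1 are hypothetical
1/(1 - r2_j)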

Several packages in R provide functions to calculate VIF: vif in package HH, vif in package car, VIF in package fmsb, vif in package faraway, and vif in package VIF. The number of packages that provide VIF functions is surprising given that they all seem to accomplish the same thing. One exception is the function in the VIF package, which can be used to create linear models using VIF-regression. The nuts and bolts of this function are a little unclear since the documentation for the package is sparse. However, this function does accomplish something that the others do not: stepwise selection of variables using VIF. Removing several variables at once based on a single comparison using the full set of explanatory variables is insufficient, because the VIF values change after each variable is removed. Accordingly, a more thorough implementation of the VIF function is to use a stepwise approach until all VIF values are below a desired threshold. For example, using the full set of explanatory variables, calculate a VIF for each variable, remove the variable with the single highest value, recalculate all VIF values with the new set of variables, remove the variable with the next highest value, and so on, until all values are below the threshold.

In this blog we’ll use a custom function for stepwise variable selection. I’ve created this function because I think it provides a useful example for exploring stepwise VIF analysis. The function is a wrapper for the VIF function in the fmsb package. We’ll start by simulating a dataset with a known correlation structure.

require(MASS)
require(clusterGeneration)

set.seed(2)
num.vars<-15
num.obs<-200
cov.mat<-genPositiveDefMat(num.vars,covMethod="unifcorrmat")$Sigma
rand.vars<-mvrnorm(num.obs,rep(0,num.vars),Sigma=cov.mat)

We’ve created fifteen ‘explanatory’ variables with 200 observations each. The mvrnorm function (MASS package) was used to create the data using a covariance matrix from the genPositiveDefMat function (clusterGeneration package). These functions provide a really simple approach to creating data matrices with arbitrary correlation structures. The covariance matrix was chosen from a uniform distribution such that some variables are correlated while others are not. A more thorough explanation about creating correlated data matrices can be found here. The correlation matrix for the random variables should look very similar to the correlation matrix implied by the covariance matrix (as the sample size increases, the sample correlation matrix approaches the correlation matrix derived from cov.mat).
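The comparison can be checked directly; a quick sketch using cov2cor from base R to convert the covariance matrix to its implied correlation matrix:

# sample correlations vs. correlations implied by cov.mat
round(cor(rand.vars), 2)
round(cov2cor(cov.mat), 2)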

Fig: Correlation matrix of the simulated explanatory variables.

Now we create our response variable as a linear combination of the explanatory variables. First, we create a vector for the parameters describing the relationship of the response variable with the explanatory variables. Then, we use some matrix algebra and a randomly distributed error term to create the response variable. This is the standard form for a linear regression model.

parms<-runif(num.vars,-10,10)
y<-rand.vars %*% matrix(parms) + rnorm(num.obs,sd=20)

We would expect a regression model to indicate each of the fifteen explanatory variables are significantly related to the response variable, since we know the true relationship of y with each of the variables. However, our explanatory variables are correlated. What happens when we create the model?

lm.dat<-data.frame(y,rand.vars)
form.in<-paste('y ~',paste(names(lm.dat)[-1],collapse='+'))
mod1<-lm(form.in,data=lm.dat)
summary(mod1)

Fig: Summary of the regression model using all explanatory variables (mod1).

The model shows that only four of the fifteen explanatory variables are significantly related to the response variable (at \alpha = 0.05), yet we know that every one of the variables is related to y. As we’ll see later, the standard errors are also quite large.
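The count of significant variables can be verified directly from the model summary; a small sketch (intercept excluded):

# number of explanatory variables with p < 0.05, intercept excluded
sum(summary(mod1)$coefficients[-1, 'Pr(>|t|)'] < 0.05)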

We can try an alternative approach to building the model that accounts for collinearity among the explanatory variables. We can implement the custom VIF function as follows.

vif_func(in_frame=rand.vars,thresh=5,trace=T)

 var vif             
 X1  27.7352782054202
 X2  36.8947196546879
 X3  12.5694198086941
 X4  50.7385544899845
 X5  8.35069942629285
 X6  114.685122241139
 X7  67.3415420139211
 X8  153.597012767649
 X9  48.226662808638 
 X10 50.7371404106266
 X11 33.9720046917178
 X12 43.2541022358368
 X13 12.0823286959991
 X14 74.6186892947576
 X15 29.8722459010406

removed:  X8 153.597 

 var vif             
 X1  6.67306561938667
 X2  7.98347501302268
 X3  4.56187657632574
 X4  8.03048468634153
 X5  7.70736760805487
 X6  19.6743072270573
 X7  52.9521670096974
 X9  17.8683960730445
 X10 46.2484642889962
 X11 18.2479446141727
 X12 42.133697798185 
 X13 10.8973377491163
 X14 37.9296952803818
 X15 21.5847028917955

removed:  X7 52.95217 

 var vif             
 X1  6.54376168051204
 X2  7.68236114754164
 X3  4.04873004990332
 X4  5.08958904348524
 X5  2.65685239947949
 X6  9.12685384862522
 X9  2.89940351012031
 X10 4.24712217472346
 X11 4.45202381077724
 X12 12.8835110845825
 X13 1.92759852488083
 X14 6.02382346000219
 X15 8.33332235677386

removed:  X12 12.88351 

 var vif             
 X1  4.21690743815539
 X2  6.88058249786417
 X3  3.88265091747854
 X4  4.48146995293666
 X5  2.20144300270463
 X6  7.76775127218149
 X9  2.71324446993905
 X10 2.90517847805482
 X11 4.43541888871566
 X13 1.78221291029774
 X14 5.02289299193397
 X15 3.02196822776011

removed:  X6 7.767751 

 [1] "X1"  "X2"  "X3"  "X4"  "X5"  "X9"  "X10" "X11" "X13" "X14" "X15"

The function takes three main arguments (an optional weights argument is also available). The first is a matrix or data frame of the explanatory variables, the second is the threshold value to use for retaining variables, and the third is a logical argument indicating whether text output is returned as the stepwise selection progresses. The output indicates the VIF values for each variable after each stepwise comparison. The function calculates the VIF values for all explanatory variables, removes the variable with the highest value, and repeats until all VIF values are below the threshold. The final output is a list of variable names with VIF values that fall below the threshold. Now we can create a linear model using explanatory variables with less collinearity.

keep.dat<-vif_func(in_frame=rand.vars,thresh=5,trace=F)
form.in<-paste('y ~',paste(keep.dat,collapse='+'))
mod2<-lm(form.in,data=lm.dat)
summary(mod2)

Fig: Summary of the regression model using the reduced set of explanatory variables (mod2).

The updated regression model is much improved over the original. We see an increase in the number of variables that are significantly related to the response variable. This increase is directly related to the standard error estimates for the parameters, which look at least 50% smaller than those in the first model. The take home message is that true relationships among variables will be masked if explanatory variables are collinear. This creates problems during model creation that lead to complications in model inference. Taking the extra time to evaluate collinearity is a critical first step toward creating more robust ecological models.
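The change in standard errors can be checked directly by comparing the coefficients shared by both models; a minimal sketch:

# ratio of standard errors in mod2 to the corresponding standard errors in mod1
se1 <- summary(mod1)$coefficients[, 'Std. Error']
se2 <- summary(mod2)$coefficients[, 'Std. Error']
round(se2/se1[names(se2)], 2)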

Function and example code:

vif_func <- function(in_frame, thresh = 10, trace = T, wts = NULL, ...){
	
	library(fmsb)
	
	if(any(!'data.frame' %in% class(in_frame))) in_frame <- data.frame(in_frame)
	
	if(is.null(wts))
		wts <- rep(1, ncol(in_frame))
	
	if(!is.null(wts))
		if(length(wts) != ncol(in_frame)) stop('length of weights must equal number of variables')
	
	if(any(wts < 0))
		stop('weights must be positive')
	
	names(wts) <- names(in_frame)
	
	# get initial vif value for all comparisons of variables
	vif_init <- NULL
	var_names <- names(in_frame)
	for(val in var_names){
		wt <- wts[val]
		regressors <- var_names[-which(var_names == val)]
		form <- paste(regressors, collapse = '+')
		form_in <- formula(paste(val, '~', form))
		vif_init <- rbind(vif_init, c(val, VIF(lm(form_in, data = in_frame, ...)) / wt))
		}
	vif_max <- max(as.numeric(vif_init[,2]), na.rm = TRUE)
	
	if(vif_max < thresh){
		if(trace == T){ # print output of each iteration
			prmatrix(vif_init, collab = c('var', 'vif'), rowlab = rep('', nrow(vif_init)), quote = F)
			cat('\n')
			cat(paste('All variables have VIF < ', thresh, ', max VIF ', round(vif_max, 2), sep = ''), '\n\n')
			}
		return(var_names)
		}
	else{
		
		in_dat <- in_frame
		
		# backwards selection of explanatory variables, stops when all VIF values are below 'thresh'
		while(vif_max >= thresh){
			
			vif_vals <- NULL
			var_names <- names(in_dat)
			
			for(val in var_names){
				wt <- wts[val]
				regressors <- var_names[-which(var_names == val)]
				form <- paste(regressors, collapse = '+')
				form_in <- formula(paste(val, '~', form))
				vif_add <- VIF(lm(form_in, data = in_dat, ...)) / wt
				vif_vals <- rbind(vif_vals, c(val, vif_add))
				}
			
			max_row <- which(vif_vals[,2] == max(as.numeric(vif_vals[,2]), na.rm = TRUE))[1]
			
			vif_max <- as.numeric(vif_vals[max_row, 2])
			
			if(vif_max < thresh) break
			
			if(trace == T){ # print output of each iteration
				prmatrix(vif_vals, collab = c('var', 'vif'), rowlab = rep('', nrow(vif_vals)), quote = F)
				cat('\n')
				cat('removed: ', vif_vals[max_row, 1], vif_max, '\n\n')
				flush.console()
				}
			
			in_dat <- in_dat[, !names(in_dat) %in% vif_vals[max_row, 1], drop = F]
			
			if(ncol(in_dat) == 1) break
			
			}
		
		return(names(in_dat))
		
		}
	
	}