Data fishing: R and XML part 3

I’ve recently posted two blogs about gathering data from web pages using functions in R. Both examples showed how we can create our own custom functions to gather data about Minnesota lakes from the Lakefinder website. The first post was an example showing the use of R to create our own custom functions to get specific lake information and the second was a more detailed example showing how we can get data on fish populations in each lake. Each of the examples used functions in the XML package to parse the raw XML/HTML from the web page for each lake. I promised to show some results using information that was gathered with the catch function described in the second post. The purpose of this blog is to present some of this information and to further illustrate the utility of data mining using public data and functions in R.

I’ll begin by describing the functionality of a revised catch function and the data that can be obtained. I’ve revised the original function to return more detailed information and decrease processing time, so this explanation is not a repeat of my previous blog. First, we import the function from Github with some help from the RCurl package.

#import function from Github
require(RCurl)

root.url<-'https://gist.github.com/fawda123'
raw.fun<-paste(
  root.url,
  '4978092/raw/336671b8c3d44f744c3c8c87dd18e6d871a3b370/catch_fun_v2.r',
  sep='/'
  )

script<-getURL(raw.fun, ssl.verifypeer = FALSE)
eval(parse(text = script))
rm('script','raw.fun')

#function name 'catch.fun.v2'

The revised function is much simpler than the previous and requires only three arguments.

formals(catch.fun.v2)
#dows

#$trace
#T

#$rich.out
#F

The ‘dows’ argument is a character vector of the 8-digit lake identifier numbers used by MNDNR, ‘trace’ is a logical specifying if progress is returned as the function runs, and ‘rich.out’ is a logical specifying if a character vector of all fish species is returned as a single element list. The ‘dows’ argument can process multiple lakes so long as you have the numbers. The default value for the last argument will return a list of two data frames with complete survey information for each lake.

I revised the original function after realizing that I’ve made the task of parsing XML for individual species much too difficult. The original function searches species names and gear type to find the appropriate catch data. A simpler approach is to find the entire survey table by parsing with <table></table> tags. The original function parsed the XML using <tr></tr> tags for table rows and then identified the appropriate species and gear type within the rows. The latter approach required multiple selection criteria and was incredibly inefficient. The new function returns entire tables, which can be further modified within R as needed. Here’s an example for Loon lake in Aitkin county.

catch.fun.v2('01002400')

#01002400; 1 of 1 
#Time difference of 1.844106 secs
#$`01002400`
#$`01002400`$abundance
#Species        Net Caught Ave_wt
#1  White Sucker  Trap net     0.1   ND  
#2 Rainbow Trout  Trap net   trace   ND  
#3   Brook Trout  Trap net     4.6   ND  

#$`01002400`$length
#Species   0-5   6-8  9-11 12-14 15-19
#1   Brook Trout     0    28    36    22     3
#2 Rainbow Trout     0     0     0     1     0
#20-24 25-29   30+  Total
#1     0     0     0     89
#2     0     0     0      1

This is the same information that is on the lakefinder website. A two-element list is returned for each lake, each containing a data frame. The first element is a data frame of species catch (fish per net) and average weight (lbs per fish) using trap nets and gill nets (this example only contained trap nets). The second element is a length distribution table showing the number of fish caught within binned length categories (inches). A character string specifying ‘No survey’ will be returned if the lake isn’t found on the website or if no fish surveys are available. A character string indicating ‘No length table’ will be returned in the second list element if only the abundance table is available.

A list of the species present is returned if ‘rich.out’ is true.

catch.fun.v2('01002400',rich.out=T)

#01002400; 1 of 1 
#Time difference of 0.419024 secs
#$`01002400`
#[1] "Brook Trout"   "Rainbow Trout"
#[3] "White Sucker" 

Clearly the utility of this function is the ability to gather lake data for multiple lakes without visiting each individual page on Lakefinder. I’ve used the function to get lake data for every lake on the website. The dow numbers were obtained from a publicly available shapefile of lakes in Minnesota. This included 17,085 lakes, which were evaluated by the function in less than an hour. Of these lakes, 3,934 had fish survey information. One drawback of the function is that I haven’t included the date of the survey in the output (updated, see link at bottom of blog). The data that are returned span several years, and in some cases, the most recent lake survey could have been over 40 years ago.

Below is a presentation of some of the more interesting information obtained from the data. First, I’ve included two tables that show the top five lakes for abundance and average weight for two popular game fish. The tables are separated by gear type since each targets different habitats. Keep in mind that this information may not represent the most current population data.

Top five lakes in Minnesota for catch (fish per net) and average weight (lbs per fish) of walleye separated by net type.
Net type DOW Name County Caught Ave_wt
Gill 21014500 Chippewa Douglas 9.92
9005700 Eagle Carlton 9.89
69073100 Auto  St. Louis 9.75
31089600 Round  Itasca 9.67
47011900 Minnie Belle Meeker 9.67
11011400 Mitten Cass 9.38
31079300 Big Too Much Itasca 8.86
9004300 Moose Carlton 8.05
11019900 Hay Cass 7.82
11017700 Three Island Cass 7.61
Trap 87001900 Tyson Yellow Medicine 9.67
69019000 Big St. Louis 8.5
16025300 Deer Yard Cook 7.55
16064500 Toohey Cook 7.33
51004600 Shetek Murray 7.27
69017400 East Twin St. Louis 9.04
21016200 Freeborn Douglas 8.65
31054700 Smith Itasca 8.39
69054400 Dinham St. Louis 8.38
82002300 Lily Washington 8.11
Top five lakes in Minnesota for catch (fish per net) and average weight (lbs per fish) of northern pike separated by net type.
Net type DOW Name County Caught Ave_wt
Gill 4019600 Campbell Beltrami 9.89
11017400 Girl Cass 9.89
56091500 Prairie Otter Tail 9.83
69051300 Little Grand St. Louis 9.83
34015400 Nest Kandiyohi 9.8
38022900 Little Knife Lake 9.27
7007900 Benz Todd 9.16
29018400 Blue Hubbard 8.56
18043700 Portsmouth Mine Crow Wing 8.39
69011600 Mitchell St. Louis 8.3
Trap 61002900 Westport  Pope 9.56
16024800 Ward Cook 5.17
11028400 Horseshoe Cass 4.56
16025000 Mark Cook 4.33
62005500 Como Ramsey 4
9001600 Hay Hubbard 9.96
27018100 Dutch Hennepin 8.86
27001600 Harriet Hennepin 8.61
34003300 Ella Kandiyohi 8.6
34007900 Green Kandiyohi 8.6

The above tables illustrate the value of the information that can be obtained using data mining techniques in R, even though most anglers are already aware of the best fishing lakes in Minnesota. A more comprehensive assessment of the data on Lakefinder is shown below. The fist two figures show species richness and total biomass (lbs per net) for the approximate 4000 lakes with fish surveys. Total biomass for each lake was the summation of the product of average weight per fish and fish per net for all species using gill net data. Total biomass can be considered a measure of lake productivity, although in many cases, this value is likely determined by a single species with high abundance (e.g., carp or bullhead).

Summary of species richness (count) and total biomass (lbs per net) for lakes in Minnesota.

The next two figures show average species richness and total biomass for each county. Individual values for each county are indicated.

Summary of species richness (count) and total biomass (lbs per net) for lakes in Minnesota.

We see some pretty interesting patterns when we look at the statewide data. In general, species richness and biomass decreases along a southwest to northeast direction. These patterns vary in relation to the trophic state or productivity of each lake. We can also develop some interesting ecological hypotheses for the observed relationships. The more important implication for R users is that we’ve gathered and summarized piles of data from publicly available sources, data which can generate insight into causation. Interestingly, the maps were generated in R using the proprietary shapefile format created for ArcMap. The more I explore the packages in R that work with spatial data (e.g., sp, raster, maptools), the less I have worked with ArcMap. I’ve gone to great lengths to convince myself that the application of data mining tools available in R provide a superior alternative to commercial software. I hope that these simple examples have convinced you as well.

You can access version 2 of the catch function with the above code via my Github account. I’ve provided the following code to illustrate how the maps were produced. The data for creating the maps are not provided because this information can be obtained using the examples in this blog.

##
#first figure, summary of species richness and biomass for ~4000 lakes

require(raster)
require(sp)
require(maptools)
require(RColorBrewer) #for color ramps
require(scales) #for scaling points

par(mfrow=c(1,2),mar=numeric(4),family='serif')

#species richness
plot(counties,col='lightgrey') #get county shapefile from DNR data deli, link in blog

col.fun<-colorRampPalette(brewer.pal(11,'RdYlBu'))
col.pts<-rev(col.fun(nrow(lake.pts)))[rank(lake.pts$spp_rich)] #lake.pts can be created using 'catch.fun.v2'
rsc.pts<-rescale(lake.pts$spp_rich,c(0.5,5))
points(lake.pts$UTM_X,lake.pts$UTM_Y,cex=rsc.pts,col=col.pts,pch=19)
title(sub='Species richness per lake',line=-2,cex.sub=1.5)

leg.text<-as.numeric(summary(lake.pts$spp_rich))[-4]
leg.cex<-seq(0.5,5,length=5)
leg.col<-rev(col.fun(5))
legend(650000,5250000,legend=leg.text,pt.cex=leg.cex,col=leg.col,
       title='',pch=19,bty='n', cex=1.4,y.intersp=1.4,inset=0.11)

#scale bar
axis(side=3,
     at=c(420000,520000,620000),
     line=-4,labels=c('0 km','100 km','200 km'),
     cex.axis=1
  )

#total biomass
plot(counties,col='lightgrey')

col.fun<-colorRampPalette(brewer.pal(11,'RdYlBu'))
col.pts<-rev(col.fun(nrow(lake.pts)))[rank(lake.pts$ave.bio)]
rsc.pts<-rescale(lake.pts$ave.bio,c(0.5,5))
points(lake.pts$UTM_X,lake.pts$UTM_Y,cex=rsc.pts,col=col.pts,pch=19)
title(sub='Total biomass per lake (lbs/net)',line=-2,cex.sub=1.5)

leg.text<-as.numeric(summary(lake.pts$ave.bio))[-4]
leg.cex<-seq(0.5,5,length=5)
leg.col<-rev(col.fun(5))
legend(650000,5250000,legend=leg.text,pt.cex=leg.cex,col=leg.col,
       title='',pch=19,bty='n', cex=1.4,y.intersp=1.4,inset=0.11)

##
#second figure, county summaries

lake.pts.sp<-lake.pts #data from combination of lake polygon shapefile and catch data from catch.fun.v2
coordinates(lake.pts.sp)<-cbind(lake.pts.sp$UTM_X,lake.pts.sp$UTM_Y)

test<-over(counties,lake.pts.sp[,c(7,10)],fn=mean) #critical function for summarizing county data

co.rich<-test[,'spp_rich']
co.bio<-test[,'ave.bio']

my.cols<-function(val.in){
  pal<-colorRampPalette(brewer.pal(11,'RdYlBu')) #create color vector
  out<-numeric(length(val.in))
  out[!is.na(val.in)]<-rev(pal(sum(!is.na(val.in))))[rank(val.in[!is.na(val.in)])]  
  out[out=='0']<-'lightgrey'
  out
}

#make plots
par(mfrow=c(1,2),mar=numeric(4),family='serif')
plot(counties,col=my.cols(co.rich))
text.coor<-coordinates(counties)[!is.na(co.rich),]
text(text.coor[,1],text.coor[,2],format(co.rich[!is.na(co.rich)],nsmall=1,
     digits=1),cex=0.8)
axis(side=3,
     at=c(420000,520000,620000),
     line=-4,labels=c('0 km','100 km','200 km'),
     cex.axis=1
)
title(sub='Average species richness by county',line=-2,cex.sub=1.5)

plot(counties,col=my.cols(co.bio))
text.coor<-coordinates(counties)[!is.na(co.bio),]
text(text.coor[,1],text.coor[,2],format(co.bio[!is.na(co.bio)],nsmall=1,
     digits=1),cex=0.8)
title(sub='Average biomass by county',line=-2,cex.sub=1.5)

Update: I’ve updated the catch function to return survey dates.

Collinearity and stepwise VIF selection

Collinearity, or excessive correlation among explanatory variables, can complicate or prevent the identification of an optimal set of explanatory variables for a statistical model. For example, forward or backward selection of variables could produce inconsistent results, variance partitioning analyses may be unable to identify unique sources of variation, or parameter estimates may include substantial amounts of uncertainty. The temptation to build an ecological model using all available information (i.e., all variables) is hard to resist. Lots of time and money are exhausted gathering data and supporting information. We also hope to identify every significant variable to more accurately characterize relationships with biological relevance. Analytical limitations related to collinearity require us to think carefully about the variables we choose to model, rather than adopting a naive approach where we blindly use all information to understand complexity. The purpose of this blog is to illustrate use of some techniques to reduce collinearity among explanatory variables using a simulated dataset with a known correlation structure.

A simple approach to identify collinearity among explanatory variables is the use of variance inflation factors (VIF). VIF calculations are straightforward and easily comprehensible; the higher the value, the higher the collinearity. A VIF for a single explanatory variable is obtained using the r-squared value of the regression of that variable against all other explanatory variables:

\displaystyle VIF_j=\frac{1}{1-R_j^2}

where the VIF for variable j is the reciprocal of the inverse of R^2 from the regression. A VIF is calculated for each explanatory variable and those with high values are removed. The definition of ‘high’ is somewhat arbitrary but values in the range of 5-10 are commonly used.

Several packages in R provide functions to calculate VIF: vif in package HH, vif in package car, VIF in package fmsb, vif in package faraway, and vif in package VIF. The number of packages that provide VIF functions is surprising given that they all seem to accomplish the same thing. One exception is the function in the VIF package, which can be used to create linear models using VIF-regression. The nuts and bolts of this function are a little unclear since the documentation for the package is sparse. However, what this function does accomplish is something that the others do not: stepwise selection of variables using VIF. Removing individual variables with high VIF values is insufficient in the initial comparison using the full set of explanatory variables. The VIF values will change after each variable is removed. Accordingly, a more thorough implementation of the VIF function is to use a stepwise approach until all VIF values are below a desired threshold. For example, using the full set of explanatory variables, calculate a VIF for each variable, remove the variable with the single highest value, recalculate all VIF values with the new set of variables, remove the variable with the next highest value, and so on, until all values are below the threshold.

In this blog we’ll use a custom function for stepwise variable selection. I’ve created this function because I think it provides a useful example for exploring stepwise VIF analysis. The function is a wrapper for the vif function in fmsb. We’ll start by simulating a dataset with a known correlation structure.

require(MASS)
require(clusterGeneration)

set.seed(2)
num.vars<-15
num.obs<-200
cov.mat<-genPositiveDefMat(num.vars,covMethod="unifcorrmat")$Sigma
rand.vars<-mvrnorm(num.obs,rep(0,num.vars),Sigma=cov.mat)

We’ve created fifteen ‘explanatory’ variables with 200 observations each. The mvrnorm function (MASS package) was used to create the data using a covariance matrix from the genPositiveDefMat function (clusterGeneration package). These functions provide a really simple approach to creating data matrices with arbitrary correlation structures. The covariance matrix was chosen from a uniform distribution such that some variables are correlated while some are not. A more thorough explanation about creating correlated data matrices can be found here. The correlation matrix for the random variables should look very similar to the correlation matrix from the actual values (as sample size increases, the correlation matrix approaches cov.mat).

cor_tab

Now we create our response variable as a linear combination of the explanatory variables. First, we create a vector for the parameters describing the relationship of the response variable with the explanatory variables. Then, we use some matrix algebra and a randomly distributed error term to create the response variable. This is the standard form for a linear regression model.

parms<-runif(num.vars,-10,10)
y<-rand.vars %*% matrix(parms) + rnorm(num.obs,sd=20)

We would expect a regression model to indicate each of the fifteen explanatory variables are significantly related to the response variable, since we know the true relationship of y with each of the variables. However, our explanatory variables are correlated. What happens when we create the model?

lm.dat<-data.frame(y,rand.vars)
form.in<-paste('y ~',paste(names(lm.dat)[-1],collapse='+'))
mod1<-lm(form.in,data=lm.dat)
summary(mod1)

mod1

The model shows that only four of the fifteen explanatory variables are significantly related to the response variable (at \alpha = 0.05), yet we know that every one of the variables is related to y. As we’ll see later, the standard errors are also quite large.

We can try an alternative approach to building the model that accounts for collinearity among the explanatory variables. We can implement the custom VIF function as follows.

vif_func(in_frame=rand.vars,thresh=5,trace=T)

 var vif             
 X1  27.7352782054202
 X2  36.8947196546879
 X3  12.5694198086941
 X4  50.7385544899845
 X5  8.35069942629285
 X6  114.685122241139
 X7  67.3415420139211
 X8  153.597012767649
 X9  48.226662808638 
 X10 50.7371404106266
 X11 33.9720046917178
 X12 43.2541022358368
 X13 12.0823286959991
 X14 74.6186892947576
 X15 29.8722459010406

removed:  X8 153.597 

 var vif             
 X1  6.67306561938667
 X2  7.98347501302268
 X3  4.56187657632574
 X4  8.03048468634153
 X5  7.70736760805487
 X6  19.6743072270573
 X7  52.9521670096974
 X9  17.8683960730445
 X10 46.2484642889962
 X11 18.2479446141727
 X12 42.133697798185 
 X13 10.8973377491163
 X14 37.9296952803818
 X15 21.5847028917955

removed:  X7 52.95217 

 var vif             
 X1  6.54376168051204
 X2  7.68236114754164
 X3  4.04873004990332
 X4  5.08958904348524
 X5  2.65685239947949
 X6  9.12685384862522
 X9  2.89940351012031
 X10 4.24712217472346
 X11 4.45202381077724
 X12 12.8835110845825
 X13 1.92759852488083
 X14 6.02382346000219
 X15 8.33332235677386

removed:  X12 12.88351 

 var vif             
 X1  4.21690743815539
 X2  6.88058249786417
 X3  3.88265091747854
 X4  4.48146995293666
 X5  2.20144300270463
 X6  7.76775127218149
 X9  2.71324446993905
 X10 2.90517847805482
 X11 4.43541888871566
 X13 1.78221291029774
 X14 5.02289299193397
 X15 3.02196822776011

removed:  X6 7.767751 

 [1] "X1"  "X2"  "X3"  "X4"  "X5"  "X9"  "X10" "X11" "X13" "X14" "X15"

The function uses three arguments. The first is a matrix or data frame of the explanatory variables, the second is the threshold value to use for retaining variables, and the third is a logical argument indicating if text output is returned as the stepwise selection progresses. The output indicates the VIF values for each variable after each stepwise comparison. The function calculates the VIF values for all explanatory variables, removes the variable with the highest value, and repeats until all VIF values are below the threshold. The final output is a list of variable names with VIF values that fall below the threshold. Now we can create a linear model using explanatory variables with less collinearity.

keep.dat<-vif_func(in_frame=rand.vars,thresh=5,trace=F)
form.in<-paste('y ~',paste(keep.dat,collapse='+'))
mod2<-lm(form.in,data=lm.dat)
summary(mod2)

mod1

The updated regression model is much improved over the original. We see an increase in the number of variables that are significantly related to the response variable. This increase is directly related to the standard error estimates for the parameters, which look at least 50% smaller than those in the first model. The take home message is that true relationships among variables will be masked if explanatory variables are collinear. This creates problems in model creation which lead to complications in model inference. Taking the extra time to evaluate collinearity is a critical first step to creating more robust ecological models.

Function and example code: