Data fishing: R and XML part 2

I’m constantly amazed at what can be done using free software, such as R, and more importantly, what can be done with data that are available on the internet. In an earlier post, I confessed to my sedentary lifestyle immersed in code, so my opinion regarding the utility of open-source software is perhaps biased. None the less, I’m convinced that more people would start using and continue to use R once they’ve realized the limitless applications.

My first post highlighted what can be done using the XML package, or more generally, what can be done ‘stealing’ data from the internet using free software and publicly available data. I return to this concept in this blog to show you, and hopefully convince you, that learning how to code your own functions for data mining is a worthwhile endeavor. My last example showed how we can get depth data for lakes from the Lakefinder website maintained by the Minnesota Department of Natural Resources. I’ll admit the example was boring and trivial, but my intention was to introduce the concept as the basis for further blogs. The internet contains an over-abundance of information and R provides several useful tools to gather and analyze this information.

After seeing this post about data mining with Twitter (check the video!), I was motivated to create a more interesting example than my previous posting using LakeFinder. Minnesota, the land of 11,842 lakes, has some of the best fishing in the country. The ostensible purpose of Lakefinder is to provide anglers access to lake data gathered by MNDNR. Most people don’t care about lake depth, so I developed a more flexible function that accesses data from fish surveys. The fisheries division at MNDNR collects piles and piles of fish data and Lakefinder should have the most current survey information for managed lakes (although I don’t know how often the data are updated). Each year, local offices conduct trap net surveys in nearshore areas and gill net surveys to sample fish in open water. An example of the data is shown for Christmas Lake under ‘Fish Sampled for the 2007 Survey Year’. Species are listed in each row and species catch is listed in the columns. Fish per net can be considered a measure of abundance and is often correlated with angler catch. Lakes with higher fish per net, in either trap or gill nets, likely have better fishing.

The purpose of this blog is to describe a custom R function for gathering information about fish catch off LakeFinder. Those outside of Minnesota will have little use for the data gathered by this function. However, I think the most important implication is that anyone with a computer has the potential to develop customized functions to gather a potentially limitless amount of information from the internet. I hope the results I’ve obtained using this function will convince you of this fact. I’ll save the juicy tidbits for my next blog (where are the best lakes?).

What started as a straightforward idea quickly turned into a rather nasty problem handling exceptions when the website contained information that didn’t match the search criteria used in the function. The trick to writing a useful function is to incorporate enough flexibility so that it can accommodate all cases or forms of the data. For example, a function that can only process data frames will fail if it tries to process a list. This concept is central to programming and helps us understand why the fish catch function seems so complex. I had to create a function that could handle every instance in the fish catch table for each lake where the species data existed in a form that made it unambiguous to identify (i.e., it was obvious that the species was or was not in the survey for a given gear type).

Here are some examples illustrating how the function can identify the correct species and net type for each lake:

Fish name Net type Catch
Bluegill Gill 0.6
Northern Pike Trap 12.2
Bluegill Trap 32.1
Walleye Gill 12.6

This example is easy enough if we are looking for bluegill in trap nets. All we have to is pull out the rows with Bluegill in the first column, then identify the row with the correct net type, and append the value in the third column to our output. The next example illustrates a case that isn’t so clear.

Fish name Net type Catch
Bluegill Gill 0.6
Trap 32.1
Northern Pike Trap 12.2
Walleye Gill 12.6

Say we’ve written the function to identify fish and net data using only criteria that can handle data in the first table. We run into problems if we apply the same criteria to identify the catch of bluegill in trap nets for the second table (a common form for the Lakefinder survey data). We can’t identify all bluegill rows using species name in the first column since it isn’t labelled for trap nets. Nor can we identify bluegill based on our designated net type (Northern Pike were also caught in trap nets). One workaround is to pull out the single row with Bluegill and the following row. Then we can check which row of the two contains trap nets in the second column This assumes an empty row name for species is bluegill, which is a valid assumption considering data in the previous row are for bluegill. What if the species column for the row below bluegill isn’t blank and contains another species? We can’t use our ‘bluegill + next row’ rule, so we have to incorporate further selection techniques to reach an unambiguous solution. Hopefully you can see how complexity quickly emerges from what initially seemed like a simple task.

The final code for the function incorporated eight different selection methods based on how the data were presented on Lakefinder. The code is pretty nasty-looking, so here’s a flowchart that shows what the function does.

my image

Here’s what it looks like to use the function in R:

#create string of lake ids to search<-c('27013700','82004600','82010400')

#create string of fish names to search
fish.names<-c('Bluegill','Largemouth Bass','Northern Pike')

#use the function, fish.str=fish.names, net.type='Trap', trace=T, clean=F)

There are five arguments for The first two are required and the last three have default values. ‘dows’ is a string of lake DOW numbers that identify the lakes you want to search (these numbers are available on the DNR Data Deli). ‘fish.str’ is a string of common fish names that will be searched. The ‘net.type’ is a string that specifies whether you want to search trap net (default) or gill net data (as ‘Trap’ or ‘Gill’). ‘trace’ is a logical argument that indicates if text showing progress will be returned as the function runs. Finally, ‘clean’ is a logical argument that indicates if the function will return output with only lake and fish data. I’ll explain in a bit why I included this last argument.

Now some details (skip if you’re bored)… the flowchart indicates that the fish names in ‘fish.str’ are searched for each lake in ‘dows’ using a nested loop with lake dow at the top level and fish name at the bottom level (denoted by the dotted line box). The tricky part comes when we have to deal with the different forms of the data I mentioned earlier. The flowchart has several hexagonal decision nodes that determine how the data are interpreted in the function. Each time the function reaches an unambiguous result, the yes/no decisions include an exception code (‘e1’ through ‘e8’) that specifies the form of the data that was found. Starting with the first exception (‘e1’), we can see what is meant by terminal. An ‘e1’ is returned while searching a lake DOW if the lake doesn’t exist on LakeFinder or if a lake exists but the survey isn’t available. If the lake and survey both exist, the function jumps into the loop through each species. The first check within the fish loop determines if the fish being searched is present in the survey. If not, ‘e2’ (second exception) is entered for the species and the loop continues to the next. Zero values are returned if a fish isn’t in a survey, otherwise NA is returned for no survey or no lake data. The function continues through several more decision nodes until a final result for each species in each lake is obtained.

Finally, I included a safety measure that allows the function to continue running if some unforeseen issue is encountered (‘massive failure’). This exception (‘e8’) will be returned if an error occurs at any point during the fish search. This only occurred once because someone had entered two species names for the same fish and same net type in the same lake (unresolved, unforeseen ambiguity). Its a good idea to include safety nets in your functions because its very difficult to be 100% certain the function captures all exceptions in the data. Rather than having the function crash, the error is noted and the function continues.

Alright, enough of that. Let’s see what the function returns. Using the above code, we get the catch for the three lakes based on the net type and species we wanted to find:, fish.str=fish.names, net.type='Trap', trace=T, clean=F)

27013700; 1 of 3 
82004600; 2 of 3 
82010400; 3 of 3 
Time difference of 1.368078 secs
      dows     fish fish.val result
1 27013700 Bluegill    51.22     e6
2 82004600 Bluegill     9.22     e6
3 82010400 Bluegill    43.56     e6

$`Largemouth Bass`
      dows            fish fish.val result
4 27013700 Largemouth Bass     1.44     e6
5 82004600 Largemouth Bass        0     e7
6 82010400 Largemouth Bass     0.44     e6

$`Northern Pike`
      dows          fish fish.val result
7 27013700 Northern Pike     0.44     e6
8 82004600 Northern Pike     1.33     e6
9 82010400 Northern Pike     2.11     e6

The function returns a list for each species with data frames in each element of the list. The columns of each data frame include the lake name, species (which is redundant), the catch for the specified net type (number of fish per net), and the final decision in the function that lead to the result. I included the ‘result’ column in the data so I could see if the function was actually doing what it was supposed to do. You get cleaner results if you set ‘clean’ as true, obviously. The output will have the result column and the redundant fish column in the data frames removed. Lakes not in Lakefinder or lakes without fish surveys will also be removed.

I’ve covered the nitty-gritty of what the function does and how it can be used. The data that can be gathered are of course more interesting than the function and I save a presentation of this material for my next blog. For now, here’s a teaser showing presence/absence of common carp using data obtained with the function. Common carp (not Asian) are ubiquitous in Minnesota and anglers may be interested in carp locations based on the impacts invasive fish have on native species. We can use the function to locate all the lakes surveyed by Minnesota DNR that contain common carp. Thanks to our data fishing we now know to avoid common carp by fishing up north. Stay tuned for the next installment!

Get the function here (requires XML package):

Breaking the rules with spatial correlation

Students in any basic statistics class are taught linear regression, which is one of the simplest forms of a statistical model. The basic idea is that a ‘response’ variable can be mathematically related to one or any number of ‘explanatory’ variables through a linear equation and a normally distributed error term. With any statistical tool, linear regression comes with a set of assumptions that have to be met in order for the model to be valid. These assumptions include1:

  • model residuals are individually normally distributed
  • constant variance or homogeneity (i.e., even spread of residuals)
  • explanatory variables are deterministic
  • independence of observations or no pattern in residuals

When I first learned about linear regression I was overly optimistic about how easy it was to satisfy each of these assumptions. Once I learned to be more careful I started using techniques to verify if each of these assumptions were met. However, no one bothered to tell me what actually happens when these assumptions are violated, at least not in a quantitative sense. Most textbooks will tell you that if your model violates these assumptions the statistics or coefficients describing relationships among variables will be biased. In other words, your model will tell you information that is systematically different from relationships describing the true population.

Of course the bias in your model depends on what rule you’ve violated and it would be impractical (for the purpose of this blog) to conduct a thorough analysis of each assumption. However, I’ve always wondered what would happen if I ignored the fourth point above regarding independence. I’ve had the most difficulty dealing with auto-correlation among observations and I’d like to get a better idea of how a model might be affected if this assumption is ignored. For example, how different would the model coefficients be if we created a regression model without accounting for a correlation structure in the data? In this blog I provide a trivial but potentially useful example to address this question. I start by simulating a dataset and show how the model estimates are affected by ignoring spatial dependency among observations. I also show how these dependencies can be addressed using an appropriate model. Before I continue, I have to give credit to the book ‘Mixed Effects Models and Extensions in Ecology with R’1, specifically chapters 2 and 7. The motivation for this blog comes primarily from these chapters and interested readers should consult them for more detail.

Let’s start by importing the packages gstat, sp, and nlme. The first two packages have some nifty functions to check model assumptions and the nlme package provides models that can be used to account for various correlation structures.

#import relevant packages

Next we create a simulated dataset, which allows us to verify the accuracy of our model if we know the ‘true’ parameter values. The dataset contains 400 observations that are each spatially referenced by the ‘lat’ and ‘lon’ objects. The coordinates were taken from a uniform distribution (from -4 to 4) using the runif function. There are two explanatory variables that are both taken from the standard normal distribution using the rnorm function. I’ve used the set.seed function so the results can be reproduced. R uses pseudo-random number generation, which can be handy for sharing examples.

#create simulated data, lat and lon from uniformly distributed variable
#exp1 and exp2 from random normal

The response variable is a linear combination of the explanatory variables and follows the form of a standard linear model. I’ve set the parameters for the linear model as 1, 4, -3, and -2 for the intercept and slope values. The error term is also normally distributed. Note that I’ve included latitude in the model to induce a spatial correlation structure.

#resp is linear combination of explanatory variables
#plus an error term that is normally distributed

Setting the parameter values is key because we hope that any model we use to relate the different variables will return values that are very close to the actual and show minimal variation. In the real world, we do not know the actual parameter values so we have to be really careful that we are using the models correctly.

Next, we can plot the variables and run some diagnostics to verify we’ve created appropriate data. Plotting the data frame produces scatterplots that indicate correlation among the simulated variables.

#pair plots of variables


The cor function verifies the info in the figure by returning Pearson correlation coefficients. Note the sign and magnitude of the correlations of the response variable with all other variables. The explanatory variables and latitude are not correlated with each other.

#correlation between variables

           resp       exp1        exp2         lat
resp  1.0000000 0.53416042 -0.44363892 -0.70131965
exp1  0.5341604 1.00000000  0.01703620  0.03419987
exp2 -0.4436389 0.01703620  1.00000000  0.04475536
lat  -0.7013196 0.03419987  0.04475536  1.00000000

Now that we’ve simulated our data, we can create a linear model to see if the parameter coefficients are similar to the actual. In this example, we’re ignoring the spatial correlation structure. If we were savvy modelers, we would check for spatial correlation prior to building our model. Spatial correlation structures in actual data may also be more complex than a simple latitudinal gradient, so we need to be extra careful when evaluating the independence assumption. We create the model using the basic lm function.


             Estimate Std. Error    t value     Pr(>|t|)
(Intercept)  1.298600  0.2523227   5.146587 4.180657e-07
exp1         3.831385  0.2533704  15.121673 4.027448e-41
exp2        -3.237224  0.2561525 -12.637879 5.276878e-31

The parameter estimates are not incredibly different form the actual values. Additionally, the standard errors of the estimates are not large enough to exclude the true parameter values. We can verify this using a t-test with the null hypothesis for each parameter set at the actual (the default null in lm is zero). I’ve created a function for doing this that takes the observed mean value as input rather than the normal t.test function that uses a numeric vector of values as input. The arguments for the custom function are the observed mean (, null mean (mu), standard error (se), degrees of freedom (deg.f), and return.t which returns the t-statistic if set as true.

#function for testing parameter values against actual
  if(return.t) return(round((,3))

#for first explanatory variable
t_test(3.8314, 4, 0.2534, 397)
[1] 0.5062122

The p-value indicates that the estimated parameter value for the first explanatory variable is not different from the actual. We get similar results if we test the other parameters. So far we’ve shown that if we ignore the spatial correlation structure we get fairly reasonable parameter estimates. Can we improve our model if we account for the spatial correlation structure?

Before we model spatial correlation we first have to identify its existence (remember, we’re assuming we don’t know the response is correlated with latitude). Let’s dig a little deeper by first describing some tools to evaluate the assumption of independence. The easiest way to evaluate this assumption in a linear model is to plot the model residuals by their spatial coordinates. We would expect to see some spatial pattern in the residuals if we have not accounted for spatial correlation. The bubble plot function in the sp package can be used to plot residuals.



A nice feature of the bubble function is that it color codes the residuals based on positive and negative values and sets the size of the points in proportion to the magnitude. An examination of the plot shows a clear residual pattern with latitude such that smaller values are seen towards the north.

The bubble plot provides a qualitative means of evaluating spatial correlation. A more quantitative approach is to evaluate semivariance, which provides a measure of spatial correlation between points at different distances. Points closer to one another are more likely to be similar if observations in our dataset are spatially correlated. The variety of statistics that can be applied to spatial data is staggering and evaluations of semivariance are no exception. For this example, it is sufficient to know that points closer to one another will have smaller values of semivariance (more similar) and those farther apart will have larger values (less similar) if points are spatially correlated. We would also expect calculated values of semivariance to vary depending on what direction we are measuring (i.e., anisotropy). For example, we might not detect spatial correlation if we estimate semivariance strictly in an east to west direction, whereas semivariance would vary quite dramatically along a north to south direction. We can apply the variogram function (gstat package) to estimate semivariance and also account for potential differences depending on direction.



When we use the variogram function we can specify directions in which semivariance is measured using the alpha argument. Here we’ve specified that we’d like to measure semivariance in the residuals in a northern, north-eastern, eastern, and south-eastern direction. This is the same as estimating semivariance in directions 180 degrees from what we’ve specified, so it would be redundant if we looked at semivariance in a southern, south-western, western, and north-western direction. The plot shows that spatial correlation is strongest in a north-south direction and weakest along an east-west direction, which confirms the latitudinal change in our response variable.

Our analyses have made it clear that the model does not account for spatial correlation, even though our parameter estimates are not significantly different from the actual values. If we account for this spatial correlation can we improve our model? An obvious solution would be to include latitude as an explanatory variable and obtain a parameter estimate that describes the relationship. Although this would account for spatial correlation, it would not provide any explanation for why our response variable changes across a latitudinal gradient. That is, we would not gain new insight into causation, rather we would only create a model with more precise parameters. If we wanted to account for spatial correlation by including an additional explanatory variable we would have to gather additional data based on whatever we hypothesized to influence our response variable. Temperature, for example, may be a useful variable because it would likely be correlated with latitude and its relationship with the response variable could be biologically justified.

In the absence of additional data we can use alternative models to account for spatial correlation. The gls function (nlme package) is similar to the lm function because it fits linear models, but it can also be used to accommodate errors that are correlated. The implementation is straightforward with the exception that you have specify a form of the spatial correlation structure. Identifying this structure is non-trivial and the form you pick will affect how well your model accounts for spatial correlation. There are six options for the spatial correlation structure. Picking the optimal correlation structure requires model comparison using AIC or similar tools.

For our example, I’ve shown how we can create a model using only the Gaussian correlation structure. Also note the form argument, which can be used to specify a covariate that captures the spatial correlation.

#refit model with correlation structure

                Value Std.Error   t-value p-value
(Intercept) 154.49320  393.3524   0.39276  0.6947
exp1          3.99840    0.0486  82.29242  0.0000
exp2         -3.01688    0.0491 -61.38167  0.0000

The gls function seems to have trouble estimating the intercept. If we ignore this detail, the obvious advantage of accounting for spatial correlation is a reduction in the standard error for the slope estimates. More precisely, we’ve reduced the uncertainty of our estimates by over 500% from those obtained using lm, even though our initial estimates were not significantly different from the actual.

The advantage of accounting for spatial correlation is a huge improvement in the precision of the parameter estimates. Additionally, we do not sacrifice degrees of freedom if we use the gls function to model spatial correlation. Hopefully this example has been useful for emphasizing the need to account for model assumptions. I invite further discussion on methods for modeling spatial correlation.

Get the code here:

1Zuur, A.F., Ieno, E.N., Walker, N.J., Saveliev, A.A., Smith, G.M., 2009. Mixed Effects Models and Extensions in Ecology with R. Springer, New York.