Stealing from the internet: Part 1

Well, not stealing but rather some handy tools for data mining… About a year ago I came across the package XML as I was struggling to get some data from various web pages. The purpose of this blog is to describe how this package can be used to quickly gather data from the internet. I'll first describe how the functions are used and then show how they can be included in a custom function to quickly 'steal' what we need.

I realize that data mining from internet sources is not a new concept by any means (a more thorough intro), but the practice is new to me and I have found it very useful in the applications I've attempted. For example, I routinely collect supporting data for large datasets that describe lakes in Minnesota. A really useful website that contains a lot of this information is LakeFinder hosted by the Minnesota Department of Natural Resources. The website can be used to access all sorts of information about a lake just by punching in a lake name or unique 8-digit id number. Check out Lake Minnetonka (remember Purple Rain??). The page has lots of information... lake characteristics, fish surveys, consumption warnings, etc. Also note the URL. The last 8 digits are the unique lake ID for Minnetonka assigned by MNDNR. This part of the URL comes into play later.

What if I want information for a couple dozen lakes or even several hundred (yeah, or over 10,000)? We have a few options. Obviously we could go through lake by lake and copy the information by hand, but this causes headaches and eventually a desire to harm oneself. We could also contact the site administrator and batch request the data directly, but this may take some time or may require repeated requests depending on our needs. As you'll probably soon realize, I hate doing things that are inefficient and the XML package provides us with some useful tools to quickly gather information from the internet.

I'll walk through an example that shows how we can get the maximum depth of a lake by accessing the HTML directly from the website. As a disclaimer, I have very little experience with HTML or any other markup language (other than LaTeX) so I encourage feedback if the approach below can be implemented more efficiently.

#install and load packages
install.packages('XML')
library(XML)

The XML package has tons of functions and I'm not going to pretend like I understand them all. However, the htmlTreeParse function (or xmlTreeParse) can import raw HTML code from web pages and will be useful for our purposes. Let's import the HTML code for Lake Minnetonka (remember the last 8 digits of the URL describe the unique lake ID).

#import the raw HTML for Lake Minnetonka
html.raw<-htmlTreeParse(
    'http://www.dnr.state.mn.us/lakefind/showreport.html?downum=27013300',
    useInternalNodes=TRUE
    )

#get the text from every paragraph tag
html.parse<-xpathApply(html.raw, "//p", xmlValue)

The html.raw object is not immediately useful because it contains all of the raw HTML for the entire webpage. We can pull out the pieces we want with the xpathApply function, which selects the nodes that match the path argument (an XPath expression) and applies a function to each one. In this case, "//p" selects every paragraph tag and xmlValue returns the text inside each of them.
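To see what this parsing step does without hitting a live website, here is a minimal sketch that runs the same XPath query on a small HTML string. The snippet and its contents are made up for illustration and are not the actual LakeFinder markup.

```r
library(XML)

# Hypothetical HTML snippet standing in for a downloaded page
html.txt <- '<html><body>
  <h1>Example Lake</h1>
  <p>Area (acres): 100</p>
  <p>Maximum Depth (ft): 25</p>
</body></html>'

# Parse from a string instead of a URL
doc <- htmlParse(html.txt, asText = TRUE)

# "//p" selects every <p> node; xmlValue returns the text inside each one
# xpathSApply simplifies the result to a character vector
p.text <- xpathSApply(doc, "//p", xmlValue)
p.text
```

The heading is skipped because it is not inside a paragraph tag, so only the two paragraph strings come back.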

We now have a list of R objects once we use xpathApply, so we don't have to mess with HTML/XML anymore. The trick is to parse the text in this list even further to find what we want. If we go back to the Lake Minnetonka page, we can see that 'Maximum Depth (ft)' precedes the listed depth for Lake Minnetonka. We can use this text to find the appropriate list element in html.parse that contains the depth data using the grep function from the base package.

#keep only the elements that contain 'Depth'
robj.parse<-grep('Depth',unlist(html.parse),value=TRUE)
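As a quick illustration of what grep returns, here is a small sketch on a made-up character vector. The strings are hypothetical stand-ins for the paragraph text and are not copied from the LakeFinder page.

```r
# Hypothetical paragraph text, standing in for unlist(html.parse)
txt <- c('Minnesota Department of Natural Resources',
         'Maximum Depth (ft): 25',
         'Depth of plant growth (ft): 12')

# Default: the positions of the matching elements
hits <- grep('Depth', txt)

# value = TRUE: the matching strings themselves
hits.txt <- grep('Depth', txt, value = TRUE)

# A longer literal pattern narrows the match to a single element
max.txt <- grep('Maximum Depth', txt, value = TRUE, fixed = TRUE)

hits
hits.txt
max.txt
```

grep takes a regular expression rather than a shell-style wildcard, so a plain 'Depth' is enough to match the word anywhere in a string.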

It's not a huge surprise that we can get a return from grep for more than one element that contains 'depth'. We'll need to select the correct element that has the depth information we want (in this case, the first element) and further parse the string using the strsplit function.

robj.parse<-robj.parse[[1]] #select first element

#grab the text between 'ft): ' and 'Water', then convert to numeric
depth.parse<-as.numeric(strsplit(strsplit(robj.parse,'ft): ',fixed=TRUE)[[1]][2],'Water',fixed=TRUE)[[1]][1])

The code for depth.parse is really messy but all it does is make two calls to strsplit to grab the depth value based on the text that is directly before and after the info we need ('ft): ' and 'Water', respectively). The final value is converted from text to a numeric value. Seasoned programmers will probably cringe at this code since it will not return the correct value if the text surrounding the depth on the web page ever changes. Yeah, this isn't the most orthodox way of coding but it works for what we need. Undoubtedly there are more robust ways of getting this information but this works just fine for static websites.
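One possibly sturdier alternative is to capture the number with a single regular expression and return NA when the label is missing. This is only a sketch: get.depth is a hypothetical helper, and the sample string is an assumption about how the paragraph text looks based on the markers described above.

```r
# Hypothetical helper: pull the number that follows 'Maximum Depth (ft): '
get.depth <- function(x) {
  # Parentheses are escaped because they are special in regular expressions
  pattern <- '.*Maximum Depth \\(ft\\): *([0-9]+\\.?[0-9]*).*'
  # Return NA rather than failing when the label is not present
  if (!grepl(pattern, x)) return(NA_real_)
  # Replace the whole string with the captured number, then convert
  as.numeric(sub(pattern, '\\1', x))
}

# Made-up strings in the shape described above
get.depth('Maximum Depth (ft): 113Water Clarity (ft): 10.2')
get.depth('no depth reported')
```

This version does not depend on the word 'Water' coming right after the value, but it still breaks if the label itself is reworded.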

Additionally, we can combine all the code from above into a function that parses everything at once.

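depth.fun<-function(lake){

  #build the URL from the 8-digit lake id
  url.in<-paste(
    'http://www.dnr.state.mn.us/lakefind/showreport.html?downum',
    lake,
    sep='='
    )

  #import the raw HTML
  html.raw<-htmlTreeParse(url.in,useInternalNodes=TRUE)

  #get the text from every paragraph tag
  html.parse<-xpathApply(html.raw,path="//p",fun=xmlValue)

  #keep elements that contain 'Depth' and select the first
  robj.parse<-grep('Depth',unlist(html.parse),value=TRUE)[1]

  #grab the text between 'ft): ' and 'Water', then convert to numeric
  depth.parse<-as.numeric(
    strsplit(strsplit(robj.parse,'ft): ',fixed=TRUE)[[1]][2],'Water',fixed=TRUE)[[1]][1]
    )

  return(depth.parse)

  }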

depth.fun('27013300')

All we do now is put in the 8-digit lake identifier (as a character string) and out comes the depth. We can make repeated calls to the function to get data for any lake we want, so long as we know the 8-digit identifier. The lake id number is critical because this defines where the raw HTML comes from (i.e., what URL is accessed). Notice that the first part of depth.fun pastes the input id text with the URL, which is then passed to later functions.

Here's an example getting the information for several lakes using sapply to make repeated calls to depth.fun.

lake.ids<-c('27013700','82004600','82010400')
sapply(lake.ids,depth.fun)
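When looping over many ids, a single bad page can stop the whole run. One way to guard against that is to wrap the scraping function in tryCatch so a failure returns NA instead of an error. The sketch below uses a made-up stand-in function (fake.depth.fun) so it runs offline; safely is a hypothetical helper name and the returned values are meaningless.

```r
# Hypothetical wrapper: turn errors from any scraping function into NA
safely <- function(fun) {
  function(id) {
    tryCatch(fun(id), error = function(e) NA_real_)
  }
}

# Stand-in for a scraping function so the sketch runs offline
# (it fails for one id and returns a made-up number otherwise)
fake.depth.fun <- function(lake) {
  if (lake == 'bad-id') stop('page could not be parsed')
  nchar(lake) * 10
}

ids <- c('27013700', 'bad-id', '82010400')

# sapply keeps the ids as names, so failures are easy to spot
depths <- sapply(ids, safely(fake.depth.fun))
depths
```

With a real website it may also be considerate to pause between requests, for example with Sys.sleep(1) inside the wrapper.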

It's easy to see how we can use the function to get data for lots of lakes at the same time, although this example is trivial if you don't care about lakes. However, I think the code is useful if you ever encounter a situation where data are stored online with predictable URLs with predictable text strings surrounding the data you need. I've already used variants of this code to get other data (part 2 on the way). The only modification required to make the function useful for gathering other data is changing the URL and whatever text needs to be parsed to get what you need. So there you have it.
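To make the 'change the surrounding text' idea concrete, here is a rough sketch of a small helper that returns whatever sits between two marker strings. The name between and the sample string are hypothetical, and the numbers in the string are invented.

```r
# Hypothetical helper: return the text found between two literal markers
between <- function(txt, before, after) {
  # Everything after the first occurrence of 'before'
  tail.part <- strsplit(txt, before, fixed = TRUE)[[1]][2]
  # Everything up to the first occurrence of 'after'
  strsplit(tail.part, after, fixed = TRUE)[[1]][1]
}

# Made-up text with two values surrounded by predictable labels
x <- 'Area (acres): 14004Maximum Depth (ft): 113Water Clarity (ft): 10.2'

depth <- as.numeric(between(x, 'Maximum Depth (ft): ', 'Water'))
area <- as.numeric(between(x, 'Area (acres): ', 'Maximum'))

depth
area
```

Using fixed = TRUE treats the markers as literal text, so characters such as parentheses do not need to be escaped.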


About this blog

Greetings and welcome to my blog!

My name is Marcus and I’m a sixth-year graduate student and fourth-year PhD student in the Conservation Biology Graduate Program at the University of Minnesota, Twin Cities campus. My research focuses on the development and implementation of biological monitoring techniques for Minnesota’s 11,842 lakes. I work with hundreds of biological surveys that have graciously been provided to me by the Minnesota Department of Natural Resources. Not surprisingly, I have spent the last six years of my life confined to 161 Alderman Hall wading through piles of data. I rarely see the sun and I spend a lot of time tinkering around with R (http://www.r-project.org/).

Fellow classmates have suggested that I have something to offer regarding my knowledge of R. Whether or not that is true, I have started this blog as a means to communicate some useful tips and tricks I have learned in the five years I have been using R. I have accumulated literally hundreds of .R files on my hard drive, some of which have been useful in my own research and some of which were never useful to me but may be useful to others. I hope to share some of these tidbits through this blog. I will never post data from the DNR (although these are public property) but will post scripts I have developed to explore these data that likely have use in other contexts. I also work extensively with ArcMap and Python and may post some GIS-related topics from time to time. My latest obsession has focused on reproducible research using Sweave and LaTeX, so expect some posts regarding these topics. I also place a lot of importance on how results are displayed in figures and tables. I spend an absurd amount of time creating pretty pictures so expect a few posts related to R graphics (and ‘easy’ table creation using the xtable package).

I have no official training in computer science despite what some may view as an irrational obsession with R. It’s an absolute miracle that I don’t have carpal tunnel and am not morbidly obese. Most, if not all, of my fellow classmates would be disgusted at the amount of time I spend on the computer. Most students in my field thrive on field work and, in my opinion, have not had as many opportunities as I have had to develop their computational skills. I have found that many young and talented graduate students in the ecological sciences have skills in other areas that far exceed mine, but they lack adequate training to help them tackle the beast that is R. I’m approaching this blog with the philosophy that the skills I have learned are the product of countless hours in front of a computer, while others have been learning a different set of skills outside of the office. My hope is that the information I present here will assist others in their own research but may also be helpful to individuals in other disciplines.

I encourage productive discussion and criticisms of my posts. I hope that this information is helpful in some way, regardless of your background or expertise.

Cheers,

Marcus