# Average dissertation and thesis length, take two

About a year ago I wrote a post describing the average length of dissertations at the University of Minnesota. I've been meaning to expand that post by adding data from masters theses since the methods for gathering/parsing the records are transferable. This post provides some graphics and links to R code for evaluating dissertation (doctorate) and thesis (masters) data from an online database at the University of Minnesota. In addition to describing data from masters theses, I've collected the most recent data on dissertations to provide an update on my previous post. I've avoided presenting the R code for brevity, but I invite interested readers to have a look at my Github repository where all source code and data are stored. Also, please, please, please note that I've since tried to explain that dissertation length is a pretty pointless metric of quality (also noted here), so interpret the data only in the context that they're potentially descriptive of the nature of each major.

Feel free to fork/clone the repository to recreate the plots. The parsed data for theses and dissertations are saved as .RData files, 'thes_parse.RData' and 'diss_parse.RData', respectively. Plots were created in 'thes_plot.r' and 'diss_plot.r'. The plots comparing the two were created in 'all_plo.r'. To briefly summarize, the dissertation data include 3037 records from 2006 to present. This differs from my previous post by including all majors with at least five records, in addition to the most current data. The masters thesis data contain 930 records from 2009 to present. You can get an idea of the relative page ranges for each by taking a look at the plots. I've truncated all plots to maximum page ranges of 500 and 250 for the dissertation and thesis data, as only a handful of records exceeded these values. I'm not sure if these extremes are actually real data or entered in error, and to be honest, I'm too lazy to verify them myself. Just be cautious that there are some errors in the data and all plots are for informational purposes only, as they say…
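As a minimal sketch of the kind of summary behind these plots (the data frame `diss_dat` and its `major`/`pages` columns are hypothetical stand-ins; the actual parsed objects live in the .RData files on Github):

```r
# hypothetical stand-in for the parsed dissertation records
diss_dat <- data.frame(
  major = rep(c('Ecology', 'History', 'Mathematics'), each = 3),
  pages = c(150, 180, 210, 220, 260, 300, 90, 110, 130)
)

# median page count by major, the summary emphasized in the plots
aggregate(pages ~ major, data = diss_dat, FUN = median)

# flag extreme records before truncating plots at 500 pages
subset(diss_dat, pages > 500)
```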

-Marcus

# A simple workflow for using R with Microsoft Office products

The challenge of integrating Microsoft products with R software has been an outstanding issue for several years. Reasons for these issues are complicated and related to fundamental differences in developing proprietary vs open-source products. To date, I don't believe there has been a satisfactory solution, but I present this blog as my attempt to work around at least some of the issues of using the two. As a regular contributor to R-bloggers, I stress that one should use MS products as little as possible given the many issues that have been described (for example, here, here, and here). It's not my intent to pick on Microsoft. In fact, I think Excel is a rather nifty program that has its place in specific situations. However, most of my work is not conducive to the point-and-click style of spreadsheet analysis, and the surprisingly limited number of operations available in Excel prevents all but the simplest analyses. I try my best to keep my work within the confines of RStudio, given its integration with multiple document preparation systems.

I work with several talented researchers that have different philosophies than my own on the use of Microsoft products. It’s inevitable that we’re occasionally at odds. Our difficulties go both directions — my insistence on using pdfs for creating reports or manuscripts and the other party’s inclination towards the spreadsheet style of analysis. It seems silly that we’re limited by the types of medium we prefer. I’ve recently been interested in developing a workflow that addresses some of the issues of using end-products from different sources under the notion of reproducibility. To this end, I used Pandoc and relevant R packages (namely gdata and knitr) to develop a stand-alone workflow that allows integration of Microsoft products with my existing workflows. The idea is simple. I want to import data sent to me in .xlsx format, conduct the analysis and report generation entirely within RStudio, and convert the output to .docx format on completion. This workflow allows all tasks to be completed within RStudio, provided the supporting documents, software, and packages work correctly.

Of course, I don’t propose this workflow as a solution to all issues related to Office products and R. I present this material as a conceptual and functional design that could be used by others with similar ideas. I’m quite happy with this workflow for my personal needs, although I’m sure it could be improved upon. I describe this workflow using the pdf below and provide all supporting files on Github: https://github.com/fawda123/pan_flow.

\documentclass[xcolor=svgnames]{beamer}
\usetheme{Boadilla}
\usecolortheme[named=SeaGreen]{structure}
\usepackage{graphicx}
\usepackage{breqn}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage{verbatim}
\usepackage{tikz}
\usetikzlibrary{shadows,arrows,positioning}
\definecolor{links}{HTML}{2A1B81}
\hypersetup{colorlinks,linkcolor=links,urlcolor=links}
\usepackage{pgfpages}

\tikzstyle{block} = [rectangle, draw, text width=9em, text centered, rounded corners, minimum height=3em, minimum width=7em, top color = white, bottom color=brown!30,  drop shadow]

\newcommand{\ShowSexpr}[1]{\texttt{{\char`\\}Sexpr\{#1\}}}

\begin{document}

\title[R with Microsoft]{A simple workflow for using R with Microsoft products}
\author[M. Beck]{Marcus W. Beck}

\institute[USEPA NHEERL]{USEPA NHEERL Gulf Ecology Division, Gulf Breeze, FL\\
Email: \href{mailto:beck.marcus@epa.gov}{beck.marcus@epa.gov}, Phone: 850 934 2480}

\date{May 21, 2014}

%%%%%%
\begin{frame}
\vspace{-0.3in}
\titlepage
\end{frame}

%%%%%%
\begin{frame}{The problem...}
\begin{itemize}
\item R is great and has an increasing user base\\~\\
\item RStudio is integrated with multiple document preparation systems \\~\\
\item Output documents are not in a format that facilitates collaboration with
non R users, e.g., pdf, html \\~\\
\item Data coming to you may be in a proprietary format, e.g., xls spreadsheet
\end{itemize}
\end{frame}

%%%%%%
\begin{frame}{The solution?}
\begin{itemize}
\item Solution one - Make liberal use of `projects' within RStudio \\~\\
\item Solution two - Use \texttt{gdata} package to import Excel data \\~\\
\item Solution three - Get pandoc to convert document formats - \href{http://johnmacfarlane.net/pandoc/}{http://johnmacfarlane.net/pandoc/} \\~\\
\end{itemize}
\onslide<2->
\large
\centerline{\textit{Not recommended for simple tasks unless you really, really love R}}
\end{frame}

%%%%%
\begin{frame}{An example workflow}
\begin{itemize}
\item I will present a workflow for integrating Microsoft products within RStudio as an approach to working with non R users \\~\\
\item Idea is to never leave the RStudio environment - dynamic documents! \\~\\
\item General workflow... \\~\\
\end{itemize}
\small
\begin{center}
\begin{tikzpicture}[node distance=2.5cm, auto, >=stealth]
\onslide<2->{
\node[block] (a) {1. Install necessary software and packages};}
\onslide<3->{
\node[block] (b)  [right of=a, node distance=4.2cm] {2. Create project in RStudio};
\draw[->] (a) -- (b);}
\onslide<4->{
\node[block] (c)  [right of=b, node distance=4.2cm]  {3. Setup supporting docs/functions};
\draw[->] (b) -- (c);}
\onslide<5->{
\node[block] (d)  [below of=a, node distance=2.5cm]  {4. Import with \texttt{gdata}, summarize};
\draw[->] (c) -- (d);}
\onslide<6->{
\node[block] (e)  [right of=d, node distance=4.2cm]  {5. Create HTML document using knitr Markdown};
\draw[->] (d) -- (e);}
\onslide<7->{
\node[block] (f)  [right of=e, node distance=4.2cm]  {6. Convert HTML doc to Word with Pandoc};
\draw[->] (e) -- (f);}
\end{tikzpicture}
\end{center}
\end{frame}

%%%%%%
\begin{frame}[shrink]{The example}
You are sent an Excel file of data to summarize and report but you love R and want to do everything in RStudio...
<<echo = F, results = 'asis', message = F>>=
library(gdata)
library(xtable)

prl_pth <- 'C:/strawberry/perl/bin/perl.exe'
url <- 'http://beckmw.files.wordpress.com/2014/05/my_data.xlsx'
dat <- read.xls(xls = url, sheet = 'Sheet1', perl = prl_pth)
out.tab <- xtable(dat, digits=4)
print.xtable(out.tab, type = 'latex', include.rownames = F,
size = 'scriptsize')
@
\end{frame}

%%%%%%
\begin{frame}{Step 1}
Install necessary software and packages \\~\\
\onslide<1->
\begin{itemize}
\onslide<2->
\item R and RStudio (can do with other R editors)\\~\\
\item Microsoft Office\\~\\
\onslide<3->
\item Strawberry Perl for using \texttt{gdata} package\\~\\
\item Pandoc\\~\\
\onslide<4->
\item Packages: \texttt{gdata}, \texttt{knitr}, \texttt{utils}, \texttt{xtable}, others as needed...
\end{itemize}
\end{frame}

%%%%%%
\begin{frame}{Step 2}
Create a project in RStudio \\~\\
\begin{itemize}
\item Create a folder or use existing on local machine \\~\\
\item Add .Rprofile file to the folder for custom startup \\~\\
\item Move all data you are working with to the folder \\~\\
\item Literally create project in RStudio \\~\\
\item Set options within RStudio \\~\\
\end{itemize}
\end{frame}

%%%%%%
\begin{frame}[fragile]{Step 3}
Setup supporting docs/functions, i.e., .Rprofile, functions, report, master
\scriptsize
\begin{block}{.Rprofile}
<<echo = T, eval = F, results = 'markup'>>=
# library path
.libPaths('C:\\Users\\mbeck\\R\\library')

# startup message
cat('My project...\n')

# packages to use
library(utils) # for system commands
library(knitr) # for markdown
library(gdata) # for import xls
library(reshape2) # data format conversion
library(xtable) # easy tables
library(ggplot2) # plotting

# perl path for gdata
prl_pth <- 'C:/strawberry/perl/bin/perl.exe'

# functions to use
source('my_funcs.r')
@
\end{block}
\end{frame}

%%%%%%
\begin{frame}[t, fragile]{Step 3}
Setup supporting docs/functions, i.e., .Rprofile, functions, report, master
\scriptsize
\begin{block}{my\_funcs.r}
<<echo = T, eval = F, results = 'markup'>>=
######
# functions for creating report,
# created May 2014, M. Beck

######
# processes data for creating output in report,
# 'dat_in' is input data as data frame,
# output is data frame with converted variables
proc_fun<-function(dat_in){

# convert temp to C
dat_in$Temperature <- round((dat_in$Temperature - 32) * 5/9)

#  convert data to long format
dat_in <- melt(dat_in, measure.vars = c('Restoration', 'Reference'))

return(dat_in)

}

######
# creates linear model for data,
# 'proc_dat' is processed data returned from 'proc_fun',
# output is linear model object
mod_fun <- function(proc_in) lm(value ~ variable + Year, data = proc_in)
@
\end{block}
\end{frame}

%%%%%%
\begin{frame}[fragile,shrink]{Step 3}
Setup supporting docs/functions, i.e., .Rprofile, functions, report, master
\scriptsize
\begin{block}{report.Rmd}
\begin{verbatim}
======================
Here's a report I made for `r gsub('/|.xlsx','',name)`
----------------------

```{r echo=F, include=F}
# import data
url <- paste0('http://beckmw.files.wordpress.com/2014/05', name)
dat <- read.xls(xls = url, sheet = 'Sheet1', perl = prl_pth)

# process data for tables/figs
dat <- proc_fun(dat)

# model of data
mod <- mod_fun(dat)
```

### Model summary
```{r results='asis', echo=F}
print.xtable(xtable(mod, digits = 2), type = 'html')
```

### Figure of restoration and reference by year
```{r reg_fig, echo = F, fig.width = 5, fig.height = 3, dpi=200}
ggplot(dat, aes(x = Year, y = value, colour = variable)) +
geom_point() +
stat_smooth(method = 'lm')
```
\end{verbatim}
\end{block}
\end{frame}

%%%%%%
\begin{frame}[t, fragile]{Step 3}
Setup supporting docs/functions, i.e., .Rprofile, functions, report, master
\scriptsize
\begin{block}{master.r}
<<echo = T, eval = F, results = 'markup'>>=
# file to process
name <- '/my_data.xlsx'

# rmd to html
knit2html('report.Rmd')

# pandoc conversion of html to word doc
system(paste0('pandoc -o report.docx report.html'))
@
\end{block}
\end{frame}

%%%%%%
\begin{frame}[fragile]{Steps 4 - 6}
\small
After creating supporting documents in the project directory, final steps are completed by running `master.r'
\begin{itemize}
\item Step 4 - xls file imported using \texttt{gdata} package, implemented in `report.Rmd'
\item Step 5 - HTML document created by converting `report.Rmd' with \texttt{knit2html} in `master.r'
\item Step 6 - HTML document converted to Word with Pandoc by invoking system command
\end{itemize}
\begin{block}{master.r}
<<echo = T, eval = F, results = 'markup'>>=
# file to process
name <- '/my_data.xlsx'

# rmd to html
knit2html('report.Rmd')

# pandoc conversion of html to word doc
system(paste0('pandoc -o report.docx report.html'))
@
\end{block}
\end{frame}

\end{document}


To use the workflow, start a new version control project through Git in RStudio, pull the files from the repository, and run the master file. An excellent introduction for using RStudio with Github can be found here. I've also included two Excel files that can be used to generate the reports. You can try using each one by changing the name variable in the master file and then running the commands:

name <- 'my_data.xlsx'
knit2html('report.Rmd')
system(paste0('pandoc -o report.docx report.html'))


or…

name <- 'my_data_2.xlsx'
knit2html('report.Rmd')
system(paste0('pandoc -o report.docx report.html'))


The output .docx file should be different depending on which Excel file you use as input. As the pdf describes, none of this will work if you don't have the required software/packages (R/RStudio, Strawberry Perl, Pandoc, MS Office, knitr, gdata, etc.). You'll also need Git installed if you are pulling the files for local use (again, see here). I'd be interested to hear if anyone finds this useful or any general comments on improvements/suggestions for the workflow.
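One way to fail fast is to check for the pieces from within R before running the master file; Sys.which and requireNamespace are base R, so this sketch has no dependencies of its own:

```r
# check that the command-line tools are on the system path
# (an empty string from Sys.which means the tool was not found)
Sys.which(c('pandoc', 'perl'))

# check that required packages are installed (TRUE/FALSE per package)
sapply(c('gdata', 'knitr', 'xtable'), requireNamespace, quietly = TRUE)
```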

Cheers,

Marcus

# How much code have you written?

This past week I attended the National Water Quality Monitoring Conference in Cincinnati. Aside from spending my time attending talks, workshops, and meeting like-minded individuals, I spent an unhealthy amount of time in the hotel bar working on this blog post. My past experiences mixing coding and beer have suggested the two don’t mix, but I was partly successful in writing a function that will be the focus of this post.

I’ve often been curious how much code I’ve written over the years since most of my professional career has centered around using R in one form or another. In the name of ridiculous self-serving questions, I wrote a function for quantifying code lengths by file type. I would describe myself as somewhat of a hoarder with my code in that nothing ever gets deleted. Getting an idea of the total amount was a simple exercise in finding the files, enumerating the file contents, and displaying the results in a sensible manner.

I was not surprised that several functions in R already exist for searching for file paths in directory trees. The list.files function can be used to locate files using regular expression matching, whereas the file.info function can be used to get descriptive information for each file. I used both in my function to find files in a directory tree through recursive searching of paths with a given extension name. The date each file was last modified is saved, along with the file length (as lines or number of characters) after reading the file with readLines. The output is a data frame for each file with the file name, type, length, cumulative length by file type, and date. The results can be easily plotted, as shown below.
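The core idea can be sketched with base R alone; count_lines below is a simplified stand-in for the actual file.lens (no omit_blank handling, character counting, or progress tracing):

```r
# simplified sketch of the line-counting approach in file.lens
count_lines <- function(root, exts, recursive = TRUE) {

  # regex matching any of the requested extensions at the end of the name
  patt <- paste0('\\.(', paste(exts, collapse = '|'), ')$')
  fls <- list.files(root, pattern = patt, recursive = recursive,
                    full.names = TRUE, ignore.case = TRUE)

  out <- data.frame(
    fl = basename(fls),
    Length = sapply(fls, function(x) length(readLines(x, warn = FALSE))),
    Date = as.Date(file.info(fls)$mtime),
    Type = tolower(tools::file_ext(fls)),
    row.names = NULL
  )

  # order by date and add cumulative length within each file type
  out <- out[order(out$Date), ]
  out$cum_len <- ave(out$Length, out$Type, FUN = cumsum)
  out

}
```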

The function, obtained here, has the following arguments:

root: Character string of root directory to search
file_typs: Character vector of file types to search; file types must be compatible with readLines
omit_blank: Logical indicating if blank lines are counted, default TRUE
recursive: Logical indicating if all directories within root are searched, default TRUE
lns: Logical indicating if lines in each file are counted, default TRUE; otherwise characters are counted
trace: Logical for monitoring progress, default TRUE

Here’s an example using the function to search a local path on my computer.

# import function from Github
library(devtools)

# https://gist.github.com/fawda123/20688ace86604259de4e
source_gist('20688ace86604259de4e')

# path to search and file types
root <- 'C:/Projects'
file_typs <- c('r','py', 'tex', 'rnw')

# get data from function
my_fls <- file.lens(root, file_typs)

head(my_fls)

##                               fl Length       Date cum_len Type
## 1                 buffer loop.py     29 2010-08-12      29   py
## 2                  erase loop.py     22 2010-08-12      51   py
## 3 remove selection and rename.py     26 2010-08-16      77   py
## 4              composite loop.py     32 2010-08-18     109   py
## 5                extract loop.py     61 2010-08-18     170   py
## 6         classification loop.py     32 2010-08-19     202   py


In this example, I’ve searched for R, Python, LaTeX, and Sweave files in the directory ‘C:/Projects/’. The output from the function is shown using the head command.

Here’s some code for plotting the data. I’ve created four plots with ggplot and combined them using grid.arrange from the gridExtra package. The first plot shows the number of files by type, the second shows file length by date and type, the third shows a frequency distribution of file lengths by type, and the fourth shows a cumulative distribution of file lengths by type and date.

# plots
library(ggplot2)
library(gridExtra)

# number of files by type
p1 <- ggplot(my_fls, aes(x = Type, fill = Type)) +
geom_bar() +
ylab('Number of files') +
theme_bw()

# file length by type and date
p2 <- ggplot(my_fls, aes(x = Date, y = Length, group = Type,
colour = Type)) +
geom_line() +
ylab('File length') +
geom_point() +
theme_bw() +
theme(legend.position = 'none')

# density of file length by type
p3 <- ggplot(my_fls, aes(x = Length, y = ..scaled.., group = Type,
colour = Type, fill = Type)) +
geom_density(alpha = 0.25, size = 1) +
xlab('File length') +
ylab('Density (scaled)') +
theme_bw() +
theme(legend.position = 'none')

# cumulative length by file type and date
p4 <- ggplot(my_fls, aes(x = Date, y = cum_len, group = Type,
colour = Type)) +
geom_line() +
geom_point() +
ylab('Cumulative file length') +
theme_bw() +
theme(legend.position = 'none')

# function for common legend
# https://github.com/hadley/ggplot2/wiki/Share-a-legend-between-two-ggplot2-graphs
g_legend<-function(a.gplot){
tmp <- ggplot_gtable(ggplot_build(a.gplot))
leg <- which(sapply(tmp$grobs, function(x) x$name) == "guide-box")
legend <- tmp$grobs[[leg]]
return(legend)}

# get common legend, remove from p1
mylegend <- g_legend(p1)
p1 <- p1 + theme(legend.position = 'none')

# final plot
grid.arrange(
arrangeGrob(p1, p2, p3, p4, ncol = 2),
mylegend, ncol = 2, widths = c(10,1))


Clearly, most of my work has been done in R, with most files being less than 200-300 lines. There seems to be a lull of activity in mid-2013 after I finished my dissertation, which is entirely expected. I was surprised to see that the Sweave (.rnw) and LaTeX files weren't longer until I remembered that paragraphs in these files are interpreted as single lines of text. I re-ran the function using characters as my unit of measurement.

# get file lengths by character
my_fls <- file.lens(root, file_typs, lns = F)

# re-run plot functions above


Now there are clear differences in lengths for the Sweave and LaTeX files, with the longest file topping out at 181256 characters. I know others might be curious to see how much code they've written, so feel free to use/modify the function as needed. These figures represent all of my work, fruitful or not, in six years of graduate school. It goes without saying that all of your code has to be in the root directory. The totals will obviously be underestimates if you have code elsewhere, such as online. The function could be modified for online sources but I think I'm done for now.

Cheers,

Marcus

# Process and observation uncertainty explained with R

Once upon a time I had grand ambitions of writing blog posts outlining all of the examples in The Ecological Detective.1 A few years ago I participated in a graduate seminar series where we went through many of the examples in this book. I am not a population biologist by trade, but many of the concepts were useful not only for helping me better understand core concepts of statistical modelling, but also for developing an appreciation of the limits of your data.
Part of this appreciation stems from understanding the sources and causes of uncertainty in estimates. Perhaps in the future I will focus more blog topics on other examples from The Ecological Detective, but for now I'd like to discuss an example that has recently been of interest in my own research.

Over the past few months I have been working with some colleagues to evaluate the statistical power of biological indicators. These analyses are meant to describe the certainty within which a given level of change in an indicator is expected over a period of time. For example, what is the likelihood of detecting a 50% decline over twenty years given that our estimates of the indicator are influenced by uncertainty? We need reliable estimates of the uncertainty to answer these types of questions, and it is often useful to categorize sources of variation.

Hilborn and Mangel describe process and observation uncertainty as two primary categories of noise in a data measurement. Process uncertainty describes noise related to actual or real variation in a measurement that a model does not describe. For example, a model might describe the response of an indicator to changing pollutant loads, but we lack an idea of seasonal variation that occurs naturally over time. Observation uncertainty, often called sampling uncertainty, describes our ability to obtain a precise data measurement. This is a common source of uncertainty in ecological data, where the precision of repeated surveys may be affected by several factors, such as the skill level of the field crew, the precision of sampling devices, and the location of survey points. The effects of process and observation uncertainty on data measurements are additive, such that the magnitude of both can be separately estimated. The example I'll focus on is described on pages 59–61 (the theory) and 90–92 (an example with pseudocode) in The Ecological Detective.
This example describes an approach for conceptualizing the effects of uncertainty on model estimates, as opposed to methods for quantifying uncertainty from actual data. For the most part, this blog is an exact replica of the example, although I have tried to include some additional explanation where I had difficulty understanding some of the concepts. Of course, I'll also include R code since that's the primary motivation for my blog.

We start with a basic population model that describes population change over time. This is a theoretical model that, in practice, should describe some actual population, but is very simple for the purpose of learning about sources of uncertainty. From this basic model, we simulate sources of uncertainty to get an idea of their exact influence on our data measurements. The basic model without imposing uncertainty is as follows:

$\displaystyle N_{t+1}=sN_t + b_t$

where the population at time $t + 1$ is equal to the population at time $t$ multiplied by the survival probability $s$ plus the number of births at time $t$. We call this the process model because it's meant to describe an actual population process, i.e., population growth over time given birth and survival. We can easily create a function to model this process over a time series. As in the book example, we'll use a starting population of fifty individuals, add 20 individuals from births at each time step, and use an 80% survival rate.

# simple pop model
proc_mod <- function(N_0 = 50, b = 20, s = 0.8, t = 50){

N_out <- numeric(length = t + 1)
N_out[1] <- N_0

for(step in 1:t)
N_out[step + 1] <- s*N_out[step] + b

out <- data.frame(steps = 1:t, Pop = N_out[-1])

return(out)

}

est <- proc_mod()


The model is pretty straightforward. A for loop is used to estimate the population for time steps one to fifty with a starting population size of fifty at time zero. Each time step multiplies the population estimate from the previous time step by the survival rate and adds twenty new individuals.
You may notice that the function could easily be vectorized, but I've used a for loop to account for sources of uncertainty that are dependent on previous values in the time series. This will be explained below, but for now the model only describes the actual process. The results are assigned to the est object and then plotted.

library(ggplot2)
ggplot(est, aes(steps, Pop)) +
geom_point() +
theme_bw() +
ggtitle('N_0 = 50, s = 0.8, b = 20\n')


In a world with absolute certainty, an actual population would follow this trend if our model accurately described the birth and survivorship rates. Suppose our model provided an incomplete description of the population. Hilborn and Mangel (p. 59) suggest that birth rates, for example, may fluctuate from year to year. This fluctuation is not captured by our model and represents a source of process uncertainty, or uncertainty caused by an incomplete description of the process. We can assume that the effect of this process uncertainty is additive to the population estimate at each time step:

$\displaystyle N_{t+1}=sN_t + b_t + W_t$

where the model remains the same but we've included an additional term, $W_t$, to account for uncertainty. This uncertainty is random in the sense that we don't know exactly how it will influence our estimate, but we can describe it as a random variable from a known distribution. Suppose we expect random variation in birth rates for each time step to be normally distributed with mean zero and a given standard deviation. Population size at $t+1$ is the survivorship of the population at time $t$ plus the births accounting for random variation. An important point is that the random variation is additive throughout the time series. That is, if more births were observed for a given year due to random chance, the population would be larger the next year such that additional random variation at $t+1$ is added to the larger population.
This is why a for loop is used: we can't simulate uncertainty by adding a random vector all at once to the population estimates. The original model is modified to include process uncertainty.

# simple pop model with process uncertainty
proc_mod2 <- function(N_0 = 50, b = 20, s = 0.8, t = 50, sig_w = 5){

N_out <- numeric(length = t + 1)
N_out[1] <- N_0

sig_w <- rnorm(t, 0, sig_w)

for(step in 1:t)
N_out[step + 1] <- s*N_out[step] + b + sig_w[step]

out <- data.frame(steps = 1:t, Pop = N_out[-1])

return(out)

}

set.seed(2)
est2 <- proc_mod2()

# plot the estimates
ggt <- paste0('N_0 = 50, s = 0.8, b = 20, sig_w = ',
formals(proc_mod2)$sig_w,'\n')
ggplot(est2, aes(steps, Pop)) +
geom_point() +
theme_bw() +
ggtitle(ggt)


We see considerable variation from the original model now that we’ve included process uncertainty. Note that the process uncertainty in each estimate is dependent on the estimate prior, as described above. This creates uncertainty that, although random, follows a pattern throughout the time series. We can look at an auto-correlation plot of the new estimates minus the actual population values to get an idea of this pattern. Observations that are closer to one another in the time series are correlated, as expected.
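This check can be reproduced in a few self-contained lines using the same parameters as above (acf is from the stats package in base R):

```r
# deterministic and process-uncertainty trajectories with shared parameters
set.seed(2)
t_stp <- 50; s <- 0.8; b <- 20
N_det <- N_proc <- numeric(t_stp + 1)
N_det[1] <- N_proc[1] <- 50
w <- rnorm(t_stp, 0, 5)

for(step in 1:t_stp){
  N_det[step + 1]  <- s * N_det[step] + b
  N_proc[step + 1] <- s * N_proc[step] + b + w[step]
}

# autocorrelation of the residuals; nearby time steps are correlated
acf(N_proc[-1] - N_det[-1], main = 'Process uncertainty residuals')
```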

Adding observation uncertainty is simpler in that the effect is not propagated throughout the time steps. Rather, the uncertainty is added after the time series is generated. This makes intuitive sense because the observation uncertainty describes sampling error. For example, if we have an instrument malfunction one year that creates an unreliable estimate, we can fix the instrument to get a more accurate reading the next year. However, suppose we have a new field crew the following year that contributes to uncertainty (e.g., wrong species identification). This uncertainty is not related to the year prior. Computationally, the model is as follows:

$\displaystyle N_{t+1}=sN_t + b_t$

$\displaystyle N^{*} = N + V$

where the model is identical to the deterministic model with the addition of observation uncertainty $V$ after the time series is calculated for fifty time steps. $N$ is the population estimate for the whole time series and $N^{*}$ is the estimate including observation uncertainty. We can simulate observation uncertainty using a random normal variable with assumed standard deviation as we did with process uncertainty, e.g., $V$ has length fifty with mean zero and standard deviation equal to five.

# model with observation uncertainty
proc_mod3 <- function(N_0 = 50, b = 20, s = 0.8, t = 50, sig_v = 5){

N_out <- numeric(length = t + 1)
N_out[1] <- N_0

sig_v <- rnorm(t, 0, sig_v)

for(step in 1:t)
N_out[step + 1] <- s*N_out[step] + b

N_out <- N_out + c(NA,sig_v)

out <- data.frame(steps = 1:t, Pop = N_out[-1])

return(out)

}

# get estimates
set.seed(3)
est3 <- proc_mod3()

# plot
ggt <- paste0('N_0 = 50, s = 0.8, b = 20, sig_v = ',
formals(proc_mod3)$sig_v,'\n')

ggplot(est3, aes(steps, Pop)) +
geom_point() +
theme_bw() +
ggtitle(ggt)


We can confirm that the observations are not correlated between the time steps, unlike the model with process uncertainty. Now we can create a model that includes both process and observation uncertainty by combining the above functions. The function is slightly tweaked to return a data frame with all estimates: process model only, process model with process uncertainty, process model with observation uncertainty, and process model with both process and observation uncertainty.

# combined function
proc_mod_all <- function(N_0 = 50, b = 20, s = 0.8, t = 50,
sig_w = 5, sig_v = 5){

N_out <- matrix(NA, ncol = 4, nrow = t + 1)
N_out[1,] <- N_0

sig_w <- rnorm(t, 0, sig_w)
sig_v <- rnorm(t, 0, sig_v)

for(step in 1:t){
N_out[step + 1, 1] <- s*N_out[step, 1] + b
N_out[step + 1, 2] <- s*N_out[step, 2] + b + sig_w[step]
}

N_out[1:t + 1, 3] <- N_out[1:t + 1, 1] + sig_v
N_out[1:t + 1, 4] <- N_out[1:t + 1, 2] + sig_v

out <- data.frame(1:t, N_out[-1,])
names(out) <- c('steps', 'mod_act', 'mod_proc', 'mod_obs', 'mod_all')

return(out)

}

# create data
set.seed(2)
est_all <- proc_mod_all()

# plot the data
library(reshape2)
to_plo <- melt(est_all, id.var = 'steps')

# re-assign factor labels for plotting
to_plo$variable <- factor(to_plo$variable, levels = levels(to_plo$variable),
labels = c('Actual','Pro','Obs','Pro + Obs'))

ggplot(to_plo, aes(steps, value)) +
geom_point() +
facet_wrap(~variable) +
ylab('Pop. estimate') +
theme_bw()


On the surface, the separate effects of process and observation uncertainty on the estimates are similar, whereas adding both maximizes the overall uncertainty. We can quantify the extent to which the sources of uncertainty influence the estimates by comparing observations at time $t$ to observations at $t - 1$. In other words, we can quantify the variance for each model by regressing observations separated by one time lag. We would expect the model that includes both sources of uncertainty to have the highest variance.

# comparison of mods
# create vectors for pop estimates at time t (t_1) and t - 1 (t_0)
t_1 <- est_all[2:nrow(est_all),-1]
t_1 <- melt(t_1, value.name = 'val_1')
t_0 <- est_all[1:(nrow(est_all)-1),-1]
t_0 <- melt(t_0, value.name = 'val_0')

#combine for plotting
to_plo2 <- cbind(t_0,t_1[,!names(t_1) %in% 'variable',drop = F])
head(to_plo2)
##   variable   val_0    val_1
## 1  mod_act 60.0000 68.00000
## 2  mod_act 68.0000 74.40000
## 3  mod_act 74.4000 79.52000
## 4  mod_act 79.5200 83.61600
## 5  mod_act 83.6160 86.89280
## 6  mod_act 86.8928 89.51424

# re-assign factor labels for plotting
to_plo2$variable <- factor(to_plo2$variable, levels = levels(to_plo2$variable),
labels = c('Actual','Pro','Obs','Pro + Obs'))

# we don't want to plot the first process model
sub_dat <- to_plo2$variable == 'Actual'
ggplot(to_plo2[!sub_dat,], aes(val_0, val_1)) +
geom_point() +
facet_wrap(~variable) +
theme_bw() +
scale_y_continuous('Population size at time t') +
scale_x_continuous('Population size at time t - 1') +
geom_abline(slope = 0.8, intercept = 20)


A tabular comparison of the regressions for each plot provides a quantitative measure of the effect of uncertainty on the model estimates.

library(stargazer)
mods <- lapply(
split(to_plo2, to_plo2$variable),
function(x) lm(val_1 ~ val_0, data = x)
)
stargazer(mods, omit.stat = 'f',
title = 'Regression of population estimates at time $t$ against time $t - 1$ for each process model. Each model except the first simulates different sources of uncertainty.',
column.labels = c('Actual','Pro','Obs','Pro + Obs'),
model.numbers = F)


The table tells us exactly what we would expect. Based on the r-squared values, adding more uncertainty decreases the explained variance of the models. Also note the changes in the parameter estimates. The actual model provides slope and intercept estimates identical to those we specified at the beginning ($s = 0.8$ and $b = 20$). Adding more uncertainty to each model contributes to uncertainty in the parameter estimates, such that survivorship is under-estimated and birth contributions are over-estimated.

It’s nice to use an arbitrary model where we can simulate the effects of uncertainty, unlike situations with actual data where the sources of uncertainty are not readily apparent. This example from The Ecological Detective is useful for appreciating the effects of uncertainty on parameter estimates in simple process models. I refer the reader to the actual text for more discussion regarding the implications of these analyses. Also, check out Ben Bolker’s text2 (chapter 11) for more discussion with R examples.

Cheers,

Marcus

1Hilborn R, Mangel M. 1997. The Ecological Detective: Confronting Models With Data. Monographs in Population Biology 28. Princeton University Press, Princeton, New Jersey. 315 pages.

2Bolker B. 2007. Ecological Models and Data in R. Princeton University Press, Princeton, New Jersey. 508 pages.

# Brief introduction on Sweave and Knitr for reproducible research

A few weeks ago I gave a presentation on using Sweave and Knitr under the guise of promoting reproducible research. I humbly offer this presentation to the blog with full knowledge that there are already loads of tutorials available online.
This presentation is $\LaTeX$ specific and slightly biased towards Windows OS, so it probably has limited relevance if you are interested in other methods. Anyhow, I hope this is useful to some of you.

Cheers,

Marcus

\documentclass[xcolor=svgnames]{beamer}
%\documentclass[xcolor=svgnames,handout]{beamer}
\usetheme{Boadilla}
\usecolortheme[named=Sienna]{structure}
\usepackage{graphicx}
\usepackage[final]{animate}
%\usepackage[colorlinks=true,urlcolor=blue,citecolor=blue,linkcolor=blue]{hyperref}
\usepackage{breqn}
\usepackage{xcolor}
\usepackage{booktabs}
\usepackage{verbatim}
\usepackage{tikz}
\usetikzlibrary{shadows,arrows,positioning}
\usepackage[noae]{Sweave}
\definecolor{links}{HTML}{2A1B81}
\hypersetup{colorlinks,linkcolor=links,urlcolor=links}
\usepackage{pgfpages}
%\pgfpagesuselayout{4 on 1}[letterpaper, border shrink = 5mm, landscape]

\tikzstyle{block} = [rectangle, draw, text width=7em, text centered,
rounded corners, minimum height=3em, minimum width=7em,
top color = white, bottom color=brown!30, drop shadow]

\newcommand{\ShowSexpr}[1]{\texttt{{\char\\}Sexpr\{#1\}}}

\begin{document}

\SweaveOpts{concordance=TRUE}

\title[Nuts and bolts of Sweave/Knitr]{The nuts and bolts of Sweave/Knitr for reproducible research with \LaTeX}
\author[M. Beck]{Marcus W. Beck}
\institute[USEPA NHEERL]{ORISE Post-doc Fellow\\
USEPA NHEERL Gulf Ecology Division, Gulf Breeze, FL\\
Email: \href{mailto:beck.marcus@epa.gov}{beck.marcus@epa.gov}, Phone: 850 934 2480}
\date{January 15, 2014}

%%%%%%
\begin{frame}
\vspace{-0.3in}
\titlepage
\end{frame}

%%%%%%
\begin{frame}{Reproducible research}
\onslide<+->
In its most general sense... the ability to reproduce results from an experiment or analysis conducted by another.\\~\\
\onslide<+->
From Wikipedia... `The ultimate product is the \alert{paper along with the full computational environment} used to produce the results in the paper such as the code, data, etc.
that can be \alert{used to reproduce the results and create new work} based on the research.'\\~\\
\onslide<+->
Concept is strongly based on the idea of \alert{literate programming} such that the logic of the analysis is clearly represented in the final product by combining computer code/programs with ordinary human language [Knuth, 1992].
\end{frame}

%%%%%%
\begin{frame}{Non-reproducible research}
\begin{center}
\begin{tikzpicture}[node distance=2.5cm, auto, >=stealth]
\onslide<2->{
\node[block] (a) {1. Gather data};}
\onslide<3->{
\node[block] (b) [right of=a, node distance=4.2cm] {2. Analyze data};
\draw[->] (a) -- (b);}
\onslide<4->{
\node[block] (c) [right of=b, node distance=4.2cm] {3. Report results};
\draw[->] (b) -- (c);}
% \onslide<5->{
% \node [right of=a, node distance=2.1cm] {{X}};
% \node [right of=b, node distance=2.1cm] {{X}};}
\end{tikzpicture}
\end{center}
\vspace{-0.5cm}
\begin{columns}[t]
\onslide<2->{
\begin{column}{0.33\textwidth}
\begin{itemize}
\item Begins with general question or research objectives
\item Data collected in raw format (hard copy) converted to digital (Excel spreadsheet)
\end{itemize}
\end{column}}
\onslide<3->{
\begin{column}{0.33\textwidth}
\begin{itemize}
\item Import data into stats program or analyze directly in Excel
\item Create figures/tables directly in stats program
\item Save relevant output
\end{itemize}
\end{column}}
\onslide<4->{
\begin{column}{0.33\textwidth}
\begin{itemize}
\item Create research report using Word or other software
\item Manually insert results into report
\item Change final report by hand if methods/analysis altered
\end{itemize}
\end{column}}
\end{columns}
\end{frame}

%%%%%%
\begin{frame}{Reproducible research}
\begin{center}
\begin{tikzpicture}[node distance=2.5cm, auto, >=stealth]
\onslide<1->{
\node[block] (a) {1. Gather data};}
\onslide<1->{
\node[block] (b) [right of=a, node distance=4.2cm] {2.
Analyze data};
\draw[<->] (a) -- (b);}
\onslide<1->{
\node[block] (c) [right of=b, node distance=4.2cm] {3. Report results};
\draw[<->] (b) -- (c);}
\end{tikzpicture}
\end{center}
\vspace{-0.5cm}
\begin{columns}[t]
\onslide<1->{
\begin{column}{0.33\textwidth}
\begin{itemize}
\item Begins with general question or research objectives
\item Data collected in raw format (hard copy) converted to digital (\alert{text file})
\end{itemize}
\end{column}}
\onslide<1->{
\begin{column}{0.33\textwidth}
\begin{itemize}
\item Create \alert{integrated script} for importing data (data path is known)
\item Create figures/tables directly in stats program
\item \alert{No need to export} (reproduced on the fly)
\end{itemize}
\end{column}}
\onslide<1->{
\begin{column}{0.33\textwidth}
\begin{itemize}
\item Create research report using RR software
\item \alert{Automatically include results} into report
\item \alert{Change final report automatically} if methods/analysis altered
\end{itemize}
\end{column}}
\end{columns}
\end{frame}

%%%%%%
\begin{frame}{Reproducible research in R}
Easily adopted using RStudio [\href{http://www.rstudio.com/}{http://www.rstudio.com/}]\\~\\
Also possible w/ Tinn-R or via command prompt but not as intuitive\\~\\
Requires a \LaTeX\ distribution system - use MikTex for Windows [\href{http://miktex.org/}{http://miktex.org/}]\\~\\
\onslide<2->{
Essentially a \LaTeX\ document that incorporates R code... \\~\\
Uses Sweave (or Knitr) to convert .Rnw file to .tex file, then \LaTeX\ to create pdf\\~\\
Sweave comes with \texttt{utils} package, may have to tell R where it is \\~\\
}
\end{frame}

%%%%%%
\begin{frame}{Reproducible research in R}
Use same procedure for compiling a \LaTeX\ document with one additional step
\begin{center}
\begin{tikzpicture}[node distance=2.5cm, auto, >=stealth]
\onslide<2->{
\node[block] (a) {1. myfile.Rnw};}
\onslide<3->{
\node[block] (b) [right of=a, node distance=4.2cm] {2.
myfile.tex};
\draw[->] (a) -- (b);
\node [right of=a, above=0.5cm, node distance=2.1cm] {Sweave};}
\onslide<4->{
\node[block] (c) [right of=b, node distance=4.2cm] {3. myfile.pdf};
\draw[->] (b) -- (c);
\node [right of=b, above=0.5cm, node distance=2.1cm] {pdfLatex};}
\end{tikzpicture}
\end{center}
\vspace{-0.5cm}
\begin{columns}[t]
\onslide<2->{
\begin{column}{0.33\textwidth}
\begin{itemize}
\item A .tex file but with .Rnw extension
\item Includes R code as `chunks' or inline expressions
\end{itemize}
\end{column}}
\onslide<3->{
\begin{column}{0.33\textwidth}
\begin{itemize}
\item .Rnw file is converted to a .tex file using Sweave
\item .tex file contains output from R, no raw R code
\end{itemize}
\end{column}}
\onslide<4->{
\begin{column}{0.33\textwidth}
\begin{itemize}
\item .tex file converted to pdf (or other output) for final format
\item Include biblio with bibtex
\end{itemize}
\end{column}}
\end{columns}
\end{frame}

%%%%%%
\begin{frame}[containsverbatim]{Reproducible research in R}
\label{sweaveref}
\begin{block}{.Rnw file}
\begin{verbatim}
\documentclass{article}
\usepackage{Sweave}
\begin{document}

Here's some R code:

\Sexpr{'<<eval=true,echo=true>>='}
options(width=60)
set.seed(2)
rnorm(10)
\Sexpr{'@'}

\end{document}
\end{verbatim}
\end{block}
\end{frame}

%%%%%%
\begin{frame}[containsverbatim,shrink]{Reproducible research in R}
\begin{block}{.tex file}
\begin{verbatim}
\documentclass{article}
\usepackage{Sweave}
\begin{document}

Here's some R code:

\begin{Schunk}
\begin{Sinput}
> options(width=60)
> set.seed(2)
> rnorm(10)
\end{Sinput}
\begin{Soutput}
 [1] -0.89691455  0.18484918  1.58784533 -1.13037567
 [5] -0.08025176  0.13242028  0.70795473 -0.23969802
 [9]  1.98447394 -0.13878701
\end{Soutput}
\end{Schunk}

\end{document}
\end{verbatim}
\end{block}
\end{frame}

%%%%%%
\begin{frame}{Reproducible research in R}
The final product:\\~\\
\centerline{\includegraphics{ex1_input.pdf}}
\end{frame}

%%%%%%
\begin{frame}[fragile]{Sweave - code chunks}
\onslide<+->
R code is
entered in the \LaTeX\ document using `code chunks'
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<>>='}
\Sexpr{'@'}
\end{verbatim}
\end{block}
Any text within the code chunk is interpreted as R code\\~\\
Arguments for the code chunk are entered within \verb|\Sexpr{'<<here>>'}|\\~\\
\onslide<+->
\begin{itemize}
\item{\texttt{eval}: evaluate code, default \texttt{T}}
\item{\texttt{echo}: return source code, default \texttt{T}}
\item{\texttt{results}: format of output (chr string), default is `include' (also `tex' for tables or `hide' to suppress)}
\item{\texttt{fig}: for creating figures, default \texttt{F}}
\end{itemize}
\end{frame}

%%%%%%
\begin{frame}[fragile]{Sweave - code chunks}
Changing the default arguments for the code chunk:
\begin{columns}[t]
\begin{column}{0.45\textwidth}
\onslide<+->
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<>>='}
2+2
\Sexpr{'@'}
\end{verbatim}
\end{block}
<<>>=
2+2
@
\onslide<+->
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<eval=F,echo=F>>='}
2+2
\Sexpr{'@'}
\end{verbatim}
\end{block}
Returns nothing...
\end{column}
\begin{column}{0.45\textwidth}
\onslide<+->
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<eval=F>>='}
2+2
\Sexpr{'@'}
\end{verbatim}
\end{block}
<<eval=F>>=
2+2
@
\onslide<+->
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<echo=F>>='}
2+2
\Sexpr{'@'}
\end{verbatim}
\end{block}
<<echo=F>>=
2+2
@
\end{column}
\end{columns}
\end{frame}

%%%%%%
\begin{frame}[t,fragile]{Sweave - figures}
\onslide<1->
Sweave makes it easy to include figures in your document
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<myfig,fig=T,echo=F,include=T,height=3>>='}
set.seed(2)
hist(rnorm(100))
\Sexpr{'@'}
\end{verbatim}
\end{block}
\onslide<2->
<<myfig,fig=T,echo=F,include=T,height=3>>=
set.seed(2)
hist(rnorm(100))
@
\end{frame}

%%%%%%
\begin{frame}[t,fragile]{Sweave - figures}
Sweave makes it easy to include figures in your document
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<myfig,fig=T,echo=F,include=T,height=3>>='}
set.seed(2)
hist(rnorm(100))
\Sexpr{'@'}
\end{verbatim}
\end{block}
\vspace{\baselineskip}
Relevant code options for figures:
\begin{itemize}
\item{The chunk name is used to name the figure, myfile-myfig.pdf}
\item{\texttt{fig}: Lets R know the output is a figure}
\item{\texttt{echo}: Use \texttt{F} to suppress figure code}
\item{\texttt{include}: Should the figure be automatically included in output}
\item{\texttt{height}: (and \texttt{width}) Set dimensions of figure in inches}
\end{itemize}
\end{frame}

%%%%%%
\begin{frame}[t,fragile]{Sweave - figures}
An alternative approach for creating a figure
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<myfig,fig=T,echo=F,include=F,height=3>>='}
set.seed(2)
hist(rnorm(100))
\Sexpr{'@'}
\includegraphics{rnw_name-myfig.pdf}
\end{verbatim}
\end{block}
\includegraphics{Sweave_intro-myfig.pdf}
\end{frame}

%%%%%%
\begin{frame}[t,fragile]{Sweave - tables}
\onslide<1->
Really easy to create tables
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<results=tex,echo=F>>='}
library(stargazer)
data(iris)
stargazer(iris,title='Summary statistics for Iris
data')
\Sexpr{'@'}
\end{verbatim}
\end{block}
\onslide<2->
<<results=tex,echo=F>>=
data(iris)
library(stargazer)
stargazer(iris,title='Summary statistics for Iris data')
@
\end{frame}

%%%%%%
\begin{frame}[t,fragile]{Sweave - tables}
Really easy to create tables
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<results=tex,echo=F>>='}
library(stargazer)
data(iris)
stargazer(iris,title='Summary statistics for Iris data')
\Sexpr{'@'}
\end{verbatim}
\end{block}
\vspace{\baselineskip}
\texttt{results} option should be set to `tex' (and \texttt{echo=F})\\~\\
Several packages are available to convert R output to \LaTeX\ table format
\begin{itemize}
\item{xtable: most general package}
\item{hmisc: similar to xtable but can handle specific R model objects}
\item{stargazer: fairly effortless conversion of R model objects to tables}
\end{itemize}
\end{frame}

%%%%%%
\begin{frame}[fragile]{Sweave - expressions}
\onslide<1->
All objects within a code chunk are saved in the workspace each time a document is compiled (unless \texttt{eval=F})\\~\\
This allows the information saved in the workspace to be reproduced in the final document as inline text, via \alert{expressions}\\~\\
\onslide<2->
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<echo=F>>='}
data(iris)
dat<-iris
\Sexpr{'@'}
\end{verbatim}
Mean sepal length was \ShowSexpr{mean(dat\$Sepal.Length)}.
\end{block}
\onslide<3->
<<echo=F>>=
data(iris)
dat<-iris
@
\vspace{\baselineskip}
Mean sepal length was \Sexpr{mean(dat$Sepal.Length)}.
\end{frame}

%%%%%%
\begin{frame}[fragile]{Sweave - expressions}
Change the global R options to change the default output\\~\\
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<echo=F>>='}
data(iris)
dat<-iris
options(digits=2)
\Sexpr{'@'}
\end{verbatim}
Mean sepal length was \ShowSexpr{format(mean(dat\$Sepal.Length))}.
\end{block}
<<echo=F>>=
data(iris)
dat<-iris
options(digits=2)
@
\vspace{\baselineskip}
Mean sepal length was \Sexpr{format(mean(dat$Sepal.Length))}.\\~\\
\end{frame}

%%%%%%
\begin{frame}{Sweave vs Knitr}
\onslide<1->
Does not automatically cache R data on compilation\\~\\
\alert{Knitr} is a useful alternative - similar to Sweave but with minor differences in args for code chunks, more flexible output\\~\\
\onslide<2->
\begin{columns}
\begin{column}{0.3\textwidth}
Must change default options in RStudio\\~\\
Knitr included with RStudio, otherwise download as package
\end{column}
\begin{column}{0.6\textwidth}
\centerline{\includegraphics[width=0.8\textwidth]{options_ex.png}}
\end{column}
\end{columns}
\end{frame}

%%%%%%
\begin{frame}[fragile]{Knitr}
\onslide<1->
Knitr can be used to cache code chunks\\~\\
Data are saved when a chunk is first evaluated, skipped on future compilations unless changed\\~\\
This allows quicker compilation of documents that import lots of data\\~\\
\begin{block}{}
\begin{verbatim}
\Sexpr{'<<mychunk, cache=TRUE, eval=FALSE>>='}
load(file='mydata.RData')
\Sexpr{'@'}
\end{verbatim}
\end{block}
\end{frame}

%%%%%%
\begin{frame}[containsverbatim,shrink]{Knitr}
\label{knitref}
\begin{block}{.Rnw file}
\begin{verbatim}
\documentclass{article}

\Sexpr{'<<setup, include=FALSE, cache=FALSE>>='}
library(knitr)

#set global chunk options
opts_chunk$set(fig.path='H:/docs/figs/', fig.align='center',
dev='pdf', dev.args=list(family='serif'), fig.pos='!ht')

options(width=60)
\Sexpr{'@'}

\begin{document}

Here's some R code:

\Sexpr{'<<eval=T, echo=T>>='}
set.seed(2)
rnorm(10)
\Sexpr{'@'}

\end{document}
\end{verbatim}
\end{block}

\end{frame}

%%%%%%
\begin{frame}{Knitr}
The final product:\\~\\
\centerline{\includegraphics[width=\textwidth]{knit_ex.pdf}}
\end{frame}

%%%%%%
\begin{frame}[containsverbatim,shrink]{Knitr}
Figures, tables, and expressions are largely the same as in Sweave\\~\\

\begin{block}{Figures}
\begin{verbatim}
\Sexpr{'<<myfig,echo=F>>='}
set.seed(2)
hist(rnorm(100))
\Sexpr{'@'}
\end{verbatim}
\end{block}
\vspace{\baselineskip}
\begin{block}{Tables}
\begin{verbatim}
\Sexpr{"<<mytable,results='asis',echo=F,message=F>>="}
library(stargazer)
data(iris)
stargazer(iris,title='Summary statistics for Iris data')
\Sexpr{'@'}
\end{verbatim}
\end{block}

\end{frame}

%%%%%%
\begin{frame}{A minimal working example}
\onslide<1->
Step by step guide to creating your first RR document\\~\\
\begin{enumerate}
\onslide<2->
\item Download and install \href{http://www.rstudio.com/}{RStudio}
\onslide<3->
\item Dowload and install \href{http://miktex.org/}{MikTeX} if using Windows
\onslide<4->
\item Create a unique folder for the document - This will be the working directory
\onslide<5->
\item Open a new Sweave file in RStudio
\onslide<6->
\item Copy and paste the file found on slide \ref{sweaveref} for Sweave or slide \ref{knitref} for Knitr into the new file (and select correct compile option)
\onslide<7->
\item Compile the pdf (runs Sweave/Knitr, then pdfLatex)\\~\\
\end{enumerate}
\onslide<7->
\centerline{\includegraphics[width=0.6\textwidth]{compile_ex.png}}
\end{frame}

%%%%%%
\begin{frame}{If things go wrong...}
\LaTeX\ Errors can be difficult to narrow down - check the log file\\~\\
Sweave/Knitr errors will be displayed on the console\\~\\
Other resources
\begin{itemize}
\item{`Reproducible Research with R and RStudio' by C. Gandrud, CRC Press}
\item{\LaTeX forum (like StackOverflow) \href{http://www.latex-community.org/forum/}{http://www.latex-community.org/forum/}}
\item Comprehensive Knitr guide \href{http://yihui.name/knitr/options}{http://yihui.name/knitr/options}
\item Sweave user manual \href{http://stat.ethz.ch/R-manual/R-devel/library/utils/doc/Sweave.pdf}{http://stat.ethz.ch/R-manual/R-devel/library/utils/doc/Sweave.pdf}
\item Intro to Sweave \href{http://www.math.ualberta.ca/~mlewis/links/the_joy_of_sweave_v1.pdf}{http://www.math.ualberta.ca/~mlewis/links/the_joy_of_sweave_v1.pdf}
\end{itemize}
\vspace{\baselineskip}
\end{frame}

\end{document}


# A brief foray into parallel processing with R

I’ve recently been dabbling with parallel processing in R and have found the foreach package to be a useful approach to increasing the efficiency of loops. To date, I haven’t had much of a need for these tools, but I’ve started working with large datasets that can be cumbersome to manage. My first introduction to parallel processing was somewhat intimidating since I am surprisingly naive about basic computer jargon – processors, CPUs, RAM, flux capacitors, etc. According to the CRAN task view, parallel processing became directly available in R beginning with version 2.14.0, and a quick look at the web page provides an astounding group of packages that explicitly or implicitly provide parallel processing utilities.

In my early days of programming I made liberal use of for loops for repetitive tasks. Not until much later did I realize that for loops are incredibly inefficient at processing data. This is common knowledge among programmers, but I was completely unaware of these issues given my background in the environmental sciences. I had always assumed that my hardware was more than sufficient for any data analysis needs, regardless of poor programming techniques. After a few watershed moments I learned the error of my ways and started adopting more efficient coding techniques, e.g., vectorizing, apply functions, etc., in addition to parallel processing.
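As a quick illustration of why vectorization beats an explicit loop, the same sum of squares can be computed both ways (a minimal sketch, not from the original post):

```r
x <- rnorm(1e5)

# loop: updates 'out' one element at a time in interpreted code
out <- 0
for(v in x) out <- out + v^2

# vectorized: a single call that runs in compiled code
out2 <- sum(x^2)

all.equal(out, out2) # TRUE, but the vectorized version is much faster
```

Timing either version with `Sys.time()` (as done below) makes the difference obvious as the vector grows.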

A couple months ago I started using the foreach package with for loops. To be honest, I think loops are unavoidable at times regardless of how efficient you are with programming. Two things struck me when I started using this package. First, I probably could have finished my dissertation about a year earlier had I been using parallel processing. Second, the functions are incredibly easy to use even if you don’t understand all of the nuances and jargon of computer speak. My intent in this blog is to describe how the foreach package can be used to quickly transform traditional for loops to allow parallel processing. Needless to say, numerous tutorials covering this topic can be found with a quick Google search. I hope my contribution helps those with little or no experience in parallel processing to adopt some of these incredibly useful tools.

I’ll use a trivial example of a for loop to illustrate repeated execution of a simple task. For 10 iterations, we are creating a normally-distributed random variable (1000000 samples), taking a summary, and appending the output to a list.

#number of iterations in the loop
iters<-10

#vector for appending output
ls<-vector('list',length=iters)

#start time
strt<-Sys.time()

#loop
for(i in 1:iters){

#counter
cat(i,'\n')

to.ls<-rnorm(1e6)
to.ls<-summary(to.ls)

#export
ls[[i]]<-to.ls

}

#end time
print(Sys.time()-strt)
# Time difference of 2.944168 secs


The code executes quickly so we don’t need to worry about computation time in this example. For fun, we can see how computation time increases if we increase the number of iterations. I’ve repeated the above code with an increasing number of iterations, 10 to 100 at intervals of 10.

#iterations to time
iters<-seq(10,100,by=10)

#output time vector for  iteration sets
times<-numeric(length(iters))

#loop over iteration sets
for(val in 1:length(iters)){

cat(val,' of ', length(iters),'\n')

to.iter<-iters[val]

#vector for appending output
ls<-vector('list',length=to.iter)

#start time
strt<-Sys.time()

#same for loop as before
for(i in 1:to.iter){

cat(i,'\n')

to.ls<-rnorm(1e6)
to.ls<-summary(to.ls)

#export
ls[[i]]<-to.ls

}

#end time
times[val]<-Sys.time()-strt

}

#plot the times
library(ggplot2)

to.plo<-data.frame(iters,times)
ggplot(to.plo,aes(x=iters,y=times)) +
geom_point() +
geom_smooth() +
theme_bw() +
scale_x_continuous('No. of loop iterations') +
scale_y_continuous ('Time in seconds')


The processing time increases linearly with the number of iterations. Again, processing time is not extensive for the above example. Suppose we wanted to run the example with ten thousand iterations. We can predict how long that would take based on the linear relationship between time and iterations.

#predict times
mod<-lm(times~iters)
predict(mod,newdata=data.frame(iters=1e4))/60
# 45.75964


This is all well and good if we want to wait around for 45 minutes. Running the loop in parallel would greatly decrease this time. I want to first illustrate the problem of running loops in sequence before I show how this can done using the foreach package. If the above code is run with 1e4 iterations, a quick look at the performance metrics in the task manager (Windows 7 OS) gives you an idea of how hard your computer is working to process the code. My machine has eight processors and you can see that only a fraction of them are working while the script is running.
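Rather than reading the processor count off the task manager, the parallel package (bundled with base R since 2.14.0) can report it directly. A minimal sketch; leaving one core free for the operating system is a common convention, not something from the original example:

```r
library(parallel)

# how many cores/logical processors are available
n_cores <- detectCores()
n_cores

# common convention: reserve one core for the OS
cl <- makeCluster(max(1, n_cores - 1))
stopCluster(cl)
```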

Running the code using foreach will make full use of the computer’s processors. Individual chunks of the loop are sent to each processor so that the entire process can be run in parallel rather than in sequence. That is, each processor gets a finite set of the total number of iterations, i.e., iterations 1–100 goes to processor one, iterations 101–200 go to processor two, etc. The output from each processor is then compiled after the iterations are completed. Here’s how to run the code with 1e4 iterations in parallel.

#import packages
library(foreach)
library(doParallel)

#number of iterations
iters<-1e4

#setup parallel backend to use 8 processors
cl<-makeCluster(8)
registerDoParallel(cl)

#start time
strt<-Sys.time()

#loop
ls<-foreach(icount(iters)) %dopar% {

to.ls<-rnorm(1e6)
to.ls<-summary(to.ls)
to.ls

}

print(Sys.time()-strt)
stopCluster(cl)

#Time difference of 10.00242 mins


Running the loop in parallel decreased the processing time more than four-fold. Although the loop generally looks the same as the sequential version, several parts of the code have changed. First, we are using the foreach function rather than for to define our loop. The syntax for specifying the iterator is slightly different with foreach as well, i.e., icount(iters) tells the function to repeat the loop a given number of times based on the value assigned to iters. Additionally, the convention %dopar% specifies that the code is to be processed in parallel if a backend has been registered (using %do% will run the loop sequentially). The functions makeCluster (from the parallel package, loaded with doParallel) and registerDoParallel are used to create the parallel backend. Another important issue is the method for recombining the data after the chunks are processed. By default, foreach will append the output to a list, which we’ve saved to an object. The default method for recombining output can be changed using the .combine argument. Also be aware that packages used in the evaluated expression must be included with the .packages argument.
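To illustrate the .combine argument with a toy example (shown here with %do% so the sketch runs without registering a backend; this is not from the original post):

```r
library(foreach)

# default: results are returned as a list
res_list <- foreach(i = 1:5) %do% mean(rnorm(100))

# .combine = 'c' collapses the results into a numeric vector instead
res_vec <- foreach(i = 1:5, .combine = 'c') %do% mean(rnorm(100))

class(res_list) # "list"
class(res_vec)  # "numeric"
```

Other useful values are `.combine = 'rbind'` for row-binding data frames or matrices, and `'+'` for summing results as they arrive.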

The processors should be working at full capacity if the loop is executed properly. Note the difference here compared to the first loop that was run in sequence.

A few other issues are worth noting when using the foreach package. These are mainly issues I’ve encountered and I’m sure others could contribute to this list. The foreach package does not work with all types of loops. I can’t say for certain the exact type of data that works best, but I have found that functions that take a long time when run individually are generally handled very well. For example, I chose the above example to use a large number (1e6) of observations with the rnorm function. Interestingly, decreasing the number of observations and increasing the number of iterations may cause the processors to not run at maximum efficiency (try rnorm(100) with 1e5 iterations). I also haven’t had much success running repeated models in parallel. The functions work but the processors never seem to reach max efficiency. The system statistics should cue you off as to whether or not the functions are working.

I also find it bothersome that monitoring progress is an issue with parallel loops. A simple call using cat to return the iteration in the console does not work with parallel loops. The most practical solution I’ve found is described here, which involves exporting information to a separate file that tells you how far the loop has progressed. Also, be very aware of your RAM when running processes in parallel. I’ve found that it’s incredibly easy to max out the memory, which not only causes the function to stop working correctly, but also makes your computer run like garbage. Finally, I’m a little concerned that I might be destroying my processors by running them at maximum capacity. The fan always runs at full blast, leading me to believe that critical meltdown is imminent. I’d be pleased to know if this is an issue or not.
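A minimal sketch of that file-based workaround, with a temp file standing in for whatever log file you choose (the two-worker cluster and file name here are arbitrary):

```r
library(foreach)
library(doParallel)

cl <- makeCluster(2)
registerDoParallel(cl)

# workers can't print to the console, but they can append to a shared file
logfile <- tempfile(fileext = '.txt')
res <- foreach(i = 1:10) %dopar% {
cat(i, '\n', file = logfile, append = TRUE)
sqrt(i)
}

stopCluster(cl)

# count the logged lines while the loop runs to track progress
length(readLines(logfile))
```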

That’s it for now. I have to give credit to this tutorial for a lot of the information in this post. There are many, many other approaches to parallel processing in R and I hope this post has been useful for describing a few of these simple tools.

Cheers,

Marcus

# Visualizing neural networks in R – update

In my last post I said I wasn’t going to write anymore about neural networks (i.e., multilayer feedforward perceptron, supervised ANN, etc.). That was a lie. I’ve received several requests to update the neural network plotting function described in the original post. As previously explained, R does not provide a lot of options for visualizing neural networks. The only option I know of is a plotting method for objects from the neuralnet package. This may be my opinion, but I think this plot leaves much to be desired (see below). Also, no plotting methods exist for neural networks created in other packages, i.e., nnet and RSNNS. These packages are the only ones listed on the CRAN task view, so I’ve updated my original plotting function to work with all three. Additionally, I’ve added a new option for plotting a raw weight vector to allow use with neural networks created elsewhere. This blog describes these changes, as well as some new arguments added to the original function.

As usual, I’ll simulate some data to use for creating the neural networks. The dataset contains eight input variables and two output variables. The final dataset is a data frame with all variables, as well as separate data frames for the input and output variables. I’ve retained separate datasets based on the syntax for each package.

library(clusterGeneration)
library(MASS) #provides mvrnorm, used below

seed.val<-2
set.seed(seed.val)

num.vars<-8
num.obs<-1000

#input variables
cov.mat<-genPositiveDefMat(num.vars,covMethod=c("unifcorrmat"))$Sigma
rand.vars<-mvrnorm(num.obs,rep(0,num.vars),Sigma=cov.mat)

#output variables
parms<-runif(num.vars,-10,10)
y1<-rand.vars %*% matrix(parms) + rnorm(num.obs,sd=20)
parms2<-runif(num.vars,-10,10)
y2<-rand.vars %*% matrix(parms2) + rnorm(num.obs,sd=20)

#final datasets
rand.vars<-data.frame(rand.vars)
resp<-data.frame(y1,y2)
names(resp)<-c('Y1','Y2')
dat.in<-data.frame(resp,rand.vars)


The various neural network packages are used to create separate models for plotting.

#nnet function from nnet package
library(nnet)
set.seed(seed.val)
mod1<-nnet(rand.vars,resp,data=dat.in,size=10,linout=T)

#neuralnet function from neuralnet package, notice use of only one response
library(neuralnet)
form.in<-as.formula('Y1~X1+X2+X3+X4+X5+X6+X7+X8')
set.seed(seed.val)
mod2<-neuralnet(form.in,data=dat.in,hidden=10)

#mlp function from RSNNS package
library(RSNNS)
set.seed(seed.val)
mod3<-mlp(rand.vars, resp, size=10,linOut=T)


I’ve noticed some differences between the functions that could lead to some confusion. For simplicity, the above code represents my interpretation of the most direct way to create a neural network in each package. Be very aware that direct comparison of results is not advised given that the default arguments differ between the packages. A few key differences are as follows, although many others should be noted. First, the functions differ in the methods for passing the primary input variables. The nnet function can take separate (or combined) x and y inputs as data frames or as a formula, the neuralnet function can only use a formula as input, and the mlp function can only take a data frame as combined or separate variables as input. As far as I know, the neuralnet function is not capable of modelling multiple response variables, unless the response is a categorical variable that uses one node for each outcome.
Additionally, the default output for the neuralnet function is linear, whereas the opposite is true for the other two functions. Specifics aside, here’s how to use the updated plot function. Note that the same syntax is used to plot each model.

#import the function from Github
library(devtools)
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')

#plot each model
plot.nnet(mod1)
plot.nnet(mod2)
plot.nnet(mod3)


The neural networks for each model are shown above. Note that only one response variable is shown for the second plot. Also, neural networks created using mlp do not show bias layers, causing a warning to be returned. The documentation about bias layers for this function is lacking, although I have noticed that the model object returned by mlp does include information about ‘unitBias’ (see the output from mod3$snnsObject$getUnitDefinitions()). I wasn’t sure what this was, so I excluded it from the plot. Bias layers aren’t all that informative anyway, since they are analogous to intercept terms in a regression model. Finally, the default variable labels for the mlp plot differ from the other two. I could not find any reference to the original variable names in the mlp object, so generic names returned by the function are used.

I have also added five new arguments to the function. These include options to remove bias layers, remove variable labels, supply your own variable labels, and include the network architecture if using weights directly as input. The new arguments are included in the list below.
mod.in: neural network object or numeric vector of weights; if a model object, must be from the nnet, mlp, or neuralnet functions
nid: logical value indicating if neural interpretation diagram is plotted, default T
all.out: character string indicating names of response variables for which connections are plotted, default all
all.in: character string indicating names of input variables for which connections are plotted, default all
bias: logical value indicating if bias nodes and connections are plotted, not applicable for networks from mlp function, default T
wts.only: logical value indicating if connection weights are returned rather than a plot, default F
rel.rsc: numeric value indicating maximum width of connection lines, default 5
circle.cex: numeric value indicating size of nodes, passed to cex argument, default 5
node.labs: logical value indicating if labels are plotted directly on nodes, default T
var.labs: logical value indicating if variable names are plotted next to nodes, default T
x.lab: character string indicating names for input variables, default from model object
y.lab: character string indicating names for output variables, default from model object
line.stag: numeric value that specifies distance of connection weights from nodes
struct: numeric value of length three indicating network architecture (no. nodes for input, hidden, output), required only if mod.in is a numeric vector
cex.val: numeric value indicating size of text labels, default 1
alpha.val: numeric value (0-1) indicating transparency of connections, default 1
circle.col: character string indicating color of nodes, default 'lightblue', or two-element list with first element indicating color of input nodes and second indicating color of remaining nodes
pos.col: character string indicating color of positive connection weights, default 'black'
neg.col: character string indicating color of negative connection weights, default 'grey'
max.sp: logical value indicating if space between nodes in each layer is maximized, default F
...: additional arguments passed to generic plot function

The plotting function can also now be used with an arbitrary weight vector, rather than a specific model object. The struct argument must also be included if this option is used. I thought the easiest way to use the plotting function with your own weights was to have the input weights as a numeric vector, including bias layers. I've shown how this can be done using the weights directly from mod1 for simplicity.

wts.in<-mod1$wts
struct<-mod1$n
plot.nnet(wts.in,struct=struct)

Note that wts.in is a numeric vector with length equal to that expected for the given architecture (i.e., for an 8-10-2 network, 100 connection weights plus 12 bias weights). The plot should look the same as the plot for the neural network from nnet. The weights in the input vector need to be in a specific order for correct plotting. I realize this is not clear from looking directly at wts.in, but this was the simplest approach I could think of. The weight vector shows the weights for each hidden node in sequence, starting with the bias input for each node, then the weights for each output node in sequence, starting with the bias input for each output node. Note that the bias layer has to be included even if the network was not created with biases. If this is the case, simply input a random number where the bias values should go and use the argument bias=F. I'll show the correct order of the weights using an example with plot.nn from the neuralnet package, since the weights are included directly on the plot. If we pretend that the above figure wasn't created in R, we would input the mod.in argument for the updated plotting function as follows. Also note that struct must be included if using this approach.

mod.in<-c(13.12,1.49,0.16,-0.11,-0.19,-0.16,0.56,-0.52,0.81)
struct<-c(2,2,1) #two inputs, two hidden, one output
plot.nnet(mod.in,struct=struct)

Note the comparability with the figure created using the neuralnet package. That is, larger weights have thicker lines and color indicates sign (+ black, - grey). One of these days I'll actually put these functions in a package. In the meantime, please let me know if any bugs are encountered.

Cheers,

Marcus

Update: I've changed the function to work with neural networks created using the train function from the caret package. The link above is updated but you can also grab it here.
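(As an aside on the manual weight input described above, the expected length of the weight vector follows directly from struct, with one bias weight per hidden and output node. A minimal sketch, where n_wts is a hypothetical helper name of mine and not part of plot.nnet:)

```r
#expected weight vector length for a single-hidden-layer network,
#struct = c(inputs, hidden, outputs); each hidden and output node
#carries one bias weight ('n_wts' is a hypothetical helper)
n_wts <- function(struct){
  (struct[1] + 1) * struct[2] + (struct[2] + 1) * struct[3]
}
n_wts(c(8, 10, 2))  #112: 100 connection weights plus 12 bias weights
n_wts(c(2, 2, 1))   #9, the length of the mod.in vector above
```

Checking length(wts.in) against this value is an easy way to catch ordering or counting mistakes before plotting.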
library(caret)
mod4<-train(Y1~.,method='nnet',data=dat.in,linout=T)
plot.nnet(mod4,nid=T)

Also, factor levels are now correctly plotted if using the nnet function.

fact<-factor(sample(c('a','b','c'),size=num.obs,replace=T))
form.in<-formula('cbind(Y2,Y1)~X1+X2+X3+fact')
mod5<-nnet(form.in,data=cbind(dat.in,fact),size=10,linout=T)
plot.nnet(mod5,nid=T)

Update 2: More updates… I've now modified the function to plot multiple hidden layers for networks created using the mlp function in the RSNNS package and the neuralnet function in the neuralnet package. As far as I know, these are the only neural network functions in R that can create multiple hidden layers. All others use a single hidden layer. I have not tested the plotting function using manual input for the weight vectors with multiple hidden layers. My guess is it won't work, but I can't be bothered to change the function unless it's specifically requested. The updated function can be grabbed here (all above links to the function have also been changed).

library(RSNNS)

#neural net with three hidden layers, 9, 11, and 8 nodes in each
mod<-mlp(rand.vars, resp, size=c(9,11,8),linOut=T)
par(mar=numeric(4),family='serif')
plot.nnet(mod)

Here's an example using the neuralnet function with binary predictors and categorical outputs (credit to Tao Ma for the model code).

library(neuralnet)

#response
AND<-c(rep(0,7),1)
OR<-c(0,rep(1,7))

#response with predictors
binary.data<-data.frame(expand.grid(c(0,1),c(0,1),c(0,1)),AND,OR)

#model
net<-neuralnet(AND+OR~Var1+Var2+Var3,binary.data,hidden=c(6,12,8),rep=10,err.fct="ce",linear.output=FALSE)

#plot output
par(mar=numeric(4),family='serif')
plot.nnet(net)

Update 3: The color vector argument (circle.col) for the nodes was changed to allow a separate color vector for the input layer. The following example shows how this can be done using relative importance of the input variables to color-code the first layer.
#example showing use of separate colors for input layer
#color based on relative importance using 'gar.fun'

##
#create input data
seed.val<-3
set.seed(seed.val)

num.vars<-8
num.obs<-1000

#input variables
library(clusterGeneration)
cov.mat<-genPositiveDefMat(num.vars,covMethod=c("unifcorrmat"))$Sigma
rand.vars<-mvrnorm(num.obs,rep(0,num.vars),Sigma=cov.mat)

#output variables
parms<-runif(num.vars,-10,10)
y1<-rand.vars %*% matrix(parms) + rnorm(num.obs,sd=20)

#final datasets
rand.vars<-data.frame(rand.vars)
resp<-data.frame(y1)
names(resp)<-'Y1'
dat.in<-data.frame(resp,rand.vars)

##
#create model
library(nnet)
mod1<-nnet(rand.vars,resp,data=dat.in,size=10,linout=T)

##
#relative importance function
library(devtools)
source_url('https://gist.github.com/fawda123/6206737/raw/2e1bc9cbc48d1a56d2a79dd1d33f414213f5f1b1/gar_fun.r')

#relative importance of input variables for Y1
rel.imp<-gar.fun('Y1',mod1,bar.plot=F)$rel.imp

#color vector based on relative importance of input values
cols<-colorRampPalette(c('green','red'))(num.vars)[rank(rel.imp)]

##
#plotting function
source_url('https://gist.githubusercontent.com/fawda123/7471137/raw/466c1474d0a505ff044412703516c34f1a4684a5/nnet_plot_update.r')

#plot model with new color vector
#separate colors for input vectors using a list for 'circle.col'
plot(mod1,circle.col=list(cols,'lightblue'))
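The rank-based color assignment in the step above can be illustrated on its own. A minimal sketch with made-up importance values and hypothetical object names (imp.demo, cols.demo), so it doesn't clash with the rel.imp and cols objects created above:

```r
#sketch of the rank-based palette indexing used for 'circle.col':
#generate one color per variable along a green-to-red ramp, then index
#by the rank of each (made-up) importance value, so the most important
#variable gets the reddest node color
imp.demo<-c(0.2,-0.5,0.9,0.1)
cols.demo<-colorRampPalette(c('green','red'))(length(imp.demo))
cols.demo[rank(imp.demo)]
```

Because rank(imp.demo) is c(3,1,4,2) here, the third variable (largest importance) receives the last, reddest color in the ramp.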