# Collinearity and stepwise VIF selection

Collinearity, or excessive correlation among explanatory variables, can complicate or prevent the identification of an optimal set of explanatory variables for a statistical model. For example, forward or backward selection of variables could produce inconsistent results, variance partitioning analyses may be unable to identify unique sources of variation, or parameter estimates may include substantial amounts of uncertainty. The temptation to build an ecological model using all available information (i.e., all variables) is hard to resist. Lots of time and money are exhausted gathering data and supporting information. We also hope to identify every significant variable to more accurately characterize relationships with biological relevance. Analytical limitations related to collinearity require us to think carefully about the variables we choose to model, rather than adopting a naive approach where we blindly use all information to understand complexity. The purpose of this blog is to illustrate use of some techniques to reduce collinearity among explanatory variables using a simulated dataset with a known correlation structure.

A simple approach to identify collinearity among explanatory variables is the use of variance inflation factors (VIF). VIF calculations are straightforward and easily comprehensible; the higher the value, the higher the collinearity. A VIF for a single explanatory variable is obtained using the r-squared value of the regression of that variable against all other explanatory variables:

$\displaystyle VIF_j=\frac{1}{1-R_j^2}$

where the $VIF$ for variable $j$ is the reciprocal of the inverse of $R^2$ from the regression. A VIF is calculated for each explanatory variable and those with high values are removed. The definition of ‘high’ is somewhat arbitrary but values in the range of 5-10 are commonly used.

Several packages in R provide functions to calculate VIF: vif in package HH, vif in package car, VIF in package fmsb, vif in package faraway, and vif in package VIF. The number of packages that provide VIF functions is surprising given that they all seem to accomplish the same thing. One exception is the function in the VIF package, which can be used to create linear models using VIF-regression. The nuts and bolts of this function are a little unclear since the documentation for the package is sparse. However, what this function does accomplish is something that the others do not: stepwise selection of variables using VIF. Removing individual variables with high VIF values is insufficient in the initial comparison using the full set of explanatory variables. The VIF values will change after each variable is removed. Accordingly, a more thorough implementation of the VIF function is to use a stepwise approach until all VIF values are below a desired threshold. For example, using the full set of explanatory variables, calculate a VIF for each variable, remove the variable with the single highest value, recalculate all VIF values with the new set of variables, remove the variable with the next highest value, and so on, until all values are below the threshold.

In this blog we’ll use a custom function for stepwise variable selection. I’ve created this function because I think it provides a useful example for exploring stepwise VIF analysis. The function is a wrapper for the vif function in fmsb. We’ll start by simulating a dataset with a known correlation structure.

 require(MASS) require(clusterGeneration) set.seed(2) num.vars<-15 num.obs<-200 cov.mat<-genPositiveDefMat(num.vars,covMethod="unifcorrmat")$Sigma rand.vars<-mvrnorm(num.obs,rep(0,num.vars),Sigma=cov.mat)  We’ve created fifteen ‘explanatory’ variables with 200 observations each. The mvrnorm function (MASS package) was used to create the data using a covariance matrix from the genPositiveDefMat function (clusterGeneration package). These functions provide a really simple approach to creating data matrices with arbitrary correlation structures. The covariance matrix was chosen from a uniform distribution such that some variables are correlated while some are not. A more thorough explanation about creating correlated data matrices can be found here. The correlation matrix for the random variables should look very similar to the correlation matrix from the actual values (as sample size increases, the correlation matrix approaches cov.mat). Now we create our response variable as a linear combination of the explanatory variables. First, we create a vector for the parameters describing the relationship of the response variable with the explanatory variables. Then, we use some matrix algebra and a randomly distributed error term to create the response variable. This is the standard form for a linear regression model.  parms<-runif(num.vars,-10,10) y<-rand.vars %*% matrix(parms) + rnorm(num.obs,sd=20)  We would expect a regression model to indicate each of the fifteen explanatory variables are significantly related to the response variable, since we know the true relationship of y with each of the variables. However, our explanatory variables are correlated. What happens when we create the model?  lm.dat<-data.frame(y,rand.vars) form.in<-paste('y ~',paste(names(lm.dat)[-1],collapse='+')) mod1<-lm(form.in,data=lm.dat) summary(mod1)  The model shows that only four of the fifteen explanatory variables are significantly related to the response variable (at $\alpha = 0.05$), yet we know that every one of the variables is related to y. As we’ll see later, the standard errors are also quite large. We can try an alternative approach to building the model that accounts for collinearity among the explanatory variables. We can implement the custom VIF function as follows.  vif_func(in_frame=rand.vars,thresh=5,trace=T) var vif X1 27.7352782054202 X2 36.8947196546879 X3 12.5694198086941 X4 50.7385544899845 X5 8.35069942629285 X6 114.685122241139 X7 67.3415420139211 X8 153.597012767649 X9 48.226662808638 X10 50.7371404106266 X11 33.9720046917178 X12 43.2541022358368 X13 12.0823286959991 X14 74.6186892947576 X15 29.8722459010406 removed: X8 153.597 var vif X1 6.67306561938667 X2 7.98347501302268 X3 4.56187657632574 X4 8.03048468634153 X5 7.70736760805487 X6 19.6743072270573 X7 52.9521670096974 X9 17.8683960730445 X10 46.2484642889962 X11 18.2479446141727 X12 42.133697798185 X13 10.8973377491163 X14 37.9296952803818 X15 21.5847028917955 removed: X7 52.95217 var vif X1 6.54376168051204 X2 7.68236114754164 X3 4.04873004990332 X4 5.08958904348524 X5 2.65685239947949 X6 9.12685384862522 X9 2.89940351012031 X10 4.24712217472346 X11 4.45202381077724 X12 12.8835110845825 X13 1.92759852488083 X14 6.02382346000219 X15 8.33332235677386 removed: X12 12.88351 var vif X1 4.21690743815539 X2 6.88058249786417 X3 3.88265091747854 X4 4.48146995293666 X5 2.20144300270463 X6 7.76775127218149 X9 2.71324446993905 X10 2.90517847805482 X11 4.43541888871566 X13 1.78221291029774 X14 5.02289299193397 X15 3.02196822776011 removed: X6 7.767751 [1] "X1" "X2" "X3" "X4" "X5" "X9" "X10" "X11" "X13" "X14" "X15"  The function uses three arguments. The first is a matrix or data frame of the explanatory variables, the second is the threshold value to use for retaining variables, and the third is a logical argument indicating if text output is returned as the stepwise selection progresses. The output indicates the VIF values for each variable after each stepwise comparison. The function calculates the VIF values for all explanatory variables, removes the variable with the highest value, and repeats until all VIF values are below the threshold. The final output is a list of variable names with VIF values that fall below the threshold. Now we can create a linear model using explanatory variables with less collinearity.  keep.dat<-vif_func(in_frame=rand.vars,thresh=5,trace=F) form.in<-paste('y ~',paste(keep.dat,collapse='+')) mod2<-lm(form.in,data=lm.dat) summary(mod2)  The updated regression model is much improved over the original. We see an increase in the number of variables that are significantly related to the response variable. This increase is directly related to the standard error estimates for the parameters, which look at least 50% smaller than those in the first model. The take home message is that true relationships among variables will be masked if explanatory variables are collinear. This creates problems in model creation which lead to complications in model inference. Taking the extra time to evaluate collinearity is a critical first step to creating more robust ecological models. Function and example code: ## 53 thoughts on “Collinearity and stepwise VIF selection” 1. Visualizing neural networks from the nnet package – R is my friend 2. Visualizing neural networks from the nnet package | spider's space 3. Fred Mertz says: This is a good idea, but why compute VIF = 1/(1-R^2) ? This is a strictly increasing function of R^2, so why not just use R^2 ? 4. This is an excellent method for automating feature selection in multivariate linear models. I was able to significantly improve predictability by applying this to the input frame before LM model fitting. Thanks. • Excellent, I’m happy my code was useful! 5. IaIe says: Thank you for sharing your code, I find it very helpful! I used it to detect collinearity among variables calculated for about 50 objects. The code worked perfectly when the initial number of variable was less than 30. When I tried with more variables (up to 100) for the same 50 objects, I got the following error: Error in if (vif_max < thresh) { : missing value where TRUE/FALSE needed I am not expert at using R and I'd like to ask you whether you know the possible cause of this problem. Could it really depend on the number of variables included? Thank you again for the code!! • Hi there, sorry you’re having problems. I can’t reproduce the error using my code: set.seed(2) num.vars<-100 num.obs<-200 cov.mat<-genPositiveDefMat(num.vars,covMethod=c("unifcorrmat"))$Sigma
rand.vars<-mvrnorm(num.obs,rep(0,num.vars),Sigma=cov.mat)

#remove collinear variables and recreate lm
vif_func(in_frame=rand.vars,thresh=5,trace=T)


Could you provide an example with your code?

• IaIe says:

Hi, I did not create the ‘explanatory’ variables (rand.vars), but I used an experimental dataset (X) where each variable represents a certain characteristic of the objects. So I used your code as such on my X matrix, but first I removed all variables with standard deviation equal to zero (same value across all objects). Successively I ran the code on all 100 remaining variables; as I got that error, I only included the first 30 variables and then it worked.
Anyway, as it is working with your 100 ‘random’ variables, the problem must lie in the values of my variables. I will control the dataset better and try again!

• I’d be curious to hear your solution, as the problem very well could be with my function.

• MTV says:

I am receiving the same error message IaIe got: Error in if (vif_max < thresh) { : missing value where TRUE/FALSE needed. I am importing a .csv with 138 explanatory variables, which I don't believe is an issue given your example. There are some variables where every observation is the same. However, I cannot afford to kick out any variables with a standard devation equal to zero like IaIe did. I was wondering if you all found a solution to this by chance? Thanks!

• Hi MTV,

Again, I cannot reproduce the error using my code. I used the VIF function with a simulated dataset that contained constant variables (sd = 0) in addition to continuous random variables and did not have a problem:

set.seed(3)
num.vars<-50
num.obs<-200

#continuous random variables, corrrelated
cov.mat<-genPositiveDefMat(num.vars,covMethod=c("unifcorrmat"))$Sigma rand.vars<-mvrnorm(num.obs,rep(0,num.vars),Sigma=cov.mat) #constant variables, sd=0 const.vals<-runif(50,-10,10) const.vals<-matrix(const.vals,nrow=num.obs,ncol=length(const.vals),byrow=T) #append all variables, randomize column order rand.vars<-cbind(rand.vars,const.vals) rand.vars<-rand.vars[,sample(1:ncol(rand.vars))] #run function vif_func(in_frame=rand.vars,thresh=5,trace=T)  Is there some way you could post a sample dataset that causes the error? I suspect the problem is not major but I can’t fix it unless I can reproduce the error. • Nara says: Same error. Error in if (vif_max < thresh) { : missing value where TRUE/FALSE needed To debug I put a print commans in ur function vif_func at these positions.: #get initial vif value for all comparisons of variables vif_init<-NULL var_names <- names(in_frame) for(val in var_names){ regressors <- var_names[-which(var_names == val)] form <- paste(regressors, collapse = '+') form_in <- formula(paste(val, '~', form)) vif_init<-rbind(vif_init, c(val, VIF(lm(form_in, data = in_frame, …)))) } print (vif_init[,2]) vif_max<-max(as.numeric(vif_init[,2])) print(vif_max) ————————————————————————– vif_init[,2] had a lot of Inf and NaN values. vif_max was NaN. Link to the data set: https://s3.amazonaws.com/fuel-efficiency-modeling/CAX_Train_1.zip • Hi Nara, Can you please post some code for the dataset you linked to reproduce the error? Thanks, Marcus • Err says: The error: Error in if (vif_max < thresh) { : missing value where TRUE/FALSE needed Will occur if the variables include non-numeric (i.e. nominal variables). Compliments to the author🙂 • Ori says: I also had this problem when I tried to use LOO validation and VIF, and the one row that I excluded made one of my regressors to be a fixed column. I did not used factors, only numeric. Think about a case where a binary column with 100 rows, has a 99 ‘1’ and 1 ‘0’. It will not work. After I switch some of the ‘1’ to become ‘0’ (not many, just to even it up a bit), it worked. • Jason says: Hi, this function is very cool. However, I have a question with your VIF function. According to the fmsb VIF manual, when calculating the VIF for the val in the for loop, the lm function should not be regressed on the whole variable set (your function is form_in<-formula(paste(val,' ~ .')) VIF(lm(form_in,data=in_frame,…) ) But rather should be on all the other variables, excluding the val ( var_names <-names(in_frame) regressors <- var_names[- which(var_names == val)] form <- paste(regressors, collapse = '+') form_in <- formula(paste(val, '~', form) VIF(lm(form_in,data=in_frame,…) ) Please let me know if I'm wrong or anything. 6. Marcel says: Hi there, great blog and Happy New Years! The function is working fine with my dataset, but when trying to create your sample data, I get an error which is due to the variable/matrix/dataset “y” in your code is only a one column matrix with 200 rows – created in line 74 (y<-rand.vars %*% matrix(parms) + rnorm(num.obs,sd=20)) Is this how it should be? And what is supposed to be %*% doing? Hope you have time to answer this. Thanks, M 7. Marcel says: Sorry for bothering, I just found my mistake: The copy I received had in line 77 only lm.dat<-data.frame(y) and not lm.dat<-data.frame(y,rand.vars)… So please delete the following entry as well – all is fine • Thanks for reading and I’m glad you like the blog! 8. Hans says: Thank you so much for this great tool! 9. Thanks so much! I was removing explanatory variables with high VIFs all by hand… This has saved me so much time, and it worked in one go, without hassle. Thumbs up. • You’re welcome! 10. JC says: Hi, Thank you!! Your function worked perfectly with my data 11. HV says: Hi, I am running your code to try to reduce the number of bioclimatic variables that may be collinear for certain species. Sometimes it seems to work fine, but other times I have infinite values until about the last vif run, leaving usually 1 or 2 variables out of 20, when using a threshold of 10. Is this right to have these infinite values for so many of the vif runs? Perhaps my data frame requires transformation first? Any insight you have would be appreciated Thanks • Hi there, An infinite VIF will be returned for two variables that are exactly collinear… variables that are exactly the same or linear transformations of each other. Perhaps check your variables for exact collinearity, using the pairs function or something similar (e.g., ggpairs). -Marcus 12. J says: Hi, Great code, it works well for me, however not well when i have ‘NA’ values. Any idea how i could deal with this? Thanks J • Hi J, I tried running the function with missing data and it works fine for me. The part of the function that handles NA values would be the call to lm, specifically the na.action argument where the default is na.omit. I added an additional argument to the vif_func so you can change this option, though you shouldn’t need to. You could also remove the missing values in your data before using vif_func using na.omit, though you will get a slightly different result since this would remove all rows with incomplete data. Here’s the example I did, note the difference in the final VIF values for the two approaches. # create data set.seed(2) num.vars<-15 num.obs<-200 cov.mat<-genPositiveDefMat(num.vars,covMethod="unifcorrmat")$Sigma
rand.vars<-mvrnorm(num.obs,rep(0,num.vars),Sigma=cov.mat)
parms<-runif(num.vars,-10,10)
y<-rand.vars %*% matrix(parms) + rnorm(num.obs,sd=20)

# add missing values to the data
nr <- nrow(rand.vars)
nc <- ncol(rand.vars)
rand.vars2 <- c(rand.vars)
rand.vars2[sample(1:length(rand.vars2), 100)] <- NA
rand.vars2 <- matrix(rand.vars2, ncol = nc, nrow = nr)

# use function by removing missing values first
vif_func(in_frame=na.omit(rand.vars2),thresh=5,trace=T)

# run on whole dataset, default for na.action is na.omit, so you don't have to specify
vif_func(in_frame=rand.vars2,thresh=5,trace=T, na.action = na.omit)

13. Mark says:

Hi,
I am running your code and with your data it works fine.
Now I have a data.frame with num and int Variables.
And I am getting the message:
“Error in if (vif_max < thresh) { : missing value where TRUE/FALSE needed"
But there are only numbers, I dont have factors defined or characters.
Do you have any idea, what might be the reason?
The integer values are (0/1)-Results from questions, but I only use the numbers not the translation (no/yes).
Any idea would be very apprechiated
THX

• Hi Mark,

Could you provide some example data that causes the problem? This is hard to diagnose if I can’t reproduce the issue.

Thanks,

Marcus

14. I am getting memory errors that are similar to ones discussed in http://stackoverflow.com/questions/25672974/caught-segfault-error-in-r which I have copied here. Do you have any workaround on this ?

*** caught segfault ***
address 0x7f8e32c2d554, cause ‘memory not mapped’

Traceback:
1: terms.formula(formula, data = data)
2: terms(formula, data = data)
3: model.frame.default(formula = form_in, data = in_frame, drop.unused.levels = TRUE)
4: stats::model.frame(formula = form_in, data = in_frame, drop.unused.levels = TRUE)
5: eval(expr, envir, enclos)
6: eval(mf, parent.frame())
7: lm(form_in, data = in_frame, …)
8: summary(X)
9: VIF(lm(form_in, data = in_frame, …))
10: rbind(vif_init, c(val, VIF(lm(form_in, data = in_frame, …))))

• Hm, I’ve never seen that error before. I don’t know what the issue is but it may help if the vectors in the function are pre-allocated. I was lazy when I wrote the initial function so the output for each loop simply makes a new copy with each iteration – an inefficient method, expecially in R. Check out this blog for an explanation. I rewrote the vif function with pre-allocated vectors:

#stepwise VIF function with preallocated vectors
vif_func<-function(in_frame,thresh=10,trace=T,...){

require(fmsb)

if(class(in_frame) != 'data.frame') in_frame<-data.frame(in_frame)

#get initial vif value for all comparisons of variables
vif_init<-vector('list', length = ncol(in_frame))
names(vif_init) <- names(in_frame)
var_names <- names(in_frame)

for(val in var_names){
regressors <- var_names[-which(var_names == val)]
form <- paste(regressors, collapse = '+')
form_in<-formula(paste(val,' ~ .'))
vif_init[[val]]<-VIF(lm(form_in,data=in_frame,...))
}
vif_max<-max(unlist(vif_init))

if(vif_max < thresh){
if(trace==T){ #print output of each iteration
prmatrix(vif_init,collab=c('var','vif'),rowlab=rep('',nrow(vif_init)),quote=F)
cat('\n')
cat(paste('All variables have VIF < ', thresh,', max VIF ',round(vif_max,2), sep=''),'\n\n')
}
return(names(in_frame))
}
else{

in_dat<-in_frame

#backwards selection of explanatory variables, stops when all VIF values are below 'thresh'
while(vif_max >= thresh){

vif_vals<-vector('list', length = ncol(in_dat))
names(vif_vals) <- names(in_dat)
var_names <- names(in_dat)

for(val in var_names){
regressors <- var_names[-which(var_names == val)]
form <- paste(regressors, collapse = '+')
form_in<-formula(paste(val,' ~ .'))
}

max_row<-which.max(vif_vals)[1]

vif_max<-vif_vals[max_row]

if(vif_max<thresh) break

if(trace==T){ #print output of each iteration
vif_vals <- do.call('rbind', vif_vals)
vif_vals
prmatrix(vif_vals,collab='vif',rowlab=row.names(vif_vals),quote=F)
cat('\n')
cat('removed: ', names(vif_max),unlist(vif_max),'\n\n')
flush.console()
}

in_dat<-in_dat[,!names(in_dat) %in% names(vif_max)]

}

return(names(in_dat))

}

}


Hope that helps.

-Marcus

• Yep, that was probably the issue. Glad you figured it out though.

15. Darwin says:

Perfect, thank you for sharing!

16. Salej Sood says:

Can someone help in analysis of sequential path analysis using grain yield as dependent variable and its component traits as independent variables in R

17. First of all thanks to share your code! And man.. great ideia to do something like this. I enjoy so much with the method, congratulations for the method!

18. Alejandro Castro says:

Hi,

I just wondered if this VIF could help in the feature selection for classification type-models (neural networks) or is just limited to regression type-models?

Thanks,

Alejandro

• Hi Alejandro,

The VIF selection could help with fitting neural network models. It will basically tell you which variables are collinear and if they should be removed before creating a model. Collinearity is a problem in most models, not limited to regression. You could try running VIF selection to identify/remove collinear variables prior to creating a neural network. Alternatively, the RSNNS package provides a method for ‘pruning’ a neural network to remove variables that have little influence in the model. I suppose this is similar to variable selection with VIF because it might remove collinear variables, in addition to variables that aren’t related to the response. You could try it both ways to see if you get similar results. There is an example in the help file for the plotnet function that shows how to create and plot a pruned network. Again, you’ll have to download the development version to see this.

-Marcus

19. Jason says:

Hi, this function is very cool.
However, I have a question with your VIF function. According to the fmsb VIF manual, when calculating the VIF for the val in the for loop, the lm function should not be regressed on the whole variable set (your function is
form_in<-formula(paste(val,' ~ .'))
VIF(lm(form_in,data=in_frame,…)
)
But rather should be on all the other variables, excluding the val (
var_names <-names(in_frame)
regressors <- var_names[- which(var_names == val)]
form <- paste(regressors, collapse = '+')
form_in <- formula(paste(val, '~', form)
VIF(lm(form_in,data=in_frame,…)
)
Please let me know if I'm wrong or anything.

• Ah, you’re absolutely correct. Thanks for pointing that out. I’ve made changes to the function.

20. Hi,
I took the mtcars data set.
After that, i used VIF and it eliminated disp, wt, hp, cyl
I ran the model now and i got AdjR2= 0.773.

Here is the code. Not sure why R2 value has dropped after applying VIF. Am i missing something?

#Prior to VIF
summary(lm(mpg~., data=mtcars))
Multiple R-squared: 0.869, Adjusted R-squared: 0.8066

#call the function
vif_func<-function(in_frame,thresh=10,trace=T,…){

}

#Use the VIF
require(MASS)
require(clusterGeneration)
require(fmsb)
x <- mtcars[,2:11]
y <- mtcars[,1]
keep.dat <- vif_func(in_frame=x,thresh=5,trace=T)
form.in<-paste('y ~',paste(keep.dat,collapse='+'))
fit<-lm(form.in,data=mtcars)
summary(fit)
Multiple R-squared: 0.8173, Adjusted R-squared: 0.7735

• Hi Rajagopalan,

I think a more general problem that might be causing confusion is the type of information desired from a regression model. I would imagine that using more variables could lead to higher R2 values, particularly for collinear variables, as you would be overfitting the model and lowering the residual error. However, your ability to describe relationships between the response variable and each individual explanatory would of course be compromised. I would suggest that removing collinear variables is more important for understanding relationships with the response variable than having large effects on the explained variance. The change is relatively small and both models are not statistically different so I wouldn’t worry too much about minor changes in R2. You might also notice that changing the threshold to 10 raises the R2 value for the second model.

Comparison of the two models shows that they are not statistically different:

library(MASS)

#Prior to VIF
x <- mtcars[,2:11]
y <- mtcars[,1]
form.in<-paste('y ~',paste(names(x),collapse='+'))
fit1 <- lm(form.in, data=mtcars)

# after VIF
keep.dat <- vif_func(in_frame=x,thresh=5,trace=T)
form.in<-paste('y ~',paste(keep.dat,collapse='+'))
fit2 <-lm(form.in,data=mtcars)

# compare
anova(fit1, fit2)


-Marcus

21. Hi,
I tried this VIF using the above code on the longley dataset.

Case 1. Without VIF, model showed Population, GNP, GNP.deflator not statistically significant by looking at the p-value.
lm(formula = Employed ~ ., data = longley)

Multiple R-squared: 0.9955, Adjusted R-squared: 0.9925

Case 2. I tried the VIF using the above function, It has removed GNP, GNP.deflator and Year. Whereas the Year variable was highly significant without VIF, p-value was 0.003037.

#If VIF is more than 10, multicolinearity is strongly suggested.
VIF(lm(Employed~., data=longley))
VIF is 221 using fmsb package.

keep.dat <- vif_func(in_frame=longley[,-7],thresh=5,trace=T)
form.in<-paste('Employed ~',paste(keep.dat,collapse='+'))
form.in
#Build the model
fit<-lm(form.in,data=longley)
summary(fit)

Multiple R-squared: 0.9696, Adjusted R-squared: 0.962 (using usdm pkg)
Multiple R-squared: 0.9696, Adjusted R-squared: 0.962 (using fmsb pkg)

Questions:
1. Why the VIF removed the Year while doing VIF, since it was highly significant without applying VIF.

2. When there are multi-co linearity between two predictors, should we not remove one and retain the other. Here it seems to be removing both the variables.

Thanks,
Raj

• Hi Raj,

Have a look at a pairs plot for the longley dataset:

pairs(longley)


The VIF function removed all variables except Unemployed, Armed.Foreces, and Population. You’ll notice that the removed variables (Year, GNP.deflator, and GNP) are all collinear with Population. According to the VIF selection criteria, only Population was retained among the four collinear variables.

To answer your first question, year was shown as highly significant in the first model but population was not. After removing year, population is now significant because the two were collinear. This shows that you can get different conclusions from a model if variables are redundant. That’s not to say you can link causal relationships between any of the variables but collinearity can lead to different significance values by including all in the model.

For the second question, both GNP and GNP.deflator were also correlated with Population and, therefore, removed. Again, redundancy among variables lead to different conclusions in the original model.

Hope that helps.

-Marcus

22. Hi Marcus, first of all thank you very much….this blog is REALLY AMAZING. I just wanted to ask you whether there may be any issue in using this function with 507 variables each of length 127. I expect about 40% of the variables to be highly correlated among each other. I did not get any error message but I didn’t get any output either..I have tried to reduce the sample to 100 variables and results show up after about 1.5 minutes. I have tried 3 time with the full sample and after about 4 hours I still didn’t get any results, so I sarted wondering whether I am doing something wrong. Do you have any idea?

Thank you very much again,
Matteo

• Hi Matteo,

I have not tested it with very large datasets but my guess is you’re having problems with sample size. The function runs stepwise regressions for each variable to estimate the VIF. Looking at your sample size vs number of variables, the function would be estimating regressions where the number of observations is much less than the number of variables. This is basically trying to fit regressions with insufficient degrees of freedom to estimate the model parameters. I don’t think you can use the function with this dataset, unless you had a substantially larger sample size.

-Marcus

23. Bani Tamraz says:

Thank you so much for this great too!! It works beautifully!

24. Srihita says:

Hi, I see that we can’t use the above function for categorical variables. Please tell me if there is a method to check for collinearity in categorical variables as well.

• Hi Srihita,

The variance inflation factor for each variable is from the R2 value of the linear regression of each variable as a function of all other variables. It doesn’t make sense to estimate these values for categorical variables, unless they can be modeled with a GLM with a known distribution family for each categorical variable (e.g., Poisson, Binomial). I don’t think the VIF function in the fmsb package is meant to handle these types of data. I think you would be better off just looking at pairs plots of the explanatory variables to see if there are any patterns worth noting, then deciding for yourself which variables to include. The ‘ggpairs’ plot of GGally can plot both continuous and categorical data in a pairs plot. Hope that helps.

-Marcus

25. Alexandre Santos says:

There are number of variable number? Beacuse I try to use with 1555 variables and vif_func() function doesn’t work. Thanks

• There shouldn’t be any limit. How many observations do you have for each variable? You should have much more observations than variables.