R code: Extracting highly correlated variables and running a multivariate
regression model with the selected variables
I have a large data set with about 2,000 variables and about 10,000
observations. Initially, I wanted to run a regression model for each variable,
using the other 1,999 as independent variables, and then do stepwise model
selection. That would give me 2,000 models.
Unfortunately, R threw errors because it ran out of memory. So, as an
alternative, I have tried to remove the independent variables that have a
low correlation with the dependent variable (say, lower than .5).
For each dependent variable, I would then like to run a regression model
using only the variables that are highly correlated with it.
I tried the following code, but even the "melt" function doesn't work on my
data because of the memory issue.
library(reshape2)  # provides melt()
test <- data.frame(X1 = rnorm(50, mean = 50, sd = 10),
                   X2 = rnorm(50, mean = 5, sd = 1.5),
                   X3 = rnorm(50, mean = 200, sd = 25))
test$X1[10] <- 5
test$X2[10] <- 5
test$X3[10] <- 530
corr <- cor(test)
diag(corr) <- NA             # drop the self-correlations
corr[upper.tri(corr)] <- NA  # keep each pair only once
melt(corr)
# it doesn't work with my own data, because of lack of memory.
Please help me, and thank you so much in advance!
In such a situation it might be worth trying sparsity-inducing techniques such as the Lasso. Here a sparse subset of variables is selected by constraining the sum of the absolute values of the regression coefficients.
This will give you a reduced subset of variables which are the most relevant (and, because of the way the Lasso algorithm works, these also tend to be the ones most correlated with the response, which is what you were looking for).
In R you can use the lars package, and information about the Lasso can be found here: http://www-stat.stanford.edu/~tibs/lasso.html
Also a very good resource is: http://www-stat.stanford.edu/~tibs/ElemStatLearn/
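To make the suggestion concrete, here is a minimal sketch of a Lasso fit with the lars package (it has to be installed first). Everything about the data is an assumption for illustration: a small simulated matrix stands in for your real one, X1 is arbitrarily taken as the response, and the amount of shrinkage is picked by 10-fold cross-validation, which is just one common choice.

```r
library(lars)

set.seed(1)
n <- 200   # observations (illustrative; yours is about 10,000)
p <- 50    # variables    (illustrative; yours is about 2,000)
X <- matrix(rnorm(n * p), n, p,
            dimnames = list(NULL, paste0("X", 1:p)))

# Hypothetical structure: make X1 depend on X2 and X3 only
X[, "X1"] <- 2 * X[, "X2"] - 1.5 * X[, "X3"] + rnorm(n)

# Treat X1 as the response and all other columns as predictors
y <- X[, "X1"]
x <- X[, colnames(X) != "X1"]

# Fit the whole Lasso path
fit <- lars(x, y, type = "lasso")

# Choose the shrinkage level (fraction of the full L1 norm) by cross-validation
cv <- cv.lars(x, y, K = 10, type = "lasso", plot.it = FALSE)
s_best <- cv$index[which.min(cv$cv)]

# Coefficients at that level; the non-zero ones are the selected variables
beta <- coef(fit, s = s_best, mode = "fraction")
selected <- colnames(x)[beta != 0]
print(selected)
```

To get one model per variable, you could wrap the part from "Treat X1 as the response" onwards in a loop over the column names. That fits one response at a time, so you never have to hold a melted 2,000 x 2,000 correlation table in memory.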