I am currently trying to make my code DRYer by rewriting some parts as functions. One of the functions I am using is:
datasetperuniversity <- function(university, year) {
  assign(
    paste("data", university, sep = ""),
    subset(
      get(paste("originaldata", year, sep = "")),
      get(paste("allcollaboration", university, sep = "")) == 1
    )
  )
}
Calling datasetperuniversity("Harvard","2000") would, inside the function, amount to something like this:
dataHarvard=subset(originaldata2000,allcollaborationHarvard==1)
The function runs nearly perfectly, except that it does not store the result in dataHarvard. I read that this is normal inside functions and that using <<- instead of = could solve the issue. However, since I am using assign, this is not really possible, because the = is just the outcome of the assign call.
Here is some data:
sales = c(2, 3, 5,6)
numberofemployees = c(1, 9, 20,12)
allcollaborationHarvard = c(0, 1, 0,1)
originaldata = data.frame(sales, numberofemployees, allcollaborationHarvard)
Generally, it's best not to embed data/a variable into the name of an object. So instead of using assign to dataHarvard, make a list data with an element called "Harvard":
# enumerate unis, attaching names for lapply to use
unis = setNames(, "Harvard")
# make a table for each subset with lapply
data = lapply(unis, function(x)
  originaldata[originaldata[[ paste0("allcollaboration", x) ]] == 1, ]
)
which gives
> data
$Harvard
  sales numberofemployees allcollaborationHarvard
2     3                 9                       1
4     6                12                       1
As seen here, you can use DF[["column name"]] to access a column instead of get as in the OP. Also, see the note in ?subset:
Warning
This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
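Applied to the function in the question, that advice might look like the sketch below. It assumes the function should return the subset and leave the naming to the caller; subset_by_university and its arguments are hypothetical names, not something from the OP.

```r
# Sample data from the question
originaldata <- data.frame(
  sales = c(2, 3, 5, 6),
  numberofemployees = c(1, 9, 20, 12),
  allcollaborationHarvard = c(0, 1, 0, 1)
)

# Hypothetical helper: return the rows where the university's flag column is 1.
# which() drops NA flags, which matches what subset() does.
subset_by_university <- function(df, university) {
  flag <- df[[paste0("allcollaboration", university)]]
  df[which(flag == 1), , drop = FALSE]
}

# The caller decides where the result is stored
dataHarvard <- subset_by_university(originaldata, "Harvard")
```

Because the function returns a value, there is no need for assign or <<-, and the same call can be used inside lapply to fill a list.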
Generally, it's also better not to embed data in column names if possible. If the allcollaboration* columns are mutually exclusive, they can be collapsed to a single categorical variable with values like "Harvard", "Yale", etc. Alternately, it might make sense to put the data in long form.
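A rough sketch of the collapsing step, in base R. The allcollaborationYale column and its values are made up for illustration, and the code assumes exactly one flag is 1 in every row:

```r
# Made-up example: the Yale column is not part of the original data
originaldata <- data.frame(
  sales = c(2, 3, 5, 6),
  numberofemployees = c(1, 9, 20, 12),
  allcollaborationHarvard = c(0, 1, 0, 1),
  allcollaborationYale = c(1, 0, 1, 0)
)

# Names of all flag columns
flag_cols <- grep("^allcollaboration", names(originaldata), value = TRUE)

# For each row, find the flag column holding the 1
# (assumes the flags are mutually exclusive)
which_flag <- max.col(as.matrix(originaldata[flag_cols]), ties.method = "first")

# Strip the prefix to get the university name as a single variable
originaldata$university <- sub("^allcollaboration", "", flag_cols[which_flag])

# Drop the flag columns, then split into one table per university
tidy <- originaldata[setdiff(names(originaldata), flag_cols)]
data <- split(tidy, tidy$university)
```

With a single university column, data$Harvard and data$Yale come from one split call, and adding another university does not require a new column.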
For more guidance on arranging data, I recommend Hadley Wickham's tidy data paper.