
R Saving function output to object when using assign function


I am currently trying to make my code DRYer by rewriting some parts as functions. One of the functions I am using is:

datasetperuniversity <- function(university, year) {
  assign(
    paste("data", university, sep = ""),
    subset(
      get(paste("originaldata", year, sep = "")),
      get(paste("allcollaboration", university, sep = "")) == 1
    )
  )
}

Calling datasetperuniversity("Harvard","2000") should, inside the function, amount to something like this:

dataHarvard=subset(originaldata2000,allcollaborationHarvard==1)

The function runs nearly perfectly, except that it does not store the result in dataHarvard. I read that this is normal inside functions, and that using <<- instead of = could solve the issue. However, since I am using the assign function, that is not really possible: the = is only what the call to assign amounts to, not something I write myself.

Here is some data:

sales = c(2, 3, 5, 6)
numberofemployees = c(1, 9, 20, 12)
allcollaborationHarvard = c(0, 1, 0, 1)
originaldata = data.frame(sales, numberofemployees, allcollaborationHarvard)

Solution

  • Generally, it's best not to embed data (such as a variable's value) in the name of an object. So instead of using assign to create dataHarvard, make a list called data with an element named "Harvard":

    # enumerate unis, attaching names for lapply to use
    # (with its first argument omitted, setNames() uses the names as the values too)
    unis = setNames(, "Harvard")
    
    # make a table for each subset with lapply
    data = lapply(unis, function(x) 
      originaldata[originaldata[[ paste0("allcollaboration", x) ]] == 1, ]
    )
    

    which gives

    > data
    $Harvard
      sales numberofemployees allcollaborationHarvard
    2     3                 9                       1
    4     6                12                       1
    

    As seen here, you can use DF[["column name"]] to access a column instead of get as in the OP. Also, see the note in ?subset:

    Warning

    This is a convenience function intended for use interactively. For programming it is better to use the standard subsetting functions like [, and in particular the non-standard evaluation of argument subset can have unanticipated consequences.
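
    To tie this back to the question: assign() called inside a function creates the object in that function's own environment by default, so it is gone once the function returns. A minimal sketch of the usual alternative, which returns the subset and assigns it at the call site, might look like this. The helper name subset_for_university is made up for illustration, and the data is the sample from the question:

    ```r
    # Sample data from the question
    sales <- c(2, 3, 5, 6)
    numberofemployees <- c(1, 9, 20, 12)
    allcollaborationHarvard <- c(0, 1, 0, 1)
    originaldata <- data.frame(sales, numberofemployees, allcollaborationHarvard)

    # Hypothetical helper: return the rows whose flag column for the
    # given university equals 1, using standard subsetting with [ and [[
    subset_for_university <- function(df, university) {
      flag <- df[[paste0("allcollaboration", university)]]
      # which() drops NA flags instead of producing all-NA rows
      df[which(flag == 1), ]
    }

    # The assignment happens at the call site, not inside the function
    dataHarvard <- subset_for_university(originaldata, "Harvard")
    ```

    If an object really must be created by name, assign() also takes an envir argument (for example envir = .GlobalEnv), but returning the value keeps the function free of side effects.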

    Generally, it's also better not to embed data in column names if possible. If the allcollaboration* columns are mutually exclusive, they can be collapsed into a single categorical variable with values like "Harvard", "Yale", etc. Alternatively, it might make sense to put the data in long form.
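
    As a rough sketch of the collapsing idea, assuming the flag columns really are mutually exclusive: the allcollaborationYale column below is invented for illustration (the question's sample data only has Harvard), and the column name university is likewise an assumption.

    ```r
    # Sample data from the question, plus a hypothetical Yale flag column
    originaldata <- data.frame(
      sales = c(2, 3, 5, 6),
      numberofemployees = c(1, 9, 20, 12),
      allcollaborationHarvard = c(0, 1, 0, 1),
      allcollaborationYale = c(1, 0, 0, 0)  # invented for this example
    )

    # Find the flag columns by their shared prefix
    flag_cols <- grep("^allcollaboration", names(originaldata), value = TRUE)

    # Collapse the flags into one categorical column; rows with no flag stay NA
    originaldata$university <- NA_character_
    for (col in flag_cols) {
      originaldata$university[originaldata[[col]] == 1] <- sub("^allcollaboration", "", col)
    }

    # One table per university in a named list; split() drops the NA rows
    data <- split(originaldata[c("sales", "numberofemployees")], originaldata$university)
    ```

    With the single university column in place, data[["Harvard"]] plays the role that dataHarvard had in the question.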

    For more guidance on arranging data, I recommend Hadley Wickham's tidy data paper.