Tags: r, aggregate, subset, custom-function

Subsetting a data frame inside a custom function is ignored by aggregate


I have run into a strange problem: aggregate behaves differently when I call it inside a custom function. It seems to completely override the result of subset.

To show the problem, I will break it into two parts. 1. Without a custom function:

    c <- data.frame(A = c("carr","bike","truck","carr","truck","bike","bike","carr","truck","carr","truck","truck","carr","truck","truck"),
                    B = c(10,20,30,23,45,56,78,44,10,20,30,10,20,30,67),
                    D = c(1,2,3,1,2,3,2,3,2,3,2,2,3,2,1))

    c_subset <- subset(c, (A == "carr") | (A == "bike"))

    dg <- aggregate(B ~ D + A, c_subset, max)

The value of dg is:

    D   A       B
    2   bike    78
    3   bike    56
    1   carr    23
    3   carr    44

This is exactly what it should be.

2. With a custom function:

    rtk <- function(datam,inc_coll,inc_vall,lb,ld){
      datam_subset <- subset(c,inc_coll %in% inc_vall)
      dg1<- aggregate(lb ~ ld + inc_coll,datam_subset,max)

      return(dg1)
    }

    c_ans <- rtk(c,c$A,c("carr","bike"),c$B,c$D)

The result is:

    ld  inc_coll    lb
    2   bike        78
    3   bike        56
    1   carr        23
    3   carr        44
    1   truck       67
    2   truck       45
    3   truck       30

Why does "truck" show up in the aggregate result? Inside the function I pass datam_subset to aggregate, and that subset should contain only the "carr" and "bike" rows.

Maybe I am missing something very basic. I would be grateful for your help. Thanks.


Solution

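  • How to pass column names to a function is a frequently asked question because it can be counterintuitive; see the question "Pass a data.frame column name to a function". In your rtk, lb, ld and inc_coll are full-length vectors, not columns of datam_subset. aggregate does not find those names in datam_subset, so it falls back to the full vectors in the function's environment, and "truck" reappears. A better way to write the function is to pass the column names to rtk instead of the columns themselves, and then build the subset and the formula from those names: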

    rtk <- function(datam, inc_coll, inc_vall, lb, ld) {
      ## Access the column with datam[[colname]] to do the subset
      ## (use the datam argument, not the global c)
      datam_subset <- datam[datam[[inc_coll]] %in% inc_vall, ]
      ## Build the formula used by aggregate from the column names
      f <- as.formula(paste0(lb, "~", ld, "+", inc_coll))
      ## Perform the aggregation
      dg1 <- aggregate(f, datam_subset, max)
      return(dg1)
    }
    

    Then call it appropriately using column names:

    c_ans <- rtk(c,"A",c("carr","bike"),"B","D")
    

    Which gives you:

      D    A  B
    1 2 bike 78
    2 3 bike 56
    3 1 carr 23
    4 3 carr 44
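
To see why "truck" sneaks back in, here is a minimal sketch that reproduces the effect outside of any function. The names df, df_subset, leaky and clean are made up for this illustration (df is used instead of c so that base::c is not masked), and reformulate is used here as one possible alternative to the paste0/as.formula approach, not as the only way to do it. The point is that a formula looks up its variables in the data argument first and only then in the surrounding environment.

```r
# Same data as in the question, stored under a hypothetical name `df`
df <- data.frame(
  A = c("carr","bike","truck","carr","truck","bike","bike","carr",
        "truck","carr","truck","truck","carr","truck","truck"),
  B = c(10,20,30,23,45,56,78,44,10,20,30,10,20,30,67),
  D = c(1,2,3,1,2,3,2,3,2,3,2,2,3,2,1)
)

# Keep only the "carr" and "bike" rows
df_subset <- df[df$A %in% c("carr", "bike"), ]

# Full-length vectors, like the arguments passed to the original rtk()
lb <- df$B
ld <- df$D
inc_coll <- df$A

# df_subset has no columns called lb, ld or inc_coll, so aggregate()
# picks up the full-length vectors above and ignores the subset
leaky <- aggregate(lb ~ ld + inc_coll, df_subset, max)

# A formula built from column names is resolved inside df_subset
f <- reformulate(c("D", "A"), response = "B")   # B ~ D + A
clean <- aggregate(f, df_subset, max)

print(unique(leaky$inc_coll))   # includes "truck"
print(unique(clean$A))          # only "bike" and "carr"
```

If the formula refers to real column names, the data argument is what gets aggregated; if it refers to other objects, the data argument is effectively bypassed.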