
Summary statistics using ddply


I would like to write a function using ddply that outputs summary statistics for two columns of the data.frame mat, given the column names.

  • mat is a large data.frame with columns named "metric", "length", "species", "tree", ..., "index"

  • index is a factor with 2 levels, "Short" and "Long"

  • "metric", "length", "species", "tree" and the others are all continuous variables

Function:

summary1 <- function(arg1,arg2) {
    ...

    ss <- ddply(mat, .(index), function(X) data.frame(
        arg1 = as.list(summary(X$arg1)),
        arg2 = as.list(summary(X$arg2))),
        .parallel = FALSE)

    ss
}

I expect the output to look like this after calling summary1("metric","length"):

Short metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.

....

Long metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu. metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu. length.Max.

....

At the moment the function does not produce the desired output. What modification should be made?

Thanks for your help.


Here is a toy example

mat <- data.frame(
    metric = rpois(10,10), length = rpois(10,10), species = rpois(10,10),
    tree = rpois(10,10), index = c(rep("Short",5),rep("Long",5))
)

Solution

  • As Nick wrote in his answer, you can't use $ to reference a column whose name is passed as a character string. When you write X$arg1, R searches for a column literally named "arg1" in the data.frame X. You can reference the column with either X[, arg1] or X[[arg1]].

    If you want nicely named output, I propose the solution below:

    summary1 <- function(arg1, arg2) {
    
        ss <- ddply(mat, .(index), function(X) data.frame(
            setNames(
                list(as.list(summary(X[[arg1]])), as.list(summary(X[[arg2]]))),
                c(arg1,arg2)
                )), .parallel = FALSE)
    
        ss
    }
    summary1("metric","length")
    

    Output for toy data is:

      index metric.Min. metric.1st.Qu. metric.Median metric.Mean metric.3rd.Qu.
    1  Long           5              7            10         8.6             10
    2 Short           7              7             9         8.8             10
      metric.Max. length.Min. length.1st.Qu. length.Median length.Mean length.3rd.Qu.
    1          11           9             10            11        10.8             12
    2          11           4              9             9         9.0             11
      length.Max.
    1          12
    2          12
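
To make the difference between $ and [[ ]] concrete, here is a minimal base R sketch. The data frame df and the variable col are made up for illustration and are not part of the question's data.

```r
# Illustrative data frame; names and values are assumptions for this sketch.
df <- data.frame(metric = c(1, 2, 3), length = c(4, 5, 6))
col <- "metric"            # column name stored in a variable

by_dollar  <- df$col       # looks for a column literally named "col"
by_bracket <- df[[col]]    # evaluates col first, then extracts "metric"
by_matrix  <- df[, col]    # same values for a single column of a data.frame

print(is.null(by_dollar))                  # TRUE: there is no column "col"
print(identical(by_bracket, c(1, 2, 3)))   # TRUE
print(identical(by_matrix, by_bracket))    # TRUE
```

This is why summary(X$arg1) inside the function never reaches the intended column, while summary(X[[arg1]]) does.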