summarize data from csv using R


I'm new to R, and I wrote some code to summarize data from a .csv file according to my needs.

Here is the code:

raw <- read.csv("trees.csv")

The data looks like this:

                                     SNAME     CNAME        FAMILY PLOT INDIVIDUAL CAP   H
1 Alchornea triplinervia (Spreng.) M. Arg. Tainheiro Euphorbiaceae    5        176  15 9.5
2               Andira fraxinifolia Benth.   Angelim      Fabaceae    3        321  12 6.0
3               Andira fraxinifolia Benth.   Angelim      Fabaceae    3        326  14 7.0
4               Andira fraxinifolia Benth.   Angelim      Fabaceae    3        327  18 5.0
5               Andira fraxinifolia Benth.   Angelim      Fabaceae    3        328  12 6.0
6               Andira fraxinifolia Benth.   Angelim      Fabaceae    3        329  21 7.0

# add 2 other columns
for (i in 1:nrow(raw)) {
  raw$VOLUME[i] <- treeVolume(raw$CAP[i],raw$H[i])  
  raw$BASALAREA[i] <- treeBasalArea(raw$CAP[i])
}

# Here is the part I need help with: a new data frame with the means of columns H and CAP and the sums of columns VOLUME and BASALAREA, grouped by column SNAME and subgrouped by column PLOT.

plotSummary = merge(
  aggregate(raw$CAP ~ raw$SNAME * raw$PLOT, raw, mean),
  aggregate(raw$H ~ raw$SNAME * raw$PLOT, raw, mean))

plotSummary = merge(
  plotSummary,
  aggregate(raw$VOLUME ~ raw$SNAME * raw$PLOT, raw, sum))


plotSummary = merge(
  plotSummary,
  aggregate(raw$BASALAREA ~ raw$SNAME * raw$PLOT, raw, sum))

The functions treeVolume and treeBasalArea just return numbers.

treeVolume <- function(radius, height) {
  return (0.000074230*radius**1.707348*height**1.16873)
}

treeBasalArea <- function(radius) {
  return (((radius**2)*pi)/40000)
}

I'm sure that there is a better way of doing this, but how?


Solution

  • I can't manage to read your example data in, but I think I've made something that generally represents it, so give this a whirl. This answer builds on Greg's suggestion to look at plyr: ddply groups your data.frame by segments, and numcolwise calculates your statistics of interest for every numeric column.

    #Sample data
    set.seed(1)
    dat <- data.frame(sname = rep(letters[1:3],2), plot = rep(letters[1:3],2), 
                      CAP = rnorm(6), 
                      H = rlnorm(6), 
                      VOLUME = runif(6),
                      BASALAREA = rlnorm(6)
                      )
    
    
    #Calculate mean for all numeric columns, grouping by sname and plot
    library(plyr)
    ddply(dat, c("sname", "plot"), numcolwise(mean))
    #-----
      sname plot        CAP        H    VOLUME BASALAREA
    1     a    a  0.4844135 1.182481 0.3248043  1.614668
    2     b    b  0.2565755 3.313614 0.6279025  1.397490
    3     c    c -0.8280485 1.627634 0.1768697  2.538273
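
    If you would rather not load plyr, base R's aggregate can produce the same per-group means. This is only a sketch on the sample data above; the name means is just an illustrative choice.

    ```r
    # Same sample data as above
    set.seed(1)
    dat <- data.frame(sname = rep(letters[1:3], 2), plot = rep(letters[1:3], 2),
                      CAP = rnorm(6),
                      H = rlnorm(6),
                      VOLUME = runif(6),
                      BASALAREA = rlnorm(6))

    # "." on the left-hand side stands for every column not used for grouping
    means <- aggregate(. ~ sname + plot, data = dat, FUN = mean)
    means
    ```

    Note that this form applies one function to all columns, just like numcolwise(mean).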
    

    EDIT - response to updated question

    Ok - now that your question is more or less reproducible, here's how I'd approach it. First of all, you can take advantage of the fact that R is a vectorized language, meaning that you can calculate ALL of the values for VOLUME and BASALAREA in one pass, without looping through each row. For that bit, I recommend the transform function:

    dat <- transform(dat, VOLUME = treeVolume(CAP, H), BASALAREA = treeBasalArea(CAP))
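
    If you want to convince yourself that the vectorized call matches the row-by-row loop, here is a small check. It reuses the two functions from the question (written with ^, which R treats the same as **) on a made-up data frame; toy and its values are hypothetical, not taken from trees.csv.

    ```r
    # Functions from the question
    treeVolume <- function(radius, height) {
      0.000074230 * radius^1.707348 * height^1.16873
    }
    treeBasalArea <- function(radius) {
      (radius^2 * pi) / 40000
    }

    # Hypothetical toy data, only for illustration
    toy <- data.frame(CAP = c(15, 12, 14), H = c(9.5, 6, 7))

    # Vectorized: each function is called once on whole columns
    vec <- transform(toy, VOLUME = treeVolume(CAP, H), BASALAREA = treeBasalArea(CAP))

    # Row-by-row loop, in the style of the question
    loop <- toy
    loop$VOLUME <- NA_real_
    loop$BASALAREA <- NA_real_
    for (i in seq_len(nrow(loop))) {
      loop$VOLUME[i] <- treeVolume(loop$CAP[i], loop$H[i])
      loop$BASALAREA[i] <- treeBasalArea(loop$CAP[i])
    }

    # Compare the two results
    all.equal(vec, loop)
    ```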
    

    Secondly, realizing that you intend to calculate different statistics for CAP & H and then VOLUME & BASALAREA, I recommend using the summarize function, like this:

    ddply(dat, c("sname", "plot"), summarize,
      meanCAP = mean(CAP),
      meanH = mean(H),
      sumVOLUME = sum(VOLUME),
      sumBASAL = sum(BASALAREA)
      )
    

    Which will give you an output that looks like:

      sname plot   meanCAP     meanH    sumVOLUME     sumBASAL
    1     a    a 0.5868582 0.5032308 9.650184e-06 7.031954e-05
    2     b    b 0.2869029 0.4333862 9.219770e-06 1.407055e-05
    3     c    c 0.7356215 0.4028354 2.482775e-05 8.916350e-05
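
    A comparable result is possible in base R, which may feel closer to the merge/aggregate approach in the question. The following is a minimal sketch, assuming a made-up data frame toy with the same column names as dat (its values are hypothetical); means, sums and res are illustrative names.

    ```r
    # Hypothetical toy data with the same column names as dat
    toy <- data.frame(sname = c("a", "a", "b", "b"),
                      plot = c("p1", "p1", "p1", "p2"),
                      CAP = c(10, 20, 30, 40),
                      H = c(1, 3, 5, 7),
                      VOLUME = c(1, 2, 3, 4),
                      BASALAREA = c(0.1, 0.2, 0.3, 0.4))

    # cbind() on the left-hand side aggregates several columns in one call
    means <- aggregate(cbind(CAP, H) ~ sname + plot, data = toy, FUN = mean)
    sums <- aggregate(cbind(VOLUME, BASALAREA) ~ sname + plot, data = toy, FUN = sum)

    # merge() joins on the columns the two results share: sname and plot
    res <- merge(means, sums)
    res
    ```

    Referring to the columns by bare name and passing data = keeps the output column names clean, instead of names like raw$SNAME.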
    

    The help pages for ?ddply, ?transform, and ?summarize should be insightful.