Search code examples
raggregatetime-seriessummary

Summary of data for each year in R


I have a data with two columns. In one column it is date and in another column it is flow data.

I was able to read the data as date and flow data. I used the following code:

creek <- read.csv("creek.csv")
library(ggplot2)
creek[1:10,]
colnames(creek) <- c("date","flow")
creek$date <- as.Date(creek$date, "%m/%d/%Y")

The link to my data is https://www.dropbox.com/s/eqpena3nk82x67e/creek.csv

Now, I want to find the summary of each year. I want to especially know mean, median, maximum etc.

Thanks.

Regards, Jdbaba


Solution

  • Base R

    Here are two methods from base R.

    The first uses cut, split and lapply along with summary.

    creekFlowSummary <- lapply(split(creek, cut(creek$date, "1 year")), 
                               function(x) summary(x[2]))
    

    This creates a list. You can view the summaries of different years by accessing the corresponding list index or name.

    creekFlowSummary[1]
    # $`1999-01-01`
    #       flow       
    #  Min.   :0.3187  
    #  1st Qu.:0.3965  
    #  Median :0.4769  
    #  Mean   :0.6366  
    #  3rd Qu.:0.5885  
    #  Max.   :7.2560  
    # 
    creekFlowSummary["2000-01-01"]
    # $`2000-01-01`
    #       flow       
    #  Min.   :0.1370  
    #  1st Qu.:0.1675  
    #  Median :0.2081  
    #  Mean   :0.2819  
    #  3rd Qu.:0.2837  
    #  Max.   :2.3800  
    

    The second uses aggregate:

    aggregate(flow ~ cut(date, "1 year"), creek, summary)
    #    cut(date, "1 year") flow.Min. flow.1st Qu. flow.Median flow.Mean flow.3rd Qu. flow.Max.
    # 1           1999-01-01    0.3187       0.3965      0.4770    0.6366       0.5885    7.2560
    # 2           2000-01-01    0.1370       0.1675      0.2081    0.2819       0.2837    2.3800
    # 3           2001-01-01    0.1769       0.2062      0.2226    0.2950       0.2574    2.9220
    # 4           2002-01-01    0.1279       0.1781      0.2119    0.5346       0.4966   14.3900
    # 5           2003-01-01    0.3492       0.4761      0.7173    1.0350       1.0840   10.1500
    # 6           2004-01-01    0.4178       0.5379      0.6524    0.9691       0.9020   11.7100
    # 7           2005-01-01    0.4722       0.6094      0.7279    1.2340       1.0900   17.7200
    # 8           2006-01-01    0.2651       0.3275      0.4282    0.5459       0.5758    3.3510
    # 9           2007-01-01    0.2784       0.3557      0.4041    0.6331       0.6125    9.6290
    # 10          2008-01-01    0.4131       0.5430      0.6477    0.8792       0.9540    4.5960
    # 11          2009-01-01    0.3877       0.4572      0.5945    0.8465       0.8309    6.3830
    

    Be careful with the aggregate solution though: All of the summary information is a single matrix. View str on the output to see what I mean.

    xts

    There are, of course other ways to do this. One way is to use the xts package.

    First, convert your data to xts:

    library(xts)
    creekx <- xts(creek$flow, order.by=creek$date)
    

    Then, use apply.yearly and whatever functions you are interested in.

    Here is the yearly mean:

    apply.yearly(creekx, mean)
    #                 [,1]
    # 1999-12-31 0.6365604
    # 2000-12-31 0.2819057
    # 2001-12-31 0.2950348
    # 2002-12-31 0.5345666
    # 2003-12-31 1.0351742
    # 2004-12-31 0.9691180
    # 2005-12-31 1.2338066
    # 2006-12-31 0.5458652
    # 2007-12-31 0.6331271
    # 2008-12-31 0.8792396
    # 2009-09-30 0.8465300
    

    And the yearly maximum:

    apply.yearly(creekx, max)
    #              [,1]
    # 1999-12-31  7.256
    # 2000-12-31  2.380
    # 2001-12-31  2.922
    # 2002-12-31 14.390
    # 2003-12-31 10.150
    # 2004-12-31 11.710
    # 2005-12-31 17.720
    # 2006-12-31  3.351
    # 2007-12-31  9.629
    # 2008-12-31  4.596
    # 2009-09-30  6.383
    

    Or, put them together like this: apply.yearly(creekx, function(x) cbind(mean(x), sum(x), max(x)))

    data.table

    The data.table package may also be of interest for you, particularly if you are dealing with a lot of data. Here's a data.table approach. The key is to use as.IDate on your "date" column while you are reading your data in:

    library(data.table)
    DT <- data.table(date = as.IDate(creek$date), creek[-1])
    DT[, list(mean = mean(flow),
              tot = sum(flow),
              max = max(flow)), 
       by = year(date)]
    #     year      mean      tot    max
    #  1: 1999 0.6365604 104.3959  7.256
    #  2: 2000 0.2819057 103.1775  2.380
    #  3: 2001 0.2950348 107.6877  2.922
    #  4: 2002 0.5345666 195.1168 14.390
    #  5: 2003 1.0351742 377.8386 10.150
    #  6: 2004 0.9691180 354.6972 11.710
    #  7: 2005 1.2338066 450.3394 17.720
    #  8: 2006 0.5458652 199.2408  3.351
    #  9: 2007 0.6331271 231.0914  9.629
    # 10: 2008 0.8792396 321.8017  4.596
    # 11: 2009 0.8465300 231.1027  6.383