
R: Applying cumulative sum function and filling data gaps with NA for plotting


I have a dataframe which looks like this, and I am trying to calculate the cumulative sum of the column VALUE. The input file can also be found here: https://dl.dropboxusercontent.com/u/16277659/input.csv

df <- read.csv("input.csv", sep=";", header=TRUE)

NAME;       ID; SURVEY_YEAR; REFERENCE_YEAR; VALUE
SAMPLE1;    253;    1880;   1879;           14
SAMPLE1;    253;    1881;   1880;           -10
SAMPLE1;    253;    1882;   1881;           4
SAMPLE1;    253;    1883;   1882;           10
SAMPLE1;    253;    1884;   1883;           10
SAMPLE1;    253;    1885;   1884;           12
SAMPLE1;    253;    1889;   1888;           11
SAMPLE1;    253;    1890;   1889;           12
SAMPLE1;    253;    1911;   1910;          -16
SAMPLE1;    253;    1913;   1911;          -11
SAMPLE1;    253;    1914;   1913;          -8
SAMPLE2;    261;    1992;   1991;          -19
SAMPLE2;    261;    1994;   1992;          -58
SAMPLE2;    261;    1995;   1994;          -40
SAMPLE2;    261;    1996;   1995;          -21
SAMPLE2;    261;    1997;   1996;          -50
SAMPLE2;    261;    1998;   1997;          -60
SAMPLE2;    261;    2005;   2004;          -34
SAMPLE2;    261;    2006;   2005;          -23
SAMPLE2;    261;    2007;   2006;          -19
SAMPLE2;    261;    2008;   2007;          -29
SAMPLE2;    261;    2009;   2008;          -89
SAMPLE2;    261;    2013;   2009;          -14
SAMPLE2;    261;    2014;   2013;          -16

The end products I am aiming for are plots, one for each SAMPLE, with SURVEY_YEAR on the x axis and the cumulative sum CUMSUM of VALUE (calculated later) on the y axis. My code so far to sort out the data:

# Filter out all groups with fewer than 3 measurements
# (does nothing in this example, but is important with the rest of my data)
df <- read.csv("input.csv", sep=";", header=TRUE)
rowsn <- with(df, by(VALUE, ID, function(xx) sum(!is.na(xx))))
keep <- names(which(rowsn >= 3))   # IDs with at least 3 non-NA values
dat <- df[df$ID %in% keep, ]

# Insert a new row at the start of each group (split by ID) as the
# starting point for the cumsum function (VALUE = 0)
dat <- do.call(rbind, lapply(split(dat, dat$ID), function(x) {
  x <- rbind(x[1, ], x); x[1, "VALUE"] <- 0
  x[1, "SURVEY_YEAR"] <- x[1, "SURVEY_YEAR"] - 1; x
}))
rownames(dat) <- seq_len(nrow(dat))

# write dat to csv file for inspection
write.table(dat, "dat.csv", sep=";", row.names=FALSE)
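
(For reference, the same preprocessing can also be written with dplyr -- a sketch that should produce the same dat, assuming dplyr >= 0.8.1 for group_modify():)

library(dplyr)
dat <- read.csv("input.csv", sep=";", header=TRUE) %>%
  group_by(ID) %>%
  filter(sum(!is.na(VALUE)) >= 3) %>%   # keep groups with >= 3 measurements
  group_modify(~ bind_rows(
    mutate(.x[1, ], VALUE = 0, SURVEY_YEAR = SURVEY_YEAR - 1),  # zero starting row
    .x)) %>%
  ungroup()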

This results in the following dataframe, which is the starting point for calculating the cumulative sum of the column VALUE.

NAME;   ID; SURVEY_YEAR;    REFERENCE_YEAR; VALUE
SAMPLE1;    253;    1879;   1879;             0
SAMPLE1;    253;    1880;   1879;            14
SAMPLE1;    253;    1881;   1880;           -10
SAMPLE1;    253;    1882;   1881;             4
SAMPLE1;    253;    1883;   1882;            10
SAMPLE1;    253;    1884;   1883;            10
SAMPLE1;    253;    1885;   1884;            12
SAMPLE1;    253;    1889;   1888;            11
SAMPLE1;    253;    1890;   1889;            12
SAMPLE1;    253;    1911;   1910;           -16
SAMPLE1;    253;    1913;   1911;           -11
SAMPLE1;    253;    1914;   1913;            -8
SAMPLE2;    261;    1991;   1991;             0
SAMPLE2;    261;    1992;   1991;           -19
SAMPLE2;    261;    1994;   1992;           -58
SAMPLE2;    261;    1995;   1994;           -40
SAMPLE2;    261;    1996;   1995;           -21
SAMPLE2;    261;    1997;   1996;           -50
SAMPLE2;    261;    1998;   1997;           -60
SAMPLE2;    261;    2005;   2004;           -34
SAMPLE2;    261;    2006;   2005;           -23
SAMPLE2;    261;    2007;   2006;           -19
SAMPLE2;    261;    2008;   2007;           -29
SAMPLE2;    261;    2009;   2008;           -89
SAMPLE2;    261;    2013;   2009;           -14
SAMPLE2;    261;    2014;   2013;           -16

The problem now is that I would like to calculate the cumulative sum of the column VALUE for each year. As you can see, there are gaps between certain years (for example in SAMPLE1 between 1890 and 1911, and in SAMPLE2 between 1998 and 2005), and I would like to fill each year in between with NA values so that I can plot with plot type='b' (points and lines) without the segments on either side of a gap being connected. Importantly, if there are multiple consecutive NA values, the last NA in the CUMSUM column should be replaced with the last numerical value before the gap.
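
One way I can imagine filling the gaps (a sketch in base R; it only inserts the NA rows, does not yet handle the multi-year periods described next, and leaves NAME and ID as NA in the new rows, so those would need re-filling):

filled <- do.call(rbind, lapply(split(dat, dat$ID), function(x) {
  all_years <- data.frame(SURVEY_YEAR = seq(min(x$SURVEY_YEAR), max(x$SURVEY_YEAR)))
  merge(all_years, x, by = "SURVEY_YEAR", all.x = TRUE)   # left join: missing years get NA
}))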

The normal case is that the difference between REFERENCE_YEAR and SURVEY_YEAR equals 1 (e.g. in SAMPLE1 from 1880 to 1881), but in some cases there are longer periods between REFERENCE_YEAR and SURVEY_YEAR (e.g. in SAMPLE1 from 1911 to 1913 and in SAMPLE2 from 2009 to 2013). In that case the cumulative sum should only be applied once, and the value should stay the same over the whole period (in the plot this results in a straight, connected line segment).
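
To illustrate the multi-year case: each measured row could be expanded over the years it covers, repeating VALUE and the once-applied cumulative sum (expand_period is a hypothetical helper for one group, not existing code; the NA gap years would still come from the merge sketch above):

# expand each row of one group over (REFERENCE_YEAR + 1):SURVEY_YEAR,
# repeating VALUE and the once-applied cumulative sum
expand_period <- function(x) {
  x$CUMSUM <- cumsum(x$VALUE)
  do.call(rbind, lapply(seq_len(nrow(x)), function(i) {
    yrs <- seq(min(x$REFERENCE_YEAR[i] + 1, x$SURVEY_YEAR[i]), x$SURVEY_YEAR[i])
    data.frame(NAME = x$NAME[i], ID = x$ID[i],
               SURVEY_YEAR = yrs, REFERENCE_YEAR = pmax(yrs - 1, x$REFERENCE_YEAR[i]),
               VALUE = x$VALUE[i], CUMSUM = x$CUMSUM[i])
  }))
}
# e.g. expand_period(split(dat, dat$ID)[["253"]])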

It's difficult to explain everything in detail, so maybe it's easier if I provide an example of what the result should look like:

NAME;       ID; SURVEY_YEAR;    REFERENCE_YEAR; VALUE;  CUMSUM
SAMPLE1;    253;    1879;       1879;            0;     0
SAMPLE1;    253;    1880;       1879;           14;     14
SAMPLE1;    253;    1881;       1880;          -10;     4
SAMPLE1;    253;    1882;       1881;            4;     8
SAMPLE1;    253;    1883;       1882;           10;     18
SAMPLE1;    253;    1884;       1883;           10;     28
SAMPLE1;    253;    1885;       1884;           12;     40
SAMPLE1;    253;    1886;       1885;           NA;     NA
SAMPLE1;    253;    1887;       1886;           NA;     NA
SAMPLE1;    253;    1888;       1887;           NA;     40
SAMPLE1;    253;    1889;       1888;           11;     51
SAMPLE1;    253;    1890;       1889;           12;     63
SAMPLE1;    253;    1891;       1890;           NA;     NA
SAMPLE1;    253;    1892;       1891;           NA;     NA
SAMPLE1;    253;    1893;       1892;           NA;     NA
SAMPLE1;    253;    1894;       1893;           NA;     NA
SAMPLE1;    253;    1895;       1894;           NA;     NA
SAMPLE1;    253;    1896;       1895;           NA;     NA
SAMPLE1;    253;    1897;       1896;           NA;     NA
SAMPLE1;    253;    1898;       1897;           NA;     NA
SAMPLE1;    253;    1899;       1898;           NA;     NA
SAMPLE1;    253;    1900;       1899;           NA;     NA
SAMPLE1;    253;    1901;       1900;           NA;     NA
SAMPLE1;    253;    1902;       1901;           NA;     NA
SAMPLE1;    253;    1903;       1902;           NA;     NA
SAMPLE1;    253;    1904;       1903;           NA;     NA
SAMPLE1;    253;    1905;       1904;           NA;     NA
SAMPLE1;    253;    1906;       1905;           NA;     NA
SAMPLE1;    253;    1907;       1906;           NA;     NA
SAMPLE1;    253;    1908;       1907;           NA;     NA
SAMPLE1;    253;    1909;       1908;           NA;     NA
SAMPLE1;    253;    1910;       1909;           NA;     63
SAMPLE1;    253;    1911;       1910;          -16;     47
SAMPLE1;    253;    1912;       1911;          -11;     36
SAMPLE1;    253;    1913;       1912;          -11;     36
SAMPLE1;    253;    1914;       1913;           -8;     28
SAMPLE2;    261;    1991;       1991;            0;     0
SAMPLE2;    261;    1992;       1991;          -19;     -19
SAMPLE2;    261;    1993;       1992;          -58;     -77
SAMPLE2;    261;    1994;       1993;          -58;     -135
SAMPLE2;    261;    1995;       1994;          -40;     -175
SAMPLE2;    261;    1996;       1995;          -21;     -196
SAMPLE2;    261;    1997;       1996;          -50;     -246
SAMPLE2;    261;    1998;       1997;          -60;     -306
SAMPLE2;    261;    1999;       1998;           NA;     NA
SAMPLE2;    261;    2000;       1999;           NA;     NA
SAMPLE2;    261;    2001;       2000;           NA;     NA
SAMPLE2;    261;    2002;       2001;           NA;     NA
SAMPLE2;    261;    2003;       2002;           NA;     NA
SAMPLE2;    261;    2004;       2003;           NA;     -306
SAMPLE2;    261;    2005;       2004;          -34;     -340
SAMPLE2;    261;    2006;       2005;          -23;     -363
SAMPLE2;    261;    2007;       2006;          -19;     -382
SAMPLE2;    261;    2008;       2007;          -29;     -411
SAMPLE2;    261;    2009;       2008;          -89;     -500
SAMPLE2;    261;    2010;       2009;          -14;     -514
SAMPLE2;    261;    2011;       2010;          -14;     -514
SAMPLE2;    261;    2012;       2011;          -14;     -514
SAMPLE2;    261;    2013;       2012;          -14;     -514
SAMPLE2;    261;    2014;       2013;          -16;     -530

Help with this rather complicated case would be very much appreciated! Thank you!


Solution

  • BIG EDIT: Posted code, added correct library calls

    library(dplyr)
    df = read.csv("input.csv", sep=";", stringsAsFactors=FALSE)
    
    #find min/max year for each SAMPLE
    df_minmax = df %>%
      group_by(NAME) %>%
      summarise(min_year = min(SURVEY_YEAR),
                max_year = max(SURVEY_YEAR))
    
    #create an empty dataframe with what we want
    df2 = data.frame(NAME = "", 
                     ID = 0, 
                     SURVEY_YEAR = min(df$SURVEY_YEAR):max(df$SURVEY_YEAR), 
                     REFERENCE_YEAR = (min(df$SURVEY_YEAR):max(df$SURVEY_YEAR)) - 1,
                     VALUE = NA, stringsAsFactors=FALSE)
    
    #fill in the NAME column - there's probably a better way to do this
    for(i in 1:nrow(df_minmax)) {
      min_year = df_minmax[i, ]$min_year
      max_year = df_minmax[i, ]$max_year
    
      df2[df2$SURVEY_YEAR %in% min_year:max_year, ]$NAME = df_minmax[i, ]$NAME
    }
    
    #fill in the values
    #this line is a bit dangerous -- it relies on the fact that df and df2 have the same relative ordering
    #don't change the ordering of df or df2 before this line.
    df2[df2$SURVEY_YEAR %in% df$SURVEY_YEAR, ]$VALUE = df$VALUE
    
    #in this example there is a long run of years between SAMPLE1 and SAMPLE2; we can filter those out
    df2 = df2 %>% filter(NAME != "")
    
    #Now we can do all the cumulative stuff
    #for purposes of cumulative sums, set NA to 0
    temp = df2$VALUE
    df2[is.na(df2)] = 0
    df2 = df2 %>% group_by(NAME) %>% mutate(csum = cumsum(VALUE))
    
    #get back the NA values -- in case the NA values are useful to you
    df2$VALUE = temp
    

    Here's `head(df2, 20)`:

          NAME ID SURVEY_YEAR REFERENCE_YEAR VALUE csum
    1  SAMPLE1  0        1880           1879    14   14
    2  SAMPLE1  0        1881           1880   -10    4
    3  SAMPLE1  0        1882           1881     4    8
    4  SAMPLE1  0        1883           1882    10   18
    5  SAMPLE1  0        1884           1883    10   28
    6  SAMPLE1  0        1885           1884    12   40
    7  SAMPLE1  0        1886           1885    NA   40
    8  SAMPLE1  0        1887           1886    NA   40
    9  SAMPLE1  0        1888           1887    NA   40
    10 SAMPLE1  0        1889           1888    11   51
    11 SAMPLE1  0        1890           1889    12   63
    12 SAMPLE1  0        1891           1890    NA   63
    13 SAMPLE1  0        1892           1891    NA   63
    14 SAMPLE1  0        1893           1892    NA   63
    15 SAMPLE1  0        1894           1893    NA   63
    16 SAMPLE1  0        1895           1894    NA   63
    17 SAMPLE1  0        1896           1895    NA   63
    18 SAMPLE1  0        1897           1896    NA   63
    19 SAMPLE1  0        1898           1897    NA   63
    20 SAMPLE1  0        1899           1898    NA   63
    

    Here's the outline of the steps I did above as a quick summary (a more compact dplyr/tidyr version of the same steps is sketched after the list):

    1. Find the min/max year for each group in NAME.
    2. Create an empty dataframe that has the total range of all the years we want.
    3. Fill in the NAMEs in the correct places in the new empty dataframe.
    4. Fill in the VALUEs in the correct places in the new empty dataframe.
    5. Set NAs to 0 for purposes of cumulative sums.
    6. Find cumulative sums by group.
    7. Replace the 0s back with NAs.
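
    For completeness, here is that compact version -- a sketch, assuming tidyr is installed: complete() and full_seq() do the year expansion of steps 1-4, and coalesce() stands in for the NA-to-0 round trip of steps 5 and 7. Note that REFERENCE_YEAR stays NA in the inserted rows, unlike df2 above.

    library(dplyr)
    library(tidyr)

    df2_alt = df %>%
      group_by(NAME, ID) %>%
      complete(SURVEY_YEAR = full_seq(SURVEY_YEAR, 1)) %>%  #insert NA rows for missing years
      arrange(SURVEY_YEAR, .by_group = TRUE) %>%
      mutate(csum = cumsum(coalesce(VALUE, 0))) %>%         #treat NA as 0, keep VALUE's NAs
      ungroup()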

    It's a bit hackish with the for loop above. I'm hoping no one strings me up for it.
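
    Once df2 has the csum column, the per-sample plots asked for in the question can be drawn with base graphics. A sketch (csum_plot is a helper column introduced here; blanking out csum in the gap years makes type = "b" leave the gaps unconnected):

    #blank out csum where VALUE is NA so the gaps are not connected
    df2$csum_plot = ifelse(is.na(df2$VALUE), NA, df2$csum)

    #one plot per sample
    for(nm in unique(df2$NAME)) {
      d = df2[df2$NAME == nm, ]
      plot(d$SURVEY_YEAR, d$csum_plot, type = "b",
           xlab = "SURVEY_YEAR", ylab = "CUMSUM", main = nm)
    }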