Search code examples
rplyrseq

creating index conditioned on value in other column; differences over time


I am struggling with the following problem: The dataframe below contains the development of a value over time for various ids. What i try to get is the increase/decrease of these values based on a the value in a year when event occurred. Several events can occur within one id, so a new event becomes the new baseline year for the id. To make things clearer, I also add the outcome I want below

What i have

id  value   year    event
a   100     1950    NA
a   101     1951    NA
a   102     1952    NA
a   103     1953    NA
a   104     1954    NA
a   105     1955    X
a   106     1956    NA
a   107     1957    NA
a   108     1958    NA
a   107     1959    Y
a   106     1960    NA
a   105     1961    NA
a   104.8   1962    NA
a   104.2   1963    NA
b   70      1970    NA
b   75      1971    NA
b   80      1972    NA
b   85      1973    NA
b   90      1974    NA
b   60      1975    Z
b   59      1976    NA
b   58      1977    NA
b   57      1978    NA
b   56      1979    NA
b   55      1980    W
b   54      1981    NA
b   53      1982    NA
b   52      1983    NA
b   51      1984    NA

What I am looking for

id  value   year    event   index   growth
a   100     1950    NA        0 
a   101     1951    NA        0 
a   102     1952    NA        0 
a   103     1953    NA        0 
a   104     1954    NA        0 
a   105     1955    X         1      1
a   106     1956    NA        2      1.00952381
a   107     1957    NA        3      1.019047619
a   108     1958    NA        4      1.028571429
a   107     1959    Y         1      1                  #new baseline year
a   106     1960    NA        2      0.990654206
a   105     1961    NA        3      0.981308411
a   104.8   1962    NA        4      0.979439252
a   104.2   1963    NA        5      0.973831776
b   70      1970    NA        6 
b   75      1971    NA        7 
b   80      1972    NA        8 
b   85      1973    NA        9 
b   90      1974    NA       10 
b   60      1975    Z         1      1
b   59      1976    NA        2      0.983333333
b   58      1977    NA        3      0.966666667
b   57      1978    NA        4      0.95
b   56      1979    NA        5      0.933333333
b   55      1980    W         1      1                #new baseline year
b   54      1981    NA        2      0.981818182
b   53      1982    NA        3      0.963636364
b   52      1983    NA        4      0.945454545
b   51      1984    NA        5      0.927272727

What I tried

This and this post were quite helpful and I managed to create differences between the years, however, I fail to reset the base year (index) when there is a new event. Furthermore, I am doubtful whether my approach is indeed the most efficient/elegant one. Seems a bit clumsy to me...

x <- ddply(x, .(id), transform, year.min=min(year[!is.na(event)]))  #identifies first event year
x1 <- ddply(x[x$year>=x$year.min,], .(id), transform, index=seq_along(id)) #creates counter years following first event; prior years are removed
x1 <- x1[order(x1$id, x1$year),] #sort 
x1 <- ddply(x1, .(id), transform, growth=100*(value/value[1])) #calculate difference, however, based on first event year; this is wrong.

library(Interact)  #i then merge the df with the years prior to first event which have been removed in the begining
x$id.year <- interaction(x$id,x$year)
x1$id.year <- interaction(x1$id,x1$year)
x$index <- x$growth <- NA
y <- rbind(x[x$year<x$year.min,],x1)
y <- y[order(y$id,y$year),]

Many thanks for any advice.


Solution

  • # Create a tag to indicate the start of each new event by id or
    # when id changes
    dat$tag <- with(dat, ave(as.character(event), as.character(id), 
                                        FUN=function(i) cumsum(!is.na(i))))
    
    # Calculate the growth by id and tag
    # this will also produce results for each id before an event has happened
    dat$growth <- with(dat, ave(value, tag, id,  FUN=function(i)  i/i[1] ))
    
    # remove growth prior to an event (this will be when tag equals zero as no 
    # event have occurred)
    dat$growth[dat$tag==0] <- NA