Search code examples
rdataframebinning

how to bin computed data by percentile in an R data frame


I have a data frame that contains a large number of symbols, dates, and values

date         symbol value
2014-01-03     A      2.5
2014-01-04     A      3.1
2014-01-06     A      4.5
2014-01-03     B      2.6
2014-01-05     B      3.2
2014-01-06     B      4.3

I want to split the data by symbol, compute a percentage change for the 2 most recent dates, and bin the data by some variable number of groups where the 1st group has the largest set of pct. change, the next has the 2nd largest and so on. Each group needs to have approximately the same number of symbols.

Ideally, I would like my new data frame to look something like this

date         symbol value       pctchg     bin
2014-01-03     A      2.5       .45161      1
2014-01-04     A      3.1       .45161      1
2014-01-06     A      4.5       .45161      1
2014-01-03     B      2.6       .34375      2
2014-01-05     B      3.2       .34375      2
2014-01-06     B      4.3       .34375      2

This seems like a perfect task for ddply, but I'm struggling to get something to work. Any suggestions would be very much appreciated. Thank you for your time and help.


Solution

  • I'm not an experienced coder, but I'll field this candidate:

    df <- read.table(sep=" ", header=T, text="
    date symbol value
    2014-01-03 A 2.5
    2014-01-04 A 3.1
    2014-01-06 A 4.5
    2014-01-03 B 2.6
    2014-01-05 B 3.2
    2014-01-06 B 4.3")
    
    library(plyr)
    df <- df[order(df$symbol, df$date),]
    df <- ddply(df, "symbol", transform, pctchg=value[length(value)]/value[length(value)-1]-1)
    df <- df[order(-df$pctchg),]
    
    bins <- 2
    
    library(ggplot2)
    groups <- cut_number(1:length(unique(df$pctchg)), n=bins)
    levels(groups) <- 1:length(levels(groups))
    df <- merge(x=df, y=cbind.data.frame(symbol=unique(df$symbol), bin=groups))
    df[order(-df$pctchg),]
    #   symbol       date value    pctchg bin
    # 1      A 2014-01-03   2.5 0.4516129   1
    # 2      A 2014-01-04   3.1 0.4516129   1
    # 3      A 2014-01-06   4.5 0.4516129   1
    # 4      B 2014-01-03   2.6 0.3437500   2
    # 5      B 2014-01-05   3.2 0.3437500   2
    # 6      B 2014-01-06   4.3 0.3437500   2