Discretizing score relative to mean


I have data with a date, zip code and score. I would like to discretize the data so that every row whose score is above the mean for its month and zip code gets a 1, and all other rows get a 0.

For example (the data frame is called score_df):

date       zip      score
2014-01-02 12345    10
2014-01-03 12345    20
2014-01-04 12345    2
2014-01-05 99885    15
2014-01-06 99885    12

Desired output:

date       zip      score    above_avg
2014-01-02 12345    10       0
2014-01-03 12345    20       1
2014-01-04 12345    2        0
2014-01-05 99885    15       1
2014-01-06 99885    12       0

So far I have been using inefficient solutions:

1. Looping through all months and applying the binary condition with an ifelse statement:

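score_df$above_avg <- rep(0, length(score_df$score))
for (month in 1:12) {
  # rows whose date string falls in this month (groups by month only, not by zip)
  in_month <- as.numeric(substring(score_df$date, 6, 7)) == month
  # quantile(x, 0.5) is the median, so this compares against the median, not the mean
  score_df$above_avg <- ifelse(in_month, ifelse(score_df$score > quantile(score_df$score[in_month], 0.5), 1, 0), score_df$above_avg)
}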

2. I also tried generating a table of averages with aggregate, joining the average column back to the original data frame, and then applying a binary condition:

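library(sqldf)
score_df$month <- substring(score_df$date, 6, 7)  # month column needed by aggregate and the join
avg_by_month_zip <- aggregate(score ~ month + zip, data = score_df, FUN = mean)
score_df <- sqldf("select score_df.*, avg_by_month_zip.score as mean_score from score_df join avg_by_month_zip on avg_by_month_zip.zip = score_df.zip and avg_by_month_zip.month = score_df.month")
score_df$discrete <- ifelse(score_df$score > score_df$mean_score, 1, 0)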

I would like to do this functionally. I know how to do it functionally with one condition (just date or just zip) but not with two. I could concatenate the two fields to make one unique field. That would be a quick fix, but I was wondering if there is a way to do this simply and efficiently with an apply function or plyr.


Solution

  • Assuming your date values are properly encoded as Date objects, for example

    score_df <- structure(list(date = structure(c(16072, 16073, 16074, 16075, 
    16076), class = "Date"), zip = c(12345L, 12345L, 12345L, 99885L, 
    99885L), score = c(10L, 20L, 2L, 15L, 12L)), .Names = c("date", 
    "zip", "score"), row.names = c(NA, -5L), class = "data.frame")
    

    then you can do

    with(score_df, ave(score, strftime(date, "%m"), zip, 
        FUN=function(x) ifelse(x>mean(x), 1, 0)))
    # [1] 0 1 0 1 0
    

    ave() applies the function within each month/zip combination and returns a vector aligned with the original rows; strftime(date, "%m") extracts the month from the date.
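
As a follow-up sketch, the same ave() call can store its result as a column of the data frame. Two things here are assumptions rather than part of the answer above: the grouping key is "%Y-%m" instead of "%m", which keeps the same month of different years apart if the data spans more than one year, and as.integer(x > mean(x)) stands in for the ifelse() call. The sample data is rebuilt so the block runs on its own.

```r
# Same sample data as in the question, rebuilt so the block is self-contained
score_df <- data.frame(
  date  = as.Date(c("2014-01-02", "2014-01-03", "2014-01-04",
                    "2014-01-05", "2014-01-06")),
  zip   = c(12345L, 12345L, 12345L, 99885L, 99885L),
  score = c(10L, 20L, 2L, 15L, 12L)
)

# Group key: year and month together (assumption: data may span several years)
year_month <- format(score_df$date, "%Y-%m")

# ave() applies FUN within each year_month/zip group and returns a vector
# in the original row order; a logical comparison converts directly to 0/1
score_df$above_avg <- ave(score_df$score, year_month, score_df$zip,
                          FUN = function(x) as.integer(x > mean(x)))

print(score_df$above_avg)
# [1] 0 1 0 1 0
```

Because ave() accepts any number of grouping vectors, there is no need to paste the month and zip together into a single key.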