Search code examples
rdistributionfrequencycut

Number of active items in bins


I have a list of items with 2 dates (start date and end date) and duration in days (end date - start date). I want to cut them into bins to show the number of "active items" in each bin, i.e. if start date <= bin date and end date > bin date, the item should be counted in the bin.

Item        StartDate     EndDate       Duration
Machine1    2005/01/21    2011/03/29    2258
Machine2    2004/05/12    2012/05/08    2918
Machine3    2004/10/15    2005/09/10    330
Machine4    2004/08/30    2011/08/02    2528
Machine5    2005/06/06    2010/12/03    2006
Machine6    2004/05/11    2007/03/17    1040
Machine7    2005/08/09    2011/05/30    2120
Machine8    2005/01/06    2012/06/07    2709
Machine9    2005/06/13    2008/08/28    1172
Machine10   2005/06/28    2010/04/08    1745
Machine11   2004/11/09    2007/05/14    916
Machine12   2005/05/26    2012/09/16    2670
Machine13   2004/05/28    2009/06/09    1838
Machine14   2005/01/06    2012/05/25    2696
Machine15   2005/08/20    2012/02/11    2366
Machine16   2004/08/02    2011/10/23    2638
Machine17   2004/08/10    2009/03/15    1678
Machine18   2005/05/08    2006/04/17    344
Machine19   2005/08/26    2006/07/24    332
Machine20   2004/03/30    2006/05/07    768

Bin counts that I want to produce:

2004/01/01  0
2005/01/01  9
2006/01/01  19
2007/01/01  16
2008/01/01  14
2009/01/01  13
2010/01/01  11
2011/01/01  9
2012/01/01  5
2013/01/01  0

As you can see, the totals of the bins do not add up to the total number of items, as you would expect with a traditional histogram.

I can do this with some verbose code, but I'm sure there must be some short way, using cut or split. I'm aware that the bin labels are off by one according to my definition above, but let's ignore that for now.


Solution

  • A way is:

    #turn dates to actual dates
    DF$StartDate <- as.Date(DF$StartDate, "%Y/%m/%d")
    DF$EndDate <- as.Date(DF$EndDate, "%Y/%m/%d")
    binDF[,1] <- as.Date(binDF[,1], "%Y/%m/%d")
    
    counts <- colSums(sapply(binDF[,1], function(x) {DF$StartDate <= x & DF$EndDate > x}))
    #> counts
    #[1]  0  9 19 16 14 13 11  9  5  0
    

    And as a complete dataframe:

    resDF <- data.frame(dates = binDF[,1], counts = counts, stringsAsFactors = F)
    #> resDF
    #        dates counts
    #1  2004-01-01      0
    #2  2005-01-01      9
    #3  2006-01-01     19
    #4  2007-01-01     16
    #5  2008-01-01     14
    #6  2009-01-01     13
    #7  2010-01-01     11
    #8  2011-01-01      9
    #9  2012-01-01      5
    #10 2013-01-01      0
    

    The dataframes DF and binDF:

    DF <- structure(list(Item = c("Machine1", "Machine2", "Machine3", "Machine4", 
    "Machine5", "Machine6", "Machine7", "Machine8", "Machine9", "Machine10", 
    "Machine11", "Machine12", "Machine13", "Machine14", "Machine15", 
    "Machine16", "Machine17", "Machine18", "Machine19", "Machine20"
    ), StartDate = c("2005/01/21", "2004/05/12", "2004/10/15", "2004/08/30", 
    "2005/06/06", "2004/05/11", "2005/08/09", "2005/01/06", "2005/06/13", 
    "2005/06/28", "2004/11/09", "2005/05/26", "2004/05/28", "2005/01/06", 
    "2005/08/20", "2004/08/02", "2004/08/10", "2005/05/08", "2005/08/26", 
    "2004/03/30"), EndDate = c("2011/03/29", "2012/05/08", "2005/09/10", 
    "2011/08/02", "2010/12/03", "2007/03/17", "2011/05/30", "2012/06/07", 
    "2008/08/28", "2010/04/08", "2007/05/14", "2012/09/16", "2009/06/09", 
    "2012/05/25", "2012/02/11", "2011/10/23", "2009/03/15", "2006/04/17", 
    "2006/07/24", "2006/05/07"), Duration = c(2258L, 2918L, 330L, 
    2528L, 2006L, 1040L, 2120L, 2709L, 1172L, 1745L, 916L, 2670L, 
    1838L, 2696L, 2366L, 2638L, 1678L, 344L, 332L, 768L)), .Names = c("Item", 
    "StartDate", "EndDate", "Duration"), class = "data.frame", row.names = c(NA, 
    -20L))
    
    binDF <- structure(list(V1 = c("2004/01/01", "2005/01/01", "2006/01/01", 
    "2007/01/01", "2008/01/01", "2009/01/01", "2010/01/01", "2011/01/01", 
    "2012/01/01", "2013/01/01"), V2 = c(0L, 9L, 19L, 16L, 14L, 13L, 
    11L, 9L, 5L, 0L)), .Names = c("V1", "V2"), class = "data.frame", row.names = c(NA, 
    -10L))