Search code examples
rdata.tablebinning

Binning data in R with the same output as in spreadsheet


I have this sample data:

set.seed(25)

xx <- data.table(
  year = 2015,
  values = iris$Sepal.Length,
  score = sample(1:8, nrow(iris), replace = TRUE))

The actual data contains many years and rows. I wanted to grouped the values column using the cut() function in base R but the result is different from the result generated by LibreOffice Calc (even in MS Office Excel) pivot. This is what I have done so far:

brks <- seq(0, ceiling(max(xx$values)), 0.5)
xx[, bins := cut(values, brks, ordered_result = TRUE)]
xx_binned <- dcast(xx, bins ~ year, length, value.var = "values")
xx_binned <- melt(xx_binned, id.vars = "bins", value.name = "value")

I started at 0 so that it will be consistent if I use different data. In the spreadsheet I also started at 0 as the starting number.

The result of the above codes is this:

     bins   variable value
1   (4,4.5] 2015     5
2   (4.5,5] 2015     27
3   (5,5.5] 2015     27
4   (5.5,6] 2015     30
5   (6,6.5] 2015     31
6   (6.5,7] 2015     18
7   (7,7.5] 2015     6
8   (7.5,8] 2015     6

This is the result of LibreOffice Calc:

values  2015
4-4.5   15
4.5-5   106
5-5.5   100
5.5-6   142
6-6.5   148
6.5-7   95
7-7.5   25
7.5-8   27

How can I make it the same? I am writing a function converting a spreadsheet tools into R function and I want it to be the same as in the output of the spreadsheet.

Thanks.


Solution

  • You have to sum up the score not the number of cases to come to the same values.

    aggregate(xx$score, list(cut(xx$values, brks, right=FALSE, ordered_result = TRUE)), sum)
    #  Group.1   x
    #1 [4,4.5)  15
    #2 [4.5,5) 106
    #3 [5,5.5) 100
    #4 [5.5,6) 142
    #5 [6,6.5) 148
    #6 [6.5,7)  95
    #7 [7,7.5)  25
    #8 [7.5,8)  27
    

    Or updating your code:

    library(data.table)
    xx <- data.table(xx)
    xx[, bins := cut(values, brks, right=FALSE, ordered_result = TRUE)]
    dcast(xx, bins ~ year, sum, value.var = "score")
    

    Data:

    set.seed(25)
    
    xx <- data.frame(
      year = 2015,
      values = iris$Sepal.Length,
      score = sample(1:8, nrow(iris), replace = TRUE))
    brks <- seq(0, ceiling(max(xx$values)), 0.5)