Search code examples
rsummarize

Summarize the lowest values in a Dataframe?


My data frame looks like this:

View(df)
Product     Value
  a           2
  b           4 
  c           3
  d           10
  e           15
  f           5
  g           6
  h           4
  i           50
  j           20
  k           35
  l           25
  m           4
  n           6
  o           30
  p           4
  q           40
  r           5
  s           3
  t           40

I want to find the 9 most expensive products and summaries the rest. It should look like this:

Product     Value 
  d           10
  e           15
  i           50
  j           20
  k           35
  l           25
  o           30
  q           40
  t           40
 rest         46

Rest is the sum of the other 11 products. I tried it with summaries, but it didn't work:

new <- df %>%
  group_by(Product)%>%
summarise((Value > 10) = sum(Value)) %>%
  ungroup()

Solution

  • We can use dplyr::row_number to effectively rank the observations after using arrange to order the data by Value. Then, we augment the Product column so that any values that aren't in the top 9 are coded as Rest. Finally, we group by the updated Product and take the sum using summarise

    dat %>%
        arrange(desc(Value)) %>%
        mutate(RowNum = row_number(),
               Product = ifelse(RowNum <= 9, Product, 'Rest')) %>%
        group_by(Product) %>%
        summarise(Value = sum(Value))
    
    # A tibble: 10 × 2
       Product Value
         <chr> <int>
    1        d    10
    2        e    15
    3        i    50
    4        j    20
    5        k    35
    6        l    25
    7        o    30
    8        q    40
    9     Rest    46
    10       t    40
    

    data

    dat <- structure(list(Product = c("a", "b", "c", "d", "e", "f", "g", 
    "h", "i", "j", "k", "l", "m", "n", "o", "p", "q", "r", "s", "t"
    ), Value = c(2L, 4L, 3L, 10L, 15L, 5L, 6L, 4L, 50L, 20L, 35L, 
    25L, 4L, 6L, 30L, 4L, 40L, 5L, 3L, 40L)), .Names = c("Product", 
    "Value"), class = "data.frame", row.names = c(NA, -20L))