Search code examples
rgroup-bycountunique

How can I group by and summarize while keeping unique values by group and count their occurence?


I have a data frame that looks something like:

df <- data.frame(resource = c("gold", "bronze", "gold", "silver", "silver", "gold", "gold", "silver"), price = (c(10, 15, 20, 12, 12, 10, 10, 15)), extraction = c(100, 200, 50, 200, 250, 100, 50, 50))


    r       p   e
1   gold    10  100
2   bronze  15  200
3   gold    20  50
4   silver  12  200
5   silver  12  250
6   gold    10  100
7   gold    10  50
8   silver  15  50

I would like to collapse this dataset by resource, such that I have one variable that counts total extraction volume and as many extra variables as there are unique prices of the resource. Additionally I would like to have another variable that, for each unique price, counts how many observations were valued at this price.

This would look something like:

ID r       total_extr. price1 n_price1 price2 n_price2
1  gold    300         10     3        20     1
2  silver  500         12     2        15     1
3  bronze  200         15     1        NA     NA

Ideally, prices would be ascending or descending (in my dataset, there are more than two different prices per group).

The first 50 rows of my original dataset are:

structure(list(extraction = c(NA, NA, 3800, 5000, 3800, 3800, 3800, 3800, 3800, 3800, 3800, 3800, 3800, 3800, 660, 3800, 125, 
3800, 3800, 660, 100, 3800, 40950, 250, 250, 150000, 35000, NA, 
1e+05, 53000, NA, 225000, NA, 260000, 260000, NA, 260000, NA, 
260000, NA, NA, 260000, 260000, 260000, 260000, 40, 523, NA, 
NA, 523), price = c(NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, NA, 
NA, NA, NA, 27226012, NA, 21677.578125, NA, NA, 21047.84765625, 
15398.7431640625, NA, 12181.1640625, 11378.0888671875, 11137.2998046875, 
0.326765239238739, 0.326765239238739, 0.326765239238739, 0.352094233036041, 
0.352094233036041, 0.307463765144348, 0.307463765144348, 0.280774921178818, 
0.280774921178818, 0.240696549415588, 0.240696549415588, 0.168027445673943, 
0.168027445673943, 0.144999995827675, 0.144999995827675, 0.131485313177109, 
0.131485313177109, 0.129491910338402, 0.103749454021454, 0.14696241915226, 
473.7353515625, NA, NA, NA, NA), resource = c("salt", "salt", 
"natural gas", "natural gas", "natural gas", "natural gas", "natural gas", 
"natural gas", "natural gas", "natural gas", "natural gas", "natural gas", 
"natural gas", "natural gas", "tin", "natural gas", "tin", "natural gas", 
"natural gas", "tin", "tin", "natural gas", "gold", "gold", "gold", 
"diamond", "diamond", "diamond", "diamond", "diamond", "diamond", 
"diamond", "diamond", "diamond", "diamond", "diamond", "diamond", 
"diamond", "diamond", "diamond", "diamond", "diamond", "diamond", 
"diamond", "diamond", "diamond", "natural gas", "natural gas", 
"natural gas", "natural gas")), row.names = c(NA, 50L), class = "data.frame")

Solution

  • You can do:

    library(tidyverse)
    df |> 
      group_by(resource) |> 
      add_count(price) |> 
      mutate(extraction = sum(extraction)) |> 
      distinct() |> 
      mutate(id = 1:n()) |> 
      ungroup() |> 
      pivot_wider(names_from = id,
                  values_from = c(price, n),
                  names_vary = "slowest") |> 
      mutate(order = c(1, 3, 2)) |> 
      arrange(order) |> 
      select(-order)
    
    # A tibble: 3 × 6
      resource extraction price_1   n_1 price_2   n_2
      <chr>         <dbl>   <dbl> <int>   <dbl> <int>
    1 gold            300      10     3      20     1
    2 silver          500      12     2      15     1
    3 bronze          200      15     1      NA    NA
    

    New solution with updated regex from TO.

    Basically, just add the na.rm = TRUE argument to the sum function:

    df |> 
      group_by(resource) |> 
      add_count(price) |> 
      mutate(extraction = sum(extraction, na.rm = TRUE)) |> 
      distinct() |> 
      mutate(id = 1:n()) |> 
      ungroup() |> 
      pivot_wider(names_from = id,
                  values_from = c(price, n),
                  names_vary = "slowest")
    
    # A tibble: 5 × 26
      extraction resource         price_1   n_1   price_2   n_2   price_3   n_3   price_4   n_4 price_5   n_5 price_6   n_6 price_7   n_7 price_8   n_8 price_9   n_9 price_10  n_10 price_11  n_11 price_12  n_12
           <dbl> <chr>              <dbl> <int>     <dbl> <int>     <dbl> <int>     <dbl> <int>   <dbl> <int>   <dbl> <int>   <dbl> <int>   <dbl> <int>   <dbl> <int>    <dbl> <int>    <dbl> <int>    <dbl> <int>
    1          0 salt              NA         2    NA        NA    NA        NA    NA        NA  NA        NA  NA        NA  NA        NA  NA        NA  NA        NA   NA        NA   NA        NA      NA     NA
    2      63046 natural gas       NA        20    NA        NA    NA        NA    NA        NA  NA        NA  NA        NA  NA        NA  NA        NA  NA        NA   NA        NA   NA        NA      NA     NA
    3       1545 tin         27226012         1 21678.        1 21048.        1 15399.        1  NA        NA  NA        NA  NA        NA  NA        NA  NA        NA   NA        NA   NA        NA      NA     NA
    4      41450 gold           12181.        1 11378.        1 11137.        1    NA        NA  NA        NA  NA        NA  NA        NA  NA        NA  NA        NA   NA        NA   NA        NA      NA     NA
    5    2643040 diamond            0.327     3     0.352     2     0.307     2     0.281     2   0.241     2   0.168     2   0.145     2   0.131     2   0.129     1    0.104     1    0.147     1     474.     1