Search code examples
rdataframerangemultiple-columns

Function to create a new column from existing group ranges in a dataframe


The head of my dataframe is as follows:

                 ranges  n
1             (0,1e+04] 13
2         (1e+04,2e+04] 11
3         (2e+04,3e+04] 21
4         (3e+04,4e+04] 14
5         (4e+04,5e+04]  9
6         (5e+04,6e+04]  8
7         (6e+04,7e+04] 13
8         (7e+04,8e+04] 11
9         (8e+04,9e+04] 16
10        (9e+04,1e+05] 16

This shows ranges of specific data and the number of counts per each of the groups.

I'd like to create a new column between the two existing ones. It should consist of values representing the middle point of each group range( i.e., 5000 for the first group, 15,000 for the second and so on.

So far, I've managed to add a new column with the command:

add_column(Position = "Value",
             .after="ranges")

The results are:

                 ranges Position  n
1             (0,1e+04]    Value 13
2         (1e+04,2e+04]    Value 11
3         (2e+04,3e+04]    Value 21
4         (3e+04,4e+04]    Value 14
5         (4e+04,5e+04]    Value  9
6         (5e+04,6e+04]    Value  8
7         (6e+04,7e+04]    Value 13
8         (7e+04,8e+04]    Value 11
9         (8e+04,9e+04]    Value 16
10        (9e+04,1e+05]    Value 16

I'm still not sure how to add the middle value of each groups' range instead of just "Value" using this command.

Any suggestions on how to do this?


Solution

  • mydf <- data.frame(
      ranges  =
       c("(0,1e+04]",
       "(1e+04,2e+04]",
       "(2e+04,3e+04]",
       "(3e+04,4e+04]",
       "(4e+04,5e+04]",
       "(5e+04,6e+04]",
       "(6e+04,7e+04]",
       "(7e+04,8e+04]",
       "(8e+04,9e+04]",
       "(9e+04,1e+05]")
    )
    
    library(dplyr)
    library(stringr)
    library(tidyr)
    library(readr)
    
    mydf |> 
      separate(ranges, into = c("min", "max"), sep = ",") |> 
      mutate(min = str_remove(min, "\\("),
             max = str_remove(max, "\\]"),
             across(c(min, max), parse_number)) |> 
      rowwise() |> 
      mutate(value = mean(c_across(c(min, max)))) |> 
      ungroup()
    
    # A tibble: 10 × 3
         min    max value
       <dbl>  <dbl> <dbl>
     1     0  10000  5000
     2 10000  20000 15000
     3 20000  30000 25000
     4 30000  40000 35000
     5 40000  50000 45000
     6 50000  60000 55000
     7 60000  70000 65000
     8 70000  80000 75000
     9 80000  90000 85000
    10 90000 100000 95000