Search code examples
rstringdataframedplyrdata-cleaning

Find the maximum number in an R dataframe column of strings


For each cell in a particuar column of a dataframe (which here we will simply name as df), I want to find the value of maximum and minimum number that is originally represented as a string, embedded in a string. Any commas present in the cell have no special significance. These numbers should not be a percentage, so if for example 50% appears then 50 is to be excluded from consideration. The relevant column of the dataframe looks something like this:

| particular_col_name | 
| ------------------- | 
| First Row String10. This is also a string_5, and so is this 20, exclude70% |
| Second_Row_50%, number40. Number 4. number_15|

So two new columns should be created with title 'maximum_number' and 'minimum number, and in the case of the first row the former should say 20 and 5 respectively. Note that 70 has been excluded because of the % sign next to it. Similarly, the second row should put 40 and 4 into the new columns.

I have tried a couple of methods (e.g. str_extract_all, regmatches, strsplit), within the dplyr 'mutate' operator, but they either give error messages (particularly regarding the input column particular_col_name) or do not output the data in an appropriate format for the maximum and minimum values to be easily identified.

Any help on this would be most appreciated please.


Solution

  • library(tidyverse)
    
    tibble(
      particular_col_name = c(
        "First Row String10. This is also a string_5, and so is this 20, exclude70%",
        "Second_Row_50%, number40. Number 4. number_15",
        "20% 30%"
      )
    ) %>%
      mutate(
        numbers = particular_col_name %>% map(~ {
          .x %>% str_remove_all("[0-9]+%") %>% str_extract_all("[0-9]+") %>% simplify() %>% as.numeric()
        }),
        min = numbers %>% map_dbl(~ .x %>% min() %>% na_if(Inf) %>% na_if(-Inf)),
        max = numbers %>% map_dbl(~ .x %>% max() %>% na_if(Inf) %>% na_if(-Inf))
      ) %>%
      select(-numbers)
    #> Warning in min(.): no non-missing arguments to min; returning Inf
    #> Warning in max(.): no non-missing arguments to max; returning -Inf
    #> # A tibble: 3 x 3
    #>   particular_col_name                                                  min   max
    #>   <chr>                                                              <dbl> <dbl>
    #> 1 First Row String10. This is also a string_5, and so is this 20, e…     5    20
    #> 2 Second_Row_50%, number40. Number 4. number_15                          4    40
    #> 3 20% 30%                                                               NA    NA
    

    Created on 2022-02-22 by the reprex package (v2.0.0)