Extracting/classifying quantitative information from character variable


I have a data.frame of the following form:

id text_information
1 Increase from 10.81% to 60.1%
2 Purchase 100.00 %
3 Increase from 5.9% to 45.48%
4 Purchase 99.0%

I would like to process the text_information (character) variable such that I obtain the following output:

id share share_difference type
1 0.601 0.493 increase
2 1 NA purchase
3 0.455 0.396 increase
4 0.99 NA purchase

Any suggestions on how this could be done in R?


Solution

  • Use regular expressions with capture groups (the group argument of str_extract() requires stringr 1.5.0 or later):

    library(dplyr)
    library(stringr)
    
    data.frame(
      id = 1:4,
      text_information = c(
        "Increase from 10.81% to 60.1%", 
        "Purchase 100.00%", 
        "Increase from 5.9% to 45.48%", 
        "Purchase 99.0%"
      )
    ) %>% 
      mutate(
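        # First percentage in the string; \\s* tolerates a space before the % sign
        share_1 = as.numeric(str_extract(text_information, "(\\d+\\.\\d+)\\s*%", group = 1)),
        # Second percentage; only "from X% to Y%" rows match, all others give NA
        share_2 = as.numeric(str_extract(text_information, "(\\d+\\.\\d+)\\s*% to (\\d+\\.\\d+)\\s*%$", group = 2)),
        # Use the final share when two values are present, otherwise the only one
        share = if_else(is.na(share_2), share_1, share_2) / 100,
        share_difference = (share_2 - share_1) / 100,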
        type = tolower(str_extract(text_information, "(Increase|Purchase)"))
      ) %>% 
      select(id, share, share_difference, type)
    #>   id  share share_difference     type
    #> 1  1 0.6010           0.4929 increase
    #> 2  2 1.0000               NA purchase
    #> 3  3 0.4548           0.3958 increase
    #> 4  4 0.9900               NA purchase
    

    Created on 2024-04-08 with reprex v2.1.0
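
  • If you would rather avoid extra packages, the same idea can be sketched in base R. This is only an illustration, not part of the answer above: parse_shares() is a hypothetical helper name, and it assumes that every number followed by a percent sign is a share, that the last such number is the final share, and that the first word of the text is the type. Unlike the patterns above, it also accepts whole-number percentages and a space before the percent sign, as in the question's second row.

```r
# Hypothetical helper (not from the original answer): parse free-text share
# descriptions with base R only.
parse_shares <- function(text) {
  # Collect every number that is followed by optional whitespace and "%".
  matches <- regmatches(
    text,
    gregexpr("[0-9]+(\\.[0-9]+)?(?=\\s*%)", text, perl = TRUE)
  )
  n <- lengths(matches)

  # First and last percentage per string; NA when nothing matched.
  first <- vapply(matches, function(m) {
    if (length(m)) as.numeric(m[1]) else NA_real_
  }, numeric(1))
  last <- vapply(matches, function(m) {
    if (length(m)) as.numeric(m[length(m)]) else NA_real_
  }, numeric(1))

  data.frame(
    share = last / 100,
    # A difference only makes sense when two percentages were found.
    share_difference = ifelse(n >= 2, (last - first) / 100, NA_real_),
    # Assumption: the leading word ("Increase", "Purchase") is the type.
    type = tolower(sub("^\\s*([A-Za-z]+).*$", "\\1", text)),
    stringsAsFactors = FALSE
  )
}

# Usage with the strings from the question (note the space in "100.00 %").
texts <- c(
  "Increase from 10.81% to 60.1%",
  "Purchase 100.00 %",
  "Increase from 5.9% to 45.48%",
  "Purchase 99.0%"
)
result <- cbind(id = seq_along(texts), parse_shares(texts))
print(result)
```

    Extracting all percentages at once and then taking the first and last keeps the logic in one pattern, at the cost of assuming that no unrelated percentages appear in the text.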