Extracting/classifying quantitative information from a character variable

I have a data.frame of the following type:

id	text_information
1	Increase from 10.81% to 60.1%
2	Purchase 100.00 %
3	Increase from 5.9% to 45.48%
4	Purchase 99.0%

I would like to process the text_information (character) variable such that I obtain the following output:

id	share	share_difference	type
1	0.601	0.493	increase
2	1	NA	purchase
3	0.455	0.396	increase
4	0.99	NA	purchase

Any suggestions on how this could be done in R?

Solution

Use regular expressions with capture groups (the group argument of str_extract() requires stringr 1.5.0 or later):

library(dplyr)
library(stringr)

data.frame(
  id = 1:4,
  text_information = c(
    "Increase from 10.81% to 60.1%", 
    "Purchase 100.00%", 
    "Increase from 5.9% to 45.48%", 
    "Purchase 99.0%"
  )
) %>% 
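  mutate(
    # First percentage in the string (capture group 1)
    share_1 = as.numeric(str_extract(text_information, "(\\d+\\.\\d+)%", 1)),
    # Second percentage, present only in "X% to Y%" strings (NA otherwise)
    share_2 = as.numeric(str_extract(text_information, "(\\d+\\.\\d+)% to (\\d+\\.\\d+)%$", 2)),
    # Final share as a proportion: the "to" value if present, else the only value
    share = if_else(is.na(share_2), share_1, share_2) / 100,
    # NA for purchases, because share_2 is NA there
    share_difference = (share_2 - share_1) / 100,
    type = tolower(str_extract(text_information, "(Increase|Purchase)"))
  ) %>%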
  select(id, share, share_difference, type)
#>   id  share share_difference     type
#> 1  1 0.6010           0.4929 increase
#> 2  2 1.0000               NA purchase
#> 3  3 0.4548           0.3958 increase
#> 4  4 0.9900               NA purchase

Created on 2024-04-08 with reprex v2.1.0
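
Note that the sample in the question contains "Purchase 100.00 %" with a space before the percent sign, which the patterns above would not match. If the real data has such variations (or integer percentages like "45%"), a more tolerant pattern may help. The following is only a sketch in base R, not a replacement for the answer above: extract_shares() is a hypothetical helper, and it assumes that the type is always the first word of the string and that each string holds at most two percentages.

```r
# Hypothetical helper: return every percentage in each string as a proportion.
# Accepts integers ("45%") and optional whitespace before the percent sign.
extract_shares <- function(x) {
  hits <- regmatches(x, gregexpr("\\d+(\\.\\d+)?(?=\\s*%)", x, perl = TRUE))
  lapply(hits, function(h) as.numeric(h) / 100)
}

df <- data.frame(
  id = 1:4,
  text_information = c(
    "Increase from 10.81% to 60.1%",
    "Purchase 100.00 %",
    "Increase from 5.9% to 45.48%",
    "Purchase 99.0%"
  )
)

shares <- extract_shares(df$text_information)

# The last percentage found is the resulting share (NA if none was found).
df$share <- vapply(
  shares,
  function(s) if (length(s) > 0) s[length(s)] else NA_real_,
  numeric(1)
)

# A difference only exists when two percentages were found.
df$share_difference <- vapply(
  shares,
  function(s) if (length(s) == 2) s[2] - s[1] else NA_real_,
  numeric(1)
)

# Assumption: the first word of the text is the type.
df$type <- tolower(sub(" .*", "", df$text_information))

result <- df[, c("id", "share", "share_difference", "type")]
```

Extracting all matches at once avoids writing a separate pattern for the one-value and two-value cases, at the cost of relying on the order in which the percentages appear.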