I have a data.frame of the following type:
id | text_information |
---|---|
1 | Increase from 10.81% to 60.1% |
2 | Purchase 100.00 % |
3 | Increase from 5.9% to 45.48% |
4 | Purchase 99.0% |
I would like to process the text_information (character) variable such that I obtain the following output:
id | share | share_difference | type |
---|---|---|---|
1 | 0.601 | 0.493 | increase |
2 | 1 | NA | purchase |
3 | 0.455 | 0.396 | increase |
4 | 0.99 | NA | purchase |
Any suggestions on how this could be done using R?
Use regular expressions with capture groups (the group argument of str_extract() requires stringr 1.5.0 or later):
library(dplyr)
library(stringr)

data.frame(
  id = 1:4,
  text_information = c(
    "Increase from 10.81% to 60.1%",
    "Purchase 100.00%",
    "Increase from 5.9% to 45.48%",
    "Purchase 99.0%"
  )
) %>%
  mutate(
    # First percentage in the string (capture group 1)
    share_1 = as.numeric(str_extract(text_information, "(\\d+\\.\\d+)%", group = 1)),
    # Second percentage, present only in "from X% to Y%" strings (capture group 2)
    share_2 = as.numeric(str_extract(text_information, "(\\d+\\.\\d+)% to (\\d+\\.\\d+)%$", group = 2)),
    # Use the final share when there are two values, otherwise the only one
    share = if_else(is.na(share_2), share_1, share_2) / 100,
    # NA for purchases, because share_2 is NA there
    share_difference = (share_2 - share_1) / 100,
    type = tolower(str_extract(text_information, "(Increase|Purchase)"))
  ) %>%
  select(id, share, share_difference, type)
#>   id  share share_difference     type
#> 1  1 0.6010           0.4929 increase
#> 2  2 1.0000               NA purchase
#> 3  3 0.4548           0.3958 increase
#> 4  4 0.9900               NA purchase
Created on 2024-04-08 with reprex v2.1.0
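Note that the pattern `\\d+\\.\\d+` only matches numbers with a decimal point directly followed by `%`, so a value such as "99%" or "100.00 %" (with a space, as in the question's table) would come back as NA. If the real data contains such cases, a more tolerant variant could look like the sketch below. It uses base R only; the function name `parse_shares` and the assumption that the type is always the first word of the string are mine, not part of the original question.

```r
# Hypothetical helper: extract every percentage from each string,
# allowing integers ("99%") and optional whitespace before "%".
parse_shares <- function(x) {
  # Lookahead keeps the "%" (and any spaces) out of the match itself
  matches <- regmatches(x, gregexpr("[0-9]+(\\.[0-9]+)?(?=\\s*%)", x, perl = TRUE))
  nums <- lapply(matches, as.numeric)
  n <- lengths(nums)

  # First and last percentage per string (NA if none was found)
  first <- vapply(nums, function(v) if (length(v) > 0) v[1] else NA_real_, numeric(1))
  last  <- vapply(nums, function(v) if (length(v) > 0) v[length(v)] else NA_real_, numeric(1))

  data.frame(
    share = last / 100,
    # A difference only makes sense when two values are present
    share_difference = ifelse(n >= 2, (last - first) / 100, NA_real_),
    # Assumption: the leading word is the type; strings that do not
    # start with a letter are returned unchanged (lower-cased) by sub()
    type = tolower(sub("^([A-Za-z]+).*$", "\\1", x)),
    stringsAsFactors = FALSE
  )
}

parse_shares(c("Increase from 10.81% to 60.1%", "Purchase 100.00 %", "Purchase 99%"))
```

The result can be attached to the original data with `cbind(df["id"], parse_shares(df$text_information))`, where `df` stands for your data.frame.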