Tags: r, regex, string, tidytext

Find characters before and after dollar amount in vector of text data in R


I have a vector of text data (news articles). I am trying to scan the text for money amounts and capture the text surrounding each amount. I got this working for the first element of my vector, but I am struggling to repeat the process for all elements with a loop and a list. I use str_extract_currencies from the strex package, which does a good job of detecting numbers. This may also be possible with regular expressions alone, but I don't know how.

textdata <- data.frame(document = c(1,2),
                       txt = c("Outplay today announced its $7.3M series A fundraise from Sequoa Capital India. ..., which is poised to be a $5.59B market by 2023, is a huge opportunity for Outplay.", "India's leading digital care ecosystem for chronic condition management – has raised USD 5.7 million in funding led by US-based venture capital firm, W Health Ventures. The funding also saw participation from e-pharmacy Unicorn PharmEasy (a Threpsi Solutions Pvt Ltd brand), Merisis VP and existing investors Orios VP, Leo Capital, and others. With around 463 million people with diabetes and $1.13  billion with hypertension across the world"))

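library(strex)  # str_extract_currencies()
library(dplyr)  # %>% and filter()

# Money amounts found in the first document only
numbers <- str_extract_currencies(textdata$txt[1]) %>%
  filter(curr_sym == "$")

# Print up to 20 characters of context on each side of every amount
for (i in seq_len(nrow(numbers))) {
  print(stringr::str_extract(textdata$txt[1], paste0(".{0,20}", numbers$amount[i], ".{0,20}")))
}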

finaldata <- data.frame(document = c(1,1,2),
                        money_related = c("oday announced its $7.3M series A fundraise",
                                          " is poised to be a $5.59B market by 2023, is",
                                          "with diabetes and $1.13  billion with hyper"))

A document may contain zero or more money amounts. I would like to store the results in a data.frame like this:

> finaldata
  document                                money_related
1        1  oday announced its $7.3M series A fundraise
2        1  is poised to be a $5.59B market by 2023, is
3        2  with diabetes and $1.13  billion with hyper

Thank you very much.


Solution

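  • Here is a tidyverse solution that does not need the {strex} package. You will probably need to run it against your real data and extend the pattern to cover other cases; for example, "USD 5.7 million" is not matched because of the space after the currency code: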

    library(tidyverse)
    
    textdata %>% 
      rowwise(document) %>% 
      summarise(txt = str_extract_all(txt, ".{1,20}(\\$|USD)[0-9.]+\\s?[A-Za-z]?.{1,20}")) %>% 
      unnest_longer(txt)
    
    #> `summarise()` has grouped output by 'document'. You can override using the `.groups` argument.
    #> # A tibble: 3 x 2
    #> # Groups:   document [2]
    #>   document txt                                             
    #>      <dbl> <chr>                                           
    #> 1        1 "today announced its $7.3M series A fundraise " 
    #> 2        1 "h is poised to be a $5.59B market by 2023, is "
    #> 3        2 "e with diabetes and $1.13  billion with hypert"
    

    Created on 2022-01-21 by the reprex package (v2.0.1)
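
For comparison, the same idea can be sketched in base R, without any packages. This is only an illustration under stated assumptions, not the approach from the answer above: the helper name `extract_money_context`, its `width` argument, and the shortened sample text are made up for this example, and the pattern is slightly broader than the original one (it also accepts a space after `USD` and a full unit word such as "million").

```r
# Sample input with the same columns as `textdata` in the question
# (texts shortened for illustration; document 3 has no money amount).
textdata <- data.frame(
  document = c(1, 2, 3),
  txt = c(
    "Outplay today announced its $7.3M series A fundraise. It is poised to be a $5.59B market by 2023.",
    "The company has raised USD 5.7 million in funding led by a US-based venture capital firm.",
    "No money amounts are mentioned in this document."
  ),
  stringsAsFactors = FALSE
)

# Hypothetical helper: find every money amount in each element of `txt` and
# return it with up to `width` characters of context on each side.
extract_money_context <- function(document, txt, width = 20) {
  # "$" or "USD" (optionally followed by a space), then a number,
  # then an optional unit such as "M", "B" or "million".
  pattern <- sprintf(
    ".{0,%d}(\\$|USD\\s?)[0-9][0-9.,]*\\s*[A-Za-z]*.{0,%d}",
    width, width
  )
  # One character vector of matches per document (length 0 if none).
  hits <- regmatches(txt, gregexpr(pattern, txt, perl = TRUE))
  data.frame(
    document = rep(document, lengths(hits)),
    money_related = as.character(unlist(hits)),
    stringsAsFactors = FALSE
  )
}

finaldata <- extract_money_context(textdata$document, textdata$txt)
print(finaldata)
```

Documents without any amount simply contribute no rows. As with the pattern in the answer, matches cannot overlap, so a second amount that falls inside the trailing context of the first is absorbed into that match instead of being reported on its own row.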