How to extract different patterns from a string in R?


I want to extract a phrase, which appears in several variants, from the following sentences.

text1 <- "On a year-on-year basis, the number of subscribers of Netflix increased 1.15% in November last year."

text2 <- "There is no confirmed audited number of subscribers in the Netflix's earnings report."

text3 <- "Netflix's unaudited number of subscribers has grown more than 1.50% at the last quarter."

The phrase to extract is "number of subscribers", "audited number of subscribers", or "unaudited number of subscribers".

I am using the pattern \\bnumber\\s+of\\s+subscribers?\\b from a previous question (thanks to @wiktor-stribiżew) to extract the phrases:

library(stringr)

find_words <- function(text) {
  # Matches only the bare phrase "number of subscriber(s)"
  pattern <- "\\bnumber\\s+of\\s+subscribers?\\b"
  str_extract(text, pattern)
}

However, this extracts only the exact phrase "number of subscribers", not the variants preceded by "audited" or "unaudited".

Desired output:

find_words(text1)

'number of subscribers'

find_words(text2)

'audited number of subscribers'

find_words(text3)

'unaudited number of subscribers'


Solution

  • Try putting an optional group for the "audited " / "unaudited " prefix in front of the phrase:

    library(stringr)

    find_words <- function(text) {
      # Optional "audited " or "unaudited " prefix, then the phrase itself
      pattern <- "(audited |unaudited )?number\\s+of\\s+subscribers"
      str_extract(text, pattern)
    }
    

    You can test it with the sample texts you provided:

    find_words(text1)
    # 'number of subscribers'
    find_words(text2)
    # 'audited number of subscribers'
    find_words(text3)
    # 'unaudited number of subscribers'
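
The pattern above has no word boundaries, allows only a single space after the prefix, and drops the optional trailing "s" from the question's original pattern. If that matters for your data, one possible tightening is sketched below. The name find_words_strict is hypothetical and only serves to keep it apart from find_words; the sketch assumes the stringr package is installed.

```r
library(stringr)

# Hypothetical stricter variant; the function name is an assumption, not part of the answer above.
find_words_strict <- function(text) {
  # \\b                       word boundary, so "reaudited" is not read as "audited"
  # (?:(?:un)?audited\\s+)?   optional "audited" or "unaudited" followed by whitespace
  # subscribers?              singular or plural, as in the question's original pattern
  pattern <- "\\b(?:(?:un)?audited\\s+)?number\\s+of\\s+subscribers?\\b"
  str_extract(text, pattern)
}

find_words_strict("Netflix's unaudited number of subscribers has grown.")
# "unaudited number of subscribers"
find_words_strict("The reaudited number of subscribers is lower.")
# "number of subscribers"
```

Like str_extract() in general, this returns only the first match in each string and NA when nothing matches; str_extract_all() would return every match.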