Extracting character sequences from text using the stringr package in R


I have a text column named 'OBSERVA'. Somewhere in this text there may be a sequence of 8 digits, a code that I would like to extract to fill another column. For example, one row of the OBSERVA column contains the following record: 'DO 29932940-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL.' In this case, I need to extract the number 29932940. I used the 'str_extract' function from the 'stringr' package, but the result was not satisfactory: the 8-digit sequence is not identified and I only get NAs.

library(stringr)
dados_sivep_tratados$Teste  <- ifelse(
  dados_sivep_tratados$NU_DO == 0 & !is.na(dados_sivep_tratados$OBSERVA),
  str_extract(dados_sivep_tratados$OBSERVA, "\\b\\d{8}\\b"),
  NA
)

Solution

  • Example using the lookahead pattern \\d+(?=-), which matches the digits immediately before "-" whatever the length of the number

    library(stringr)
    
    df <- data.frame(
      OBS = c(
        "DO 29932940-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL.",
        "DO 29932967840-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL."
      )
    )
    
    df$ExtractedNumber <- str_extract(df$OBS, "\\d+(?=-)")
    
    
    print(df$ExtractedNumber)
    #> [1] "29932940"    "29932967840"
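
If the code must be exactly 8 digits, as stated in the question, a stricter pattern is one option. What follows is only a sketch under assumptions: the column names NU_DO, OBSERVA and Teste come from the question, but the sample values are made up, and treating a missing NU_DO the same as 0 is my assumption about the intent. It may also be worth checking the condition itself: in R, `NA == 0` evaluates to NA, and `ifelse()` returns NA wherever its test is NA, so missing values in NU_DO could be one source of the NAs.

```r
library(stringr)

# Hypothetical sample data: column names follow the question, values are made up
dados <- data.frame(
  NU_DO = c(0, 0, NA, 12345678),
  OBSERVA = c(
    "DO 29932940-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL.",
    "DO 29932967840-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL.",
    "DO 29932940-2 OCCUPATION: RETIRED",
    NA
  )
)

# Exactly eight digits: a word boundary before them and a hyphen right after them.
# The 11-digit number in the second row does not match and yields NA.
codigo <- str_extract(dados$OBSERVA, "\\b\\d{8}(?=-)")

# Assumption: a missing NU_DO should be filled just like NU_DO == 0.
# is.na() is checked explicitly so that NA never reaches the ifelse() test.
precisa <- (is.na(dados$NU_DO) | dados$NU_DO == 0) & !is.na(dados$OBSERVA)

dados$Teste <- ifelse(precisa, codigo, NA_character_)
```

Running `table(precisa, useNA = "ifany")` on the real data would show how many rows actually satisfy the condition before any pattern is blamed.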