I have a text column named 'OBSERVA'. Somewhere in each text there may be a sequence of 8 digits corresponding to a code that I would like to extract to fill another column. For example, one of the rows in the OBSERVA column contains the following record: 'DO 29932940-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL.' In this case, I need to extract the number 29932940. I used the 'str_extract' function from the 'stringr' package, but I did not get a satisfactory result (the 8-digit sequence is not identified; I just get NAs).
library(stringr)
dados_sivep_tratados$Teste <- ifelse(
  dados_sivep_tratados$NU_DO == 0 & !is.na(dados_sivep_tratados$OBSERVA),
  str_extract(dados_sivep_tratados$OBSERVA, "\\b\\d{8}\\b"),
  NA
)
Here is an example that handles numbers of different lengths before the hyphen (-):
library(stringr)
df <- data.frame(
  OBS = c(
    "DO 29932940-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL.",
    "DO 29932967840-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL."
  )
)
df$ExtractedNumber <- str_extract(df$OBS, "\\d+(?=-)")
print(df$ExtractedNumber)
[1] "29932940" "29932967840"
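If only codes of exactly 8 digits should be accepted, as the question describes, the same lookahead can be combined with a fixed-length quantifier and a negative lookbehind. This is a minimal sketch, assuming the code is always followed directly by a hyphen; the names `obs` and `codes` are made up for illustration and are not from the original data.

```r
library(stringr)

# Hypothetical sample records (the third one has no code at all)
obs <- c(
  "DO 29932940-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL.",
  "DO 29932967840-2 OCCUPATION: RETIRED INFLUENZA UNDER ANALYSIS GAL.",
  "NO CODE IN THIS RECORD"
)

# (?<!\\d)  the match must not be preceded by another digit
# \\d{8}    exactly 8 digits
# (?=-)     the match must be followed by a hyphen (not included in the result)
codes <- str_extract(obs, "(?<!\\d)\\d{8}(?=-)")
print(codes)
```

With this pattern the first record yields "29932940", while the 11-digit number and the record without a code both give NA, so longer digit runs are not silently truncated to 8 digits.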