This post asks how to extract a string between two other strings in R: Extracting a string between other two strings in R
I'm seeking a similar answer, but one that covers multiple occurrences between patterns.
Example string:
Fabricante: EMS S/A CNPJ: - 57.507.378/0001-01 Endereço: SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante: EMS S/A CNPJ: - 57.507.378/0003-65 Endereço: HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante: NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ: - 12.424.020/0001-79 Endereço: MANAUS - AM - BRASIL Etapa de Fabricaçao:
Between each occurrence of the words "Fabricante" and "CNPJ" there is a company name, which I would like to extract. This string contains three such names: "EMS S/A", "EMS S/A", and "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA".
Based on the post above, this code
gsub(".*Fabricante: *(.+) CNPJ:.*", "\\1", df$manufacturing_location[92])
returns the last occurrence, "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA".
When I change to
gsub(".*Fabricante: *(.*?) CNPJ:.*", "\\1", df$manufacturing_location[92])
it returns the first. I tried changing \\1 to \\2, as I thought this would number the occurrences, but then I get an empty string. I also tried using stringr's str_match_all, but that did not work either.
Does anyone know how to adjust the syntax so I can tailor the code to return each of the three as needed?
I would like to put this into a mutate call that I can apply to a dataset with many such strings, returning the first, second, and third entries as variables. So far, I have not been able to make str_match_all work for this.
We can use str_match_all as follows:
library(stringr)  # provides str_match_all
x <- "Fabricante: EMS S/A CNPJ: - 57.507.378/0001-01 Endereço: SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante: EMS S/A CNPJ: - 57.507.378/0003-65 Endereço: HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante: NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ: - 12.424.020/0001-79 Endereço: MANAUS - AM - BRASIL Etapa de Fabricaçao:"
matches <- str_match_all(x, "(?<=\\bFabricante: ).*?(?= CNPJ:)")[[1]]
matches
     [,1]
[1,] "EMS S/A"
[2,] "EMS S/A"
[3,] "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA"
Here is an explanation of the regex pattern being used:
(?<=\\bFabricante: )  lookbehind and assert that Fabricante: precedes
.*?                   then match all content until reaching the nearest
(?= CNPJ:)            lookahead and assert that CNPJ: follows
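
To get the first, second, and third names as separate variables inside a mutate call, as the question asks, one possible approach is sketched below. It is only a sketch under some assumptions: the data frame df, its placeholder contents, and the output column names maker_1 to maker_3 are made up for illustration (only the column name manufacturing_location comes from the question), and str_extract_all is swapped in for str_match_all because the pattern has no capture groups, so a plain list of character vectors is enough.

```r
library(dplyr)
library(stringr)

# Hypothetical placeholder data; only the column name comes from the question.
df <- data.frame(
  manufacturing_location = c(
    "Fabricante: Alpha CNPJ: - 1 Fabricante: Beta CNPJ: - 2 Fabricante: Gamma CNPJ: - 3",
    "Fabricante: Delta CNPJ: - 4"
  ),
  stringsAsFactors = FALSE
)

# Same lookbehind / lazy match / lookahead pattern as in the answer above.
pattern <- "(?<=\\bFabricante: ).*?(?= CNPJ:)"

# One character vector of matches per row of df.
makers <- str_extract_all(df$manufacturing_location, pattern)

result <- df %>%
  mutate(
    # Indexing past the end of a vector yields NA, so rows with fewer
    # than three manufacturers get NA in the later columns.
    maker_1 = sapply(makers, `[`, 1),
    maker_2 = sapply(makers, `[`, 2),
    maker_3 = sapply(makers, `[`, 3)
  )

print(result[, c("maker_1", "maker_2", "maker_3")])
```

Rows with fewer than three matches end up with NA in the remaining columns, which is usually easier to handle downstream than empty strings.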