Extracting multiple chunks of string between patterns

This post asks how to extract a string between other two strings in R: Extracting a string between other two strings in R

I'm seeking a similar answer, but now covering multiple occurences between patterns.

Example string:

Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereço:  SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65  Endereço:  HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante:  NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ:  - 12.424.020/0001-79  Endereço:  MANAUS - AM - BRASIL Etapa de Fabricaçao:

Between each occurrence of the words "Fabricante" and "CNPJ", there is a company name, which I would like to extract. In this string, there are three such companies: "EMS S/A", "EMS S/A", and "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS".

Based on the post above, this code

gsub(".*Fabricante: *(.+) CNPJ:.*", "\\1", df$manufacturing_location[92])

returns the last occurrence, "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS".

When I change to

gsub(".*Fabricante: *(.*?) CNPJ:.*", "\\1", df$manufacturing_location[92])

it returns the first. I tried changing to \\2 as I thought this would number occurences, but then I get an empty string. I also tried using stringr's str_match_all, but it did not work too.

Anyone knows how to adjust the syntax so I can taylor the code to return each of the three as needed?

I would like to put this into a mutate syntax where I can pass this onto a dataset with many such strings, and return the first, second, and third entries as variables. For this, I have found I cannot make str_match_all work.

Solution

We can use str_match_all as follows:

x <- "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereço:  SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65  Endereço:  HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante:  NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ:  - 12.424.020/0001-79  Endereço:  MANAUS - AM - BRASIL Etapa de Fabricaçao:"
matches <- str_match_all(x, "(?<=\\bFabricante:  ).*?(?= CNPJ:)")[[1]]
matches

     [,1]                                                    
[1,] "EMS S/A"                                               
[2,] "EMS S/A"                                               
[3,] "NOVAMED FABRICA<U+00C7>AO DE PRODUTOS FARMACEUTICOS LTDA"

Here is an explanation of the regex pattern being used:

(?<=\\bFabricante: ) lookbehind and assert that Fabricante: precedes
.*? then match all content until reaching the nearest
(?= CNPJ:) lookahead and assert that CNPJ: follows