
Extracting multiple chunks of string between patterns


This post asks how to extract a string between two other strings in R: "Extracting a string between other two strings in R".

I'm seeking a similar answer, but one that covers multiple occurrences between the patterns.

Example string:

Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereço:  SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65  Endereço:  HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante:  NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ:  - 12.424.020/0001-79  Endereço:  MANAUS - AM - BRASIL Etapa de Fabricaçao:

Between each occurrence of the words "Fabricante" and "CNPJ", there is a company name, which I would like to extract. In this string, there are three such companies: "EMS S/A", "EMS S/A", and "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA".

Based on the post above, this code

gsub(".*Fabricante: *(.+) CNPJ:.*", "\\1", df$manufacturing_location[92])

returns the last occurrence, "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA".

When I change to

gsub(".*Fabricante: *(.*?) CNPJ:.*", "\\1", df$manufacturing_location[92])

it returns the first. I tried changing to \\2, as I thought this would number the occurrences, but then I get an empty string. I also tried using stringr's str_match_all, but that did not work either.

Does anyone know how to adjust the syntax so I can tailor the code to return each of the three as needed?

I would like to use this inside mutate so that I can apply it to a dataset with many such strings and return the first, second, and third entries as variables. So far, I have not been able to make str_match_all work for this.


Solution

  • We can use str_match_all from the stringr package as follows:

    library(stringr)  # provides str_match_all()
    x <- "Fabricante:  EMS S/A CNPJ:  - 57.507.378/0001-01  Endereço:  SAO BERNARDO DO CAMPO - SP - BRASIL Etapa de Fabricaçao: Fabricante:  EMS S/A CNPJ:  - 57.507.378/0003-65  Endereço:  HORTOLANDIA - SP - BRASIL Etapa de Fabricaçao: Fabricante:  NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA CNPJ:  - 12.424.020/0001-79  Endereço:  MANAUS - AM - BRASIL Etapa de Fabricaçao:"
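    # The lookarounds keep only the text between the two markers;
    # [[1]] takes the match matrix for the single input string
    matches <- str_match_all(x, "(?<=\\bFabricante:  ).*?(?= CNPJ:)")[[1]]
    matches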
    
         [,1]
    [1,] "EMS S/A"
    [2,] "EMS S/A"
    [3,] "NOVAMED FABRICAÇAO DE PRODUTOS FARMACEUTICOS LTDA"
    

    Here is an explanation of the regex pattern being used:

    • (?<=\\bFabricante:  ) is a lookbehind asserting that "Fabricante:" followed by two spaces, starting at a word boundary, immediately precedes the match
    • .*? then lazily matches as little content as possible, stopping at the nearest
    • (?= CNPJ:), a lookahead asserting that " CNPJ:" follows
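
For the mutate use case raised in the question, one possible sketch is shown below. It is not the only way to do this. It assumes dplyr is installed, uses a made-up two-row data frame df whose company names and CNPJ numbers are placeholders, and swaps the lookaround pattern for a capture group with \\s* so that the number of spaces after "Fabricante:" does not matter. The helper column makers and the output names maker_1 to maker_3 are illustrative and do not come from the original post.

```r
library(dplyr)
library(stringr)

# Hypothetical data: the column name comes from the question, but the
# company names and CNPJ numbers are placeholders, not real records.
df <- data.frame(
  manufacturing_location = c(
    "Fabricante:  EMPRESA A S/A CNPJ:  - 00.000.000/0001-00  Fabricante:  EMPRESA B LTDA CNPJ:  - 00.000.000/0002-00  Fabricante:  EMPRESA C S/A CNPJ:  - 00.000.000/0003-00",
    "Fabricante:  EMPRESA A S/A CNPJ:  - 00.000.000/0001-00"
  ),
  stringsAsFactors = FALSE
)

# Capture group variant of the pattern: the lazy group (.*?) holds the name,
# and \\s* absorbs any whitespace around it.
pattern <- "Fabricante:\\s*(.*?)\\s*CNPJ:"

result <- mutate(
  df,
  # str_match_all() returns one matrix per row; column 2 is the capture group
  makers = lapply(str_match_all(manufacturing_location, pattern),
                  function(m) m[, 2]),
  # Indexing past the end of a character vector gives NA, so rows with
  # fewer than three companies are padded with NA
  maker_1 = vapply(makers, function(v) v[1], character(1)),
  maker_2 = vapply(makers, function(v) v[2], character(1)),
  maker_3 = vapply(makers, function(v) v[3], character(1))
)

# Drop the intermediate list column once the variables are extracted
result <- select(result, -makers)
```

Each row keeps its own set of matches, so strings with fewer than three manufacturers simply get NA in the unused columns. If a string could contain more than three, further maker_n columns would have to be added in the same way.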