python, python-3.x, regex, regex-group

Getting the same regex groups inside a block of text


I am trying to write a pattern that captures each CNPJ group inside this block of text, with the condition that the match must start with executados: and end with a CNPJ group. However, my pattern always captures only the last group, and I don't know how to make it work.

The answer to "getting specific groups of patterns inside a block text" does not work in this case.

regex101

pattern: (?:executados\:)[\p{L}\s\D\d]+CNPJ\W+(?P<cnpj>\d+\.\d+\.\d+\/\d+-\d+)

string to test:

Dados dos executados:
1. FOO TEST STRING LTDA., CNPJ: 88.888.888/8888-88,
2. ANOTHER TEST STRING LTDA LTDA LTDA - ME, CNPJ: 99.999.999/9999-99,
3. FOO TEST STRING LTDA., CPF: 999.999.999-99,
4. FOO TEST STRING LTDA., CPF: 999.999.999-99.
Como medida de economia e celeridade processuais, atribuo a

I would like to get the values {'cnpj': ['88.888.888/8888-88', '99.999.999/9999-99']}, but this pattern returns only the last one.
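The "only the last group" behaviour can be reproduced with the standard re module. The sketch below uses a simplified stand-in for the pattern above: [\s\S]+ replaces [\p{L}\s\D\d]+, because re does not support \p{L}. That substitution is an assumption made for illustration, not the exact pattern from the question. Because + is greedy, it consumes everything up to the last CNPJ, and since the whole pattern can match only once, only one value is captured.

```python
import re

text = """Dados dos executados:
1. FOO TEST STRING LTDA., CNPJ: 88.888.888/8888-88,
2. ANOTHER TEST STRING LTDA LTDA LTDA - ME, CNPJ: 99.999.999/9999-99,
3. FOO TEST STRING LTDA., CPF: 999.999.999-99,
4. FOO TEST STRING LTDA., CPF: 999.999.999-99.
Como medida de economia e celeridade processuais, atribuo a"""

# Simplified stand-in for the question's pattern (assumption: the character
# class is replaced by [\s\S], which the stdlib re module can compile).
greedy = re.compile(r"executados:[\s\S]+CNPJ\W+(?P<cnpj>\d+\.\d+\.\d+/\d+-\d+)")

# The pattern matches once, spanning from "executados:" to the last CNPJ,
# so only the final CNPJ ends up in the named group.
greedy_matches = [m.group("cnpj") for m in greedy.finditer(text)]
print(greedy_matches)  # ['99.999.999/9999-99']
```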


Solution

  • You can use the PyPI regex module with a regex like

    (?s)(?<=executados:.*?)CNPJ\W+(\d+\.\d+\.\d+/\d+-\d+)
    

    See the regex demo.

    Here is the Python demo:

    import regex
    text = """Dados dos executados:
    1. FOO TEST STRING LTDA., CNPJ: 99.999.999/9999-99,
    2. ANOTHER TEST STRING LTDA LTDA LTDA - ME, CNPJ: 99.999.999/9999-99,
    3. FOO TEST STRING LTDA., CPF: 999.999.999-99,
    4. FOO TEST STRING LTDA., CPF: 999.999.999-99.
    Como medida de economia e celeridade processuais, atribuo a"""
    print(regex.findall(r'(?s)(?<=executados:.*?)CNPJ\W+(\d+\.\d+\.\d+/\d+-\d+)', text))
    

    yielding

    ['99.999.999/9999-99', '99.999.999/9999-99']
    

    The regex matches

    • (?s) - regex.DOTALL, enables . to match line break chars
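    • (?<=executados:.*?) - a positive lookbehind: somewhere before the current location, there must be executados: followed by any zero or more chars, as few as possible (this variable-width lookbehind is supported by regex, but not by the built-in re module)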
    • CNPJ - a fixed string
    • \W+ - one or more non-word chars
    • (\d+\.\d+\.\d+/\d+-\d+) - Group 1, which is what regex.findall returns: one or more digits followed by a ., twice, then one or more digits, /, one or more digits, - and one or more digits.
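
If installing the third-party regex package is not an option, a similar result can be sketched with the standard re module. This is a different technique from the lookbehind above: it first finds executados: and then searches only the text that follows it. The helper name extract_cnpjs is hypothetical, and the sample is the test string from the question.

```python
import re

# Same CNPJ sub-pattern as in the answer, without the lookbehind.
CNPJ_RE = re.compile(r"CNPJ\W+(\d+\.\d+\.\d+/\d+-\d+)")


def extract_cnpjs(text):
    """Return {'cnpj': [...]} for every CNPJ found after 'executados:'.

    Hypothetical helper, not part of the original answer.
    """
    # partition() splits on the first occurrence of the marker; if the
    # marker is absent, the second element is an empty string.
    _, marker, tail = text.partition("executados:")
    if not marker:
        return {"cnpj": []}
    # findall() with a single capturing group returns that group's text.
    return {"cnpj": CNPJ_RE.findall(tail)}


text = """Dados dos executados:
1. FOO TEST STRING LTDA., CNPJ: 88.888.888/8888-88,
2. ANOTHER TEST STRING LTDA LTDA LTDA - ME, CNPJ: 99.999.999/9999-99,
3. FOO TEST STRING LTDA., CPF: 999.999.999-99,
4. FOO TEST STRING LTDA., CPF: 999.999.999-99.
Como medida de economia e celeridade processuais, atribuo a"""

result = extract_cnpjs(text)
print(result)  # {'cnpj': ['88.888.888/8888-88', '99.999.999/9999-99']}
```

Splitting on the marker first keeps the pattern itself simple, at the cost of a second pass over the text; a CNPJ that appears before executados: is ignored, as with the lookbehind version.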