python string substring python-re

Match multiple substrings using findall from the re library


I have a large array in Python that contains strings in the following format:

some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
'ART_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE']

I just need to extract the substrings that start with MATH, SCIENCE and ART. This is what I'm currently using:

    import re

    # re.findall expects a string, so each element is searched in turn
    for s in some_array:
        my_str = re.findall('MATH_.*? ', s)
        if len(my_str) > 0:
            print(my_str)

        my_str = re.findall('SCIENCE_.*? ', s)
        if len(my_str) > 0:
            print(my_str)

        my_str = re.findall('ART_.*? ', s)
        if len(my_str) > 0:
            print(my_str)

It seems to work, but I was wondering if the findall function can look for more than one substring in the same line or maybe there is a cleaner way of doing it with another function.


Solution

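  • You can put the prefixes in an alternation, start with a word boundary to prevent a partial word match, and match optional non-whitespace characters after the underscore:

    \b(?:MATH|SCIENCE|ART)_\S*

    To include the trailing single space in the match, as the example below does, add a literal space at the end of the pattern.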

    Or, if only word characters (\w) should follow the prefix:

    \b(?:MATH|SCIENCE|ART)_\w*

    Example

    import re
    
    some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
                  'SCIENCE_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
                  'ART_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE']
    
    pattern = re.compile(r"\b(?:MATH|SCIENCE|ART)_\S* ")
    for s in some_array:
        print(pattern.findall(s))
    

    Output

    ['MATH_SOME_TEXT_AND_NUMBER ']
    ['SCIENCE_SOME_TEXT_AND_NUMBER ']
    ['ART_SOME_TEXT_AND_NUMBER ']
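
If a single flat list without the trailing space is preferable, something along these lines might work. This is only a sketch: the sample strings are hypothetical (one contains a hyphen, one starts with SMART) and are there to show how \S* and \w* differ and what the word boundary prevents.

```python
import re

# Hypothetical sample data, not the original array:
# - the second entry contains a hyphen, where \w* stops but \S* continues
# - the third entry starts with SMART, which \b keeps from matching as ART_
some_array = ['MATH_SOME_TEXT_AND_NUMBER MORE_TEXT  SOME_VALUE',
              'SCIENCE_SOME-TEXT_1 MORE_TEXT  SOME_VALUE',
              'SMART_SOME_TEXT MORE_TEXT  SOME_VALUE']

# No trailing space in either pattern, so the matches come back without it
non_space = re.compile(r"\b(?:MATH|SCIENCE|ART)_\S*")
word_only = re.compile(r"\b(?:MATH|SCIENCE|ART)_\w*")

# Flatten the per-string results; strings without a match contribute nothing
flat_non_space = [m for s in some_array for m in non_space.findall(s)]
flat_word_only = [m for s in some_array for m in word_only.findall(s)]

print(flat_non_space)  # ['MATH_SOME_TEXT_AND_NUMBER', 'SCIENCE_SOME-TEXT_1']
print(flat_word_only)  # ['MATH_SOME_TEXT_AND_NUMBER', 'SCIENCE_SOME']
```

Because the group is non-capturing, findall returns the whole match rather than just the prefix.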