Regex splitting of multiple grouped delimiters


How do you group a combination of delimiters, such as 1. or 2)?

For example, given a string like '1. I like food! 2. She likes 2 balloons.', how can you split it into its numbered sentences?

As another example, given the input

'1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'

the output should be

['3D Technical', 'Process animations', 'Explainer videos', 'Product launch videos']

I tried:

import re

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
re.split(r'[1.2.3.,1)2)3)/]+|etc', a)

The output was:

['',
 'D Technical',
 'Process animations',
 ' Explainer videos',
 ' Product launch videos']

Solution

  • Here is a way to split on the numbered markers such as 1) or 2. (slashes are left alone at this stage):

    import re
    
    a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
    r = [s for s in map(str.strip,re.split(r',? *[0-9]+(?:\)|\.) ?', a)) if s]
    
    print(*r,sep='\n')
    3D Technical/Process animations
    Explainer videos
    Product launch videos
    
    • The pattern r',? *[0-9]+(?:\)|\.) ?' for the separators can be broken down as follows:
      • ,? an optional comma left over from the end of the previous item
      • ' *' (a space followed by *) zero or more spaces preceding the number
      • [0-9]+ a sequence of at least one digit
      • (?:\)|\.) a closing parenthesis or a period. The ?: at the beginning makes it a non-capturing group, so re.split doesn't include the matched text in the output
      • ' ?' (a space followed by ?) an optional space after the parenthesis or period (you may want to remove the ? or replace it with a + depending on your actual data)

    The output of re.split is mapped through str.strip to remove leading/trailing spaces. This happens inside a list comprehension that filters out empty strings (e.g. the one preceding the first separator).
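
The breakdown above can also be written directly into the pattern with re.VERBOSE, which may be easier to maintain. This is only a sketch of the same separator: the names SEPARATOR and split_numbered are made up for illustration, and [).] is used as an equivalent spelling of (?:\)|\.). In verbose mode bare whitespace is ignored, so each literal space is written as [ ].

```python
import re

# Same separator as r',? *[0-9]+(?:\)|\.) ?', spelled out in verbose mode.
# SEPARATOR and split_numbered are illustrative names, not from the answer above.
SEPARATOR = re.compile(r"""
    ,?          # optional comma that ends the previous item
    [ ]*        # zero or more spaces before the number
    [0-9]+      # the item number: one or more digits
    [).]        # closing parenthesis or period after the number
    [ ]?        # optional single space after the marker
""", re.VERBOSE)

def split_numbered(text):
    """Split text on numbered markers, dropping empty and blank pieces."""
    parts = (p.strip() for p in SEPARATOR.split(text))
    return [p for p in parts if p]

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
print(*split_numbered(a), sep='\n')
```

A digit that is not followed by ) or . (such as the 3 in 3D) is not treated as a marker, so it stays inside its item.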

    If commas or slashes without the numbering are also used as separators, you can add them to the pattern as alternatives:

    def splitItems(a):
        pattern = r'/|,|(?:,? *[0-9]+(?:\)|\.) ?)'
        return [s for s in map(str.strip,re.split(pattern, a)) if s]
    

    Output:

    a = '3D Technical/Process animations, Explainer videos, Product launch videos'
    print(*splitItems(a),sep='\n')
    
    3D Technical
    Process animations
    Explainer videos
    Product launch videos
    
    
    a = '1. Hello 2. Hi'
    print(*splitItems(a),sep='\n')
    Hello
    Hi
    
    a = "Great, what's up?! , Awesome"
    print(*splitItems(a),sep='\n')
    Great
    what's up?!
    Awesome
    
    a = '1. Medicines2. Devices 3.Products'
    print(*splitItems(a),sep='\n')
    Medicines
    Devices
    Products
    
    a = 'ABC/DEF/FGH'
    print(*splitItems(a),sep='\n')
    ABC
    DEF
    FGH
    

    If your separators are a list of either-or patterns (meaning only one pattern applies consistently for a given string), then you can try them in order of precedence in a loop and return the first split that produces more than one part:

    def splitItems(a):
        for pattern in ( r'(?:,? *[0-9]+(?:\)|\.) ?)', r',', r'/' ):
            result = [*map(str.strip,re.split(pattern, a))]
            if len(result)>1: break
        return [s for s in result if s]
    

    Output:

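    # The examples above still work. Note that in the first one, '3D Technical/Process animations'
    # stays in one piece here: the comma pattern already splits that string, so '/' is never tried.
    # Commas inside a numbered item are now preserved too, as in this one: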
    
    a = '1. Arrangement of Loans for Listed Corporates and their Group Companies, 2. Investment Services wherein we assist Corporates, Family Offices, Business Owners and Professionals to invest their   Surplus Funds to invest in different products such as Stocks, Mutual Funds, Bonds, Fixed Deposit, Gold Bonds,PMS etc 3. Estate Planning'
    print(*splitItems(a),sep='\n')
    
    Arrangement of Loans for Listed Corporates and their Group Companies
    Investment Services wherein we assist Corporates, Family Offices, Business Owners and Professionals to invest their   Surplus Funds to invest in different products such as Stocks, Mutual Funds, Bonds, Fixed Deposit, Gold Bonds,PMS etc
    Estate Planning
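
To make the difference between the two strategies concrete, here is a small side-by-side sketch. The names NUMBERED, split_any and split_first_match are hypothetical; the patterns are the ones used above, and the sample strings are made up for the comparison.

```python
import re

NUMBERED = r',? *[0-9]+(?:\)|\.) ?'

def split_any(text):
    """Treat '/', ',' and numbered markers all as separators at once."""
    pattern = r'/|,|(?:' + NUMBERED + r')'
    return [s for s in map(str.strip, re.split(pattern, text)) if s]

def split_first_match(text):
    """Try numbered markers, then ',', then '/'; keep the first that splits."""
    for pattern in (NUMBERED, r',', r'/'):
        parts = [s.strip() for s in re.split(pattern, text)]
        if len(parts) > 1:
            break
    return [s for s in parts if s]

sample = '1. Loans, 2. Funds such as Stocks, Bonds 3. Estate Planning'
print(split_any(sample))
# ['Loans', 'Funds such as Stocks', 'Bonds', 'Estate Planning']
print(split_first_match(sample))
# ['Loans', 'Funds such as Stocks, Bonds', 'Estate Planning']
```

The combined pattern breaks an item at every comma or slash, while the first-match loop only falls back to commas or slashes when no numbered marker is found. Which one fits depends on whether commas can appear inside an item in your data.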