
Python: Split string by list of separators


In Python, I'd like to split a string using a list of separators. The separators could be either commas or semicolons. Whitespace should be removed unless it is in the middle of non-whitespace, non-separator characters, in which case it should be preserved.

Test case 1: ABC,DEF123,GHI_JKL,MN OP
Test case 2: ABC;DEF123;GHI_JKL;MN OP
Test case 3: ABC ; DEF123,GHI_JKL ; MN OP

Sounds like a case for regular expressions, which is fine, but if it's easier or cleaner to do it another way that would be even better.

Thanks!


Solution

  • This avoids regex, takes a list of separators as you wanted, and is faster in the timings below (roughly 1.2x on the short test string and 2.6x on the long one):

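    def split(txt, seps):
        """Split txt on any separator in seps and strip each piece."""
        default_sep = seps[0]
    
        # we skip seps[0] because that's the default separator;
        # every other separator is rewritten to it before splitting
        for sep in seps[1:]:
            txt = txt.replace(sep, default_sep)
        return [i.strip() for i in txt.split(default_sep)]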
    

    How to use it:

    >>> split('ABC ; DEF123,GHI_JKL ; MN OP', (',', ';'))
    ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']
    

    Performance test:

    import timeit
    import re
    
    
    TEST = 'ABC ; DEF123,GHI_JKL ; MN OP'
    SEPS = (',', ';')
    
    
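    # re.escape keeps separators such as '|' or '.' from being read as regex syntax
    rsplit = re.compile("|".join(map(re.escape, SEPS))).split
    # timeit calls the lambda 1,000,000 times by default; results are in seconds
    print(timeit.timeit(lambda: [s.strip() for s in rsplit(TEST)]))
    # 1.6242462980007986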
    
    print(timeit.timeit(lambda: split(TEST, SEPS)))
    # 1.3588597209964064
    

    And with a much longer input string:

    TEST = 100 * 'ABC ; DEF123,GHI_JKL ; MN OP , '
    
    print(timeit.timeit(lambda: [s.strip() for s in rsplit(TEST)]))
    # 130.67168392999884
    
    print(timeit.timeit(lambda: split(TEST, SEPS)))
    # 50.31940778599528
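
If you would still rather use a regex, here is a minimal sketch. The name split_re is hypothetical and not part of the answer above, and the sketch assumes seps is non-empty. It escapes each separator with re.escape and lets the pattern absorb the whitespace around separators, so no per-item strip() is needed:

```python
import re


def split_re(txt, seps):
    # Hypothetical helper for illustration; assumes seps is non-empty.
    # re.escape makes separators such as '|' or '.' match literally.
    alternation = "|".join(re.escape(sep) for sep in seps)
    # \s* on both sides swallows whitespace next to each separator.
    pattern = re.compile(r"\s*(?:" + alternation + r")\s*")
    # strip() removes whitespace at the very start and end of the input;
    # whitespace inside a field (e.g. 'MN OP') is left alone.
    return pattern.split(txt.strip())


for case in (
    "ABC,DEF123,GHI_JKL,MN OP",
    "ABC;DEF123;GHI_JKL;MN OP",
    "ABC ; DEF123,GHI_JKL ; MN OP",
):
    print(split_re(case, (",", ";")))
```

All three test cases from the question print ['ABC', 'DEF123', 'GHI_JKL', 'MN OP']. Time it against split() on your own data before choosing it for speed.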