How do you split a string on a combination of delimiters, such as 1. or 2)?
For example, given a string like '1. I like food! 2. She likes 2 balloons.', how can you separate such a sentence?
As another example, given the input
'1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
the output should be
['3D Technical', 'Process animations', 'Explainer videos', 'Product launch videos']
I tried:
a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
re.split(r'[1.2.3.,1)2)3)/]+|etc', a)
The output was:
['',
'D Technical',
'Process animations',
' Explainer videos',
' Product launch videos']
Here is a way to get the expected result:
import re
a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
r = [s for s in map(str.strip, re.split(r',? *[0-9]+(?:\)|\.) ?', a)) if s]
print(*r, sep='\n')
3D Technical/Process animations
Explainer videos
Product launch videos
The pattern r',? *[0-9]+(?:\)|\.) ?' for the separators can be broken down as follows:
',?' : an optional comma before the number (the one trailing the previous item)
' *' : zero or more spaces preceding the number
'[0-9]+' : a sequence of at least one digit
'(?:\)|\.)' : followed by a closing parenthesis or a period. The ?: at the beginning makes it a non-capturing group so that re.split doesn't include it in the output
' ?' : an optional space after the parenthesis or period (you may want to remove the ? or replace it with a + depending on your actual data)
The output of re.split is mapped to str.strip to remove leading/trailing spaces. This is inside a list comprehension that filters out empty strings (e.g. the one preceding the first separator).
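The breakdown above can also be written directly into the pattern using re.VERBOSE, which lets each part carry its own comment. This is only a sketch of the same separator pattern; the name SEPARATOR is made up for illustration, and the literal spaces are written as [ ] because verbose mode ignores unescaped whitespace.

```python
import re

# Same separator pattern as above, in verbose mode (SEPARATOR is a hypothetical name).
# Unescaped whitespace is ignored in verbose mode, so literal spaces become [ ].
SEPARATOR = re.compile(r"""
    ,?            # optional comma that ends the previous item
    [ ]*          # zero or more spaces before the number
    [0-9]+        # the item number: one or more digits
    (?:\)|\.)     # a closing parenthesis or a period, non-capturing
    [ ]?          # optional space after the parenthesis or period
""", re.VERBOSE)

a = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
parts = [s for s in map(str.strip, SEPARATOR.split(a)) if s]
print(parts)
# ['3D Technical/Process animations', 'Explainer videos', 'Product launch videos']
```

Compiling the pattern once also avoids re-parsing it if you split many strings.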
If commas or slashes without the numbering are also used as separators, you can add them to the pattern as alternatives:
def splitItems(a):
    # split on '/' or ',' alone, or on the numbered separator described above
    pattern = r'/|,|(?:,? *[0-9]+(?:\)|\.) ?)'
    return [s for s in map(str.strip, re.split(pattern, a)) if s]
Output:
a = '3D Technical/Process animations, Explainer videos, Product launch videos'
print(*splitItems(a),sep='\n')
3D Technical
Process animations
Explainer videos
Product launch videos
a = '1. Hello 2. Hi'
print(*splitItems(a),sep='\n')
Hello
Hi
a = "Great, what's up?! , Awesome"
print(*splitItems(a),sep='\n')
Great
what's up?!
Awesome
a = '1. Medicines2. Devices 3.Products'
print(*splitItems(a),sep='\n')
Medicines
Devices
Products
a = 'ABC/DEF/FGH'
print(*splitItems(a),sep='\n')
ABC
DEF
FGH
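As a quick check against the question's own input, the combined pattern can be applied to the numbered string that also contains a slash. This is a minimal sketch: COMBINED and split_items_combined are hypothetical names, and the pattern is the same one used in splitItems above.

```python
import re

# Hypothetical names; the pattern is the combined one from splitItems above.
COMBINED = re.compile(r'/|,|(?:,? *[0-9]+(?:\)|\.) ?)')

def split_items_combined(text):
    """Split on '/', ',' or a numbered marker, then drop empty pieces."""
    return [s for s in map(str.strip, COMBINED.split(text)) if s]

question_input = '1) 3D Technical/Process animations, 2) Explainer videos, 3) Product launch videos'
print(split_items_combined(question_input))
# ['3D Technical', 'Process animations', 'Explainer videos', 'Product launch videos']
```

Note that with this pattern every comma and slash is a separator, so an item that legitimately contains a comma will be cut in two.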
If your separators are a list of either-or patterns (meaning only one pattern applies consistently for a given string), then you can try them in order of precedence in a loop and return the first split that produces more than one part:
def splitItems(a):
    # try each separator pattern in order of precedence
    for pattern in (r'(?:,? *[0-9]+(?:\)|\.) ?)', r',', r'/'):
        result = [*map(str.strip, re.split(pattern, a))]
        if len(result) > 1:  # this pattern actually split the string
            break
    return [s for s in result if s]
Output:
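# Same results as above for the numbered, comma-only, and slash-only inputs (a comma-separated input keeps any '/' inside its items), plus this one: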
a = '1. Arrangement of Loans for Listed Corporates and their Group Companies, 2. Investment Services wherein we assist Corporates, Family Offices, Business Owners and Professionals to invest their Surplus Funds to invest in different products such as Stocks, Mutual Funds, Bonds, Fixed Deposit, Gold Bonds,PMS etc 3. Estate Planning'
print(*splitItems(a),sep='\n')
Arrangement of Loans for Listed Corporates and their Group Companies
Investment Services wherein we assist Corporates, Family Offices, Business Owners and Professionals to invest their Surplus Funds to invest in different products such as Stocks, Mutual Funds, Bonds, Fixed Deposit, Gold Bonds,PMS etc
Estate Planning
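To make the difference between the two strategies concrete, here is a small side-by-side sketch on an input that mixes all three kinds of separators. The names NUMBERED, split_all, split_first_match and the sample string are invented for illustration; the patterns are the ones used above.

```python
import re

# The numbered-separator pattern from above (NUMBERED is a hypothetical name).
NUMBERED = r'(?:,? *[0-9]+(?:\)|\.) ?)'

def split_all(text):
    # every separator is active at once
    pattern = r'/|,|' + NUMBERED
    return [s for s in map(str.strip, re.split(pattern, text)) if s]

def split_first_match(text):
    # only the first pattern that actually splits the string is used
    for pattern in (NUMBERED, r',', r'/'):
        result = [s.strip() for s in re.split(pattern, text)]
        if len(result) > 1:
            break
    return [s for s in result if s]

mixed = '1) Stocks, Bonds 2) Gold/Silver'  # made-up sample input
print(split_all(mixed))          # ['Stocks', 'Bonds', 'Gold', 'Silver']
print(split_first_match(mixed))  # ['Stocks, Bonds', 'Gold/Silver']
```

Which one fits depends on the data: the first-match version preserves commas and slashes inside numbered items, while the all-at-once version is the one that reproduces the output requested in the question.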