I have the following strings:
"LP, bar, company LLP, foo, LLP"
"LLP, bar, company LLP, foo, LP"
"LLP,bar, company LLP, foo,LP" # note the absence of a space after/before comma to be removed
I am looking for a regex that takes those inputs and returns the following:
"LP bar, company LLP, foo LLP"
"LLP bar, company LLP, foo LP"
"LLP bar, company LLP, foo LP"
What I have so far is this:
import re
def fix_broken_entity_names(name):
    """
    LLP, NAME -> LLP NAME
    NAME, LP -> NAME LP
    """
    pattern_end = r'^(LL?P),'
    pattern_beg_1 = r', (LL?P)$'
    pattern_beg_2 = r',(LL?P)$'
    combined = r'|'.join((pattern_beg_1, pattern_beg_2, pattern_end))
    return re.sub(combined, r' \1', name)
When I run it tho:
>>> fix_broken_entity_names("LP, bar, company LLP, foo,LP")
Out[1]: ' bar, company LLP, foo '
I'd be very thankful for any tips or solutions :)
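A likely reason the attempt above loses the LP/LLP tokens: joining three patterns with | gives each alternative its own group number (1, 2 and 3), so \1 is only filled when the first alternative is the one that matched. On Python 3.5+ an unmatched group expands to an empty string in the replacement. The snippet below is a small diagnostic sketch, not a final solution; the variable names matches and patched are only illustrative.

```python
import re

# The three alternatives from the question, joined in the same order.
combined = r'|'.join((r', (LL?P)$', r',(LL?P)$', r'^(LL?P),'))
s = "LP, bar, company LLP, foo,LP"

# For every match, record the matched text and the contents of groups 1..3.
matches = [(m.group(0), m.groups()) for m in re.finditer(combined, s)]
for text, groups in matches:
    print(repr(text), groups)
# Group 1 is None in both matches, which is why r' \1' inserts only a space.

# Referencing all three groups restores the token; strip() drops the
# leading space left by the replacement at the start of the string.
patched = re.sub(combined, r' \1\2\3', s).strip()
print(repr(patched))
```

Referencing every group works but gets unwieldy as alternatives are added, which is one reason to merge the two trailing patterns with \s*.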
You can use
import re
texts = ["LP, bar, company LLP, foo, LLP","LLP, bar, company LLP, foo, LP","LLP,bar, company LLP, foo,LP"]
for text in texts:
    result = ' '.join(re.sub(r"^(LL?P)\s*,|,\s*(LL?P)$", r" \1\2 ", text).split())
    print("'{}' -> '{}'".format(text, result))
Output:
'LP, bar, company LLP, foo, LLP' -> 'LP bar, company LLP, foo LLP'
'LLP, bar, company LLP, foo, LP' -> 'LLP bar, company LLP, foo LP'
'LLP,bar, company LLP, foo,LP' -> 'LLP bar, company LLP, foo LP'
See a Python demo. The regex is ^(LL?P)\s*,|,\s*(LL?P)$:
- ^(LL?P)\s*, - start of string, LLP or LP (Group 1), zero or more whitespaces, a comma
- | - or
- ,\s*(LL?P)$ - a comma, zero or more whitespaces, LP or LLP (Group 2) and then the end of string.
Note that the replacement is the concatenation of the Group 1 and Group 2 values enclosed within single spaces. Only one of the two groups participates in any given match, so the other contributes an empty string. The post-processing step, ' '.join(....split()), removes all leading/trailing whitespace and shrinks whitespace runs inside the string to single spaces.
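If you want to keep the function shape from the question, the same substitution can be wrapped up as a minimal sketch like the one below. The names SUFFIX_COMMA and fix_entity_name are made up for illustration, and the code relies on Python 3.5+, where an unmatched group expands to an empty string in the replacement.

```python
import re

# Hypothetical constant name; the pattern itself is the one explained above.
SUFFIX_COMMA = re.compile(r"^(LL?P)\s*,|,\s*(LL?P)$")

def fix_entity_name(name):
    """Drop the comma after a leading LP/LLP and before a trailing LP/LLP."""
    # Only one group is set per match; the other expands to '' (Python 3.5+).
    spaced = SUFFIX_COMMA.sub(r" \1\2 ", name)
    # split() + join() trims both ends and collapses whitespace runs.
    return " ".join(spaced.split())

examples = {
    "LP, bar, company LLP, foo, LLP": "LP bar, company LLP, foo LLP",
    "LLP, bar, company LLP, foo, LP": "LLP bar, company LLP, foo LP",
    "LLP,bar, company LLP, foo,LP": "LLP bar, company LLP, foo LP",
}
for raw, expected in examples.items():
    assert fix_entity_name(raw) == expected
```

The comma inside "company LLP, foo" is left alone because neither alternative can match there: the first is anchored to the start of the string and the second to the end.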