I am new to Python regular expressions. Here is my text:
'Condition: Remanufactured Grade: Commercial Warranty: 1 Year Parts & On-Site Labor w/Ext. Ships: Fully Assembled Processing Time: Ships from our Warehouse in 2-4 Weeks
I want to add commas using a Python regular expression, so that the result looks like this:
'Condition: Remanufactured ,Grade: Commercial ,Warranty: 1 Year Parts & On-Site Labor w/Ext. Ships: Fully Assembled ,Processing Time: Ships from our Warehouse in 2-4 Weeks
Basically, I want to target the words that are followed by a colon and add a comma before each of them, starting from the second one.
Honestly, I wouldn't do this with a regular expression, in large part because of your "Processing Time" example, which makes it look like you've got a problem that can only be solved by knowing the specific strings to expect.
Code can't magically know that "Processing " is more tightly bound to "Time" than to "Fully Assembled".
So I see basically three solution shapes, and I'm just going to focus on the first one because I think it's the best one, but I'll briefly summarize all three:
1. Keep a list of the known field names that make the comma insertion harder, and replace those strings with easier stand-ins just for the duration of your comma-insertion logic. This frees your comma-insertion logic to be simpler and regular.
2. Get a list of all known field names, and look for them specifically to insert commas in front of them. This is probably worse, but if the list of names doesn't change and isn't expected to change, and most names are tricky, then this could be cleaner.
3. Throw a modern language-modeling text-prediction AI at the problem: given an ambiguous string like "...: Fully Assembled Processing Time: ..." you could prompt your AI with "Assembled" and see how much confidence it gives to the next tokens being "Processing Time", then prompt it with "Processing" and see how much confidence it gives to the next token being "Time", and pick whichever it is more confident about as your field name. I think this is overkill unless you really get so few guarantees about your input that you have to treat it like a natural language processing problem.
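For comparison, here is a rough sketch of what option 2 could look like. The `KNOWN_FIELDS` list and the function name are assumptions made up for illustration; it only works if every field name that can appear really is in the list.

```python
import re

# Hypothetical list of every field name that can appear in the text.
KNOWN_FIELDS = ["Condition", "Grade", "Warranty", "Ships", "Processing Time"]

def insert_commas_before_known_fields(text, fields=KNOWN_FIELDS):
    # Try longer names first so a long name wins over a shorter one it contains.
    ordered = sorted(fields, key=len, reverse=True)
    alternation = "|".join(re.escape(name) for name in ordered)
    # Match a space, then a known field name, then its colon and a space.
    pattern = re.compile(rf" ((?:{alternation}): )")
    return pattern.sub(r" ,\1", text)

sample = ("Condition: Remanufactured Grade: Commercial Warranty: 1 Year Parts & "
          "On-Site Labor w/Ext. Ships: Fully Assembled Processing Time: "
          "Ships from our Warehouse in 2-4 Weeks")
result = insert_commas_before_known_fields(sample)
print(result)
```

Because "Processing Time" is matched as a whole name, no masking step is needed here, at the cost of having to maintain the full list.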
So I would do option 1, and the general idea looks something like this:
tricky_fields = {
    "Processing Time": "ProcessingTime",
    # add others here as needed
}
# temporarily swap each tricky name for a stand-in that contains no spaces
for proper_name, easier_name in tricky_fields.items():
    my_text = my_text.replace(f" {proper_name}: ", f" {easier_name}: ")
# do the actual comma insertions here
# restore the proper names; by now a comma, not a space, precedes each field
for proper_name, easier_name in tricky_fields.items():
    my_text = my_text.replace(f",{easier_name}: ", f",{proper_name}: ")
Notice that I put spaces and the colon around the field names in the replacements. If you know that your fields are always separated by spaces and colons like that, this is better practice because it's less likely to automatically replace something you didn't mean to replace, and thus less likely to be a source of bugs later.
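To make the point about anchoring concrete, here is a small sketch. The input string is invented for illustration: it is a case where a value happens to mention a field name.

```python
# Hypothetical text in which a value happens to contain a field name.
text = "Notes: Processing Time varies Processing Time: 2-4 Weeks"

# Bare replacement: rewrites every occurrence, including the one inside a value.
bare = text.replace("Processing Time", "ProcessingTime")

# Anchored replacement: only rewrites the occurrence that is used as a field name.
anchored = text.replace(" Processing Time: ", " ProcessingTime: ")

print(bare)      # Notes: ProcessingTime varies ProcessingTime: 2-4 Weeks
print(anchored)  # Notes: Processing Time varies ProcessingTime: 2-4 Weeks
```

One caveat of the anchored form: a field at the very start of the text has no leading space, so it would not be matched. That doesn't matter here, because the first field never gets a comma anyway.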
Then the comma insertion itself becomes an easy regex, provided none of your replacement names contain spaces or colons, because your target is just `[^ :]+:`. But regex is a cryptic micro-language that is not optimized for human readability, and this doesn't need to be a regex: you can just split on `:`, then split each result of that split on its last space, rejoin those two pieces with ` ,` or `, `, and then rejoin the whole thing with `:`:
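def insert_commas(text):
    parts = text.split(":")
    new_parts = [parts[0]]  # the first field name has nothing before it
    # each middle part is "<value> <next field name>": split off the last word
    for part in parts[1:-1]:
        most, last = part.rsplit(" ", 1)
        new_parts.append(" ,".join((most, last)))
    # the final part is only a value, so it is kept unchanged
    if len(parts) > 1:
        new_parts.append(parts[-1])
    return ":".join(new_parts)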
But if you really wanted to use a regex, here's a simple one that does what you want:
import re

def insert_commas(text):
    # a space, then a run of non-space, non-colon characters ending in ": "
    return re.sub(r' ([^ :]+: )', r' ,\1', text)
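As a quick check of what this regex does with and without the tricky-name replacement, here is a sketch run on a shortened piece of the text from the question:

```python
import re

def insert_commas(text):
    return re.sub(r" ([^ :]+: )", r" ,\1", text)

sample = "Ships: Fully Assembled Processing Time: Ships from our Warehouse in 2-4 Weeks"

# Without masking, the comma lands in the middle of the two-word field name.
unmasked = insert_commas(sample)
print(unmasked)
# Ships: Fully Assembled Processing ,Time: Ships from our Warehouse in 2-4 Weeks

# With the tricky name swapped for a one-word stand-in first, it lands in front.
masked = insert_commas(sample.replace(" Processing Time: ", " ProcessingTime: "))
print(masked)
# Ships: Fully Assembled ,ProcessingTime: Ships from our Warehouse in 2-4 Weeks
```

The regex only looks at the last word before each colon, which is exactly why multi-word field names have to be dealt with separately.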
Although in real production code I'd improve the tricky-field replacements by factoring the two replacements out into one separate, testable function and using something like `bidict` instead of a regular dictionary, like this:
from bidict import bidict

tricky_fields = bidict({
    "Processing Time": "ProcessingTime",
    # add others here as needed
})

def replace_fields(names, text):
    for old_name, new_name in names.items():
        # a field name follows a space before comma insertion and a comma after it
        for prefix in (" ", ","):
            text = text.replace(f"{prefix}{old_name}: ", f"{prefix}{new_name}: ")
    return text
Using a `bidict` and a dedicated function is clearer, more self-descriptive, more maintainable, less code to keep consistent, and easier to test/verify, and it even gets you some runtime safety against accidentally mapping two tricky field names to the same replacement field.
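If adding a third-party dependency isn't an option, a similar duplicate check can be approximated with a plain dictionary. This is only a sketch of the idea, not how `bidict` itself works, and `invert_unique` is a name invented for this example:

```python
def invert_unique(mapping):
    """Return the value -> key mapping, refusing duplicate values."""
    inverse = {}
    for key, value in mapping.items():
        if value in inverse:
            # Two different names would collapse to the same stand-in.
            raise ValueError(
                f"{key!r} and {inverse[value]!r} both map to {value!r}"
            )
        inverse[value] = key
    return inverse

tricky_fields = {"Processing Time": "ProcessingTime"}
restore_fields = invert_unique(tricky_fields)
print(restore_fields)  # {'ProcessingTime': 'Processing Time'}
```

Building the inverse once, up front, means a bad mapping fails loudly at startup instead of silently producing text that can't be restored.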
So composing those two previous code blocks together:
text = replace_fields(tricky_fields, text)
text = insert_commas(text)
text = replace_fields(tricky_fields.inverse, text)
Of course, if you don't need the second replacement to undo the initial one, you can just leave the text as-is after comma insertion is done. Either way, this approach separates the comma problem from the problem of tricky names, which is what makes the comma problem harder/complected.
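For reference, here is the whole idea as one self-contained sketch that runs on the text from the question. It uses a plain dictionary instead of `bidict` so that it has no dependencies, and the `prefix` parameter and `add_commas` name are additions for this example. Note that after comma insertion each stand-in name is preceded by a comma rather than a space, so the restore step matches on `,`.

```python
import re

TRICKY_FIELDS = {"Processing Time": "ProcessingTime"}

def replace_fields(names, text, prefix=" "):
    # Swap field names, anchored by the character before them and the ": " after.
    for old_name, new_name in names.items():
        text = text.replace(f"{prefix}{old_name}: ", f"{prefix}{new_name}: ")
    return text

def insert_commas(text):
    return re.sub(r" ([^ :]+: )", r" ,\1", text)

def add_commas(text, tricky=TRICKY_FIELDS):
    restore = {masked: proper for proper, masked in tricky.items()}
    text = replace_fields(tricky, text)       # mask multi-word names
    text = insert_commas(text)                # the simple, regular step
    return replace_fields(restore, text, prefix=",")  # unmask

sample = ("Condition: Remanufactured Grade: Commercial Warranty: 1 Year Parts & "
          "On-Site Labor w/Ext. Ships: Fully Assembled Processing Time: "
          "Ships from our Warehouse in 2-4 Weeks")
output = add_commas(sample)
print(output)
```

This also puts a comma before "Ships:", since it is a field like the others.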