I have a messy list of strings (list_strings), where I am able to remove the unwanted characters using regex, but I am struggling to also remove the closing bracket ]. How can I also remove those? I guess I am very close...
import re

#the list to clean
list_strings = ['[ABC1: text1]', '[[DC: this is a text]]', '[ABC-O: potatoes]', '[[C-DF: hello]]']
#remove from [ up to :
for string in list_strings:
    cleaned = re.sub(r'[\[A-Z\d\-]+:\s*', '', string)
    print(cleaned)
# current output
>>>text1]
>>>this is a text]]
>>>potatoes]
>>>hello]
Desired output:
text1
this is a text
potatoes
hello
You can use
cleaned = re.sub(r'^\[+[A-Z\d-]+:\s*|]+$', '', string)
See the Python demo and the regex demo.
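As a minimal sketch, here is that pattern applied to the whole list from the question. The original attempt only matched the prefix up to the colon, so nothing ever consumed the trailing brackets; the added `]+$` alternative handles them. The names `pattern` and `cleaned_strings` are only illustrative.

```python
import re

list_strings = ['[ABC1: text1]', '[[DC: this is a text]]', '[ABC-O: potatoes]', '[[C-DF: hello]]']

# Either the leading "[...:" prefix at the start of the string,
# or a run of "]" at the end; both are replaced with an empty string.
pattern = re.compile(r'^\[+[A-Z\d-]+:\s*|]+$')

cleaned_strings = [pattern.sub('', s) for s in list_strings]
print(cleaned_strings)
# ['text1', 'this is a text', 'potatoes', 'hello']
```

Note that the two alternatives are independent, so a string with only the prefix, or only the closing brackets, still gets that part stripped.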
Alternatively, to make sure the string starts with [[word: and ends with ]s, you may use
cleaned = re.sub(r'^\[+[A-Z\d-]+:\s*(.*?)\s*]+$', r'\1', string)
See this regex demo and this Python demo.
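To illustrate the difference between the two approaches, here is a small sketch. The malformed input `no_closing` is a made-up example, not taken from the question: the stricter pattern leaves it untouched because the whole string must match, while the alternation-based pattern still strips the prefix.

```python
import re

strict = re.compile(r'^\[+[A-Z\d-]+:\s*(.*?)\s*]+$')
loose = re.compile(r'^\[+[A-Z\d-]+:\s*|]+$')

well_formed = '[[DC: this is a text]]'
no_closing = '[ABC1: text1'  # hypothetical malformed input, no closing bracket

strict_ok = strict.sub(r'\1', well_formed)   # whole string matches, group 1 is kept
strict_bad = strict.sub(r'\1', no_closing)   # no match, string returned unchanged
loose_bad = loose.sub('', no_closing)        # prefix alternative matches on its own

print(strict_ok)   # this is a text
print(strict_bad)  # [ABC1: text1
print(loose_bad)   # text1
```

Which behavior you want depends on whether malformed entries should be cleaned anyway or left alone so they can be spotted.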
And, in case you simply want to extract that text inside, you may use
# First match only
m = re.search(r'\[+[A-Z\d-]+:\s*(.*?)\s*]', string)
if m:
    print(m.group(1))
# All matches
matches = re.findall(r'\[+[A-Z\d-]+:\s*(.*?)\s*]', string)
See this regex demo and this Python demo.
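For instance, assuming a single string that contains several bracketed items (the `text` value below is made up for illustration), the two calls behave like this:

```python
import re

pattern = re.compile(r'\[+[A-Z\d-]+:\s*(.*?)\s*]')

# Hypothetical input with two bracketed items in one string
text = '[ABC1: text1] and [[DC: this is a text]]'

# re.search stops at the first match
first = pattern.search(text)
if first:
    print(first.group(1))  # text1

# With exactly one capturing group, re.findall returns a list of group 1 values
all_matches = pattern.findall(text)
print(all_matches)  # ['text1', 'this is a text']
```

Since this pattern is not anchored with ^ and $, it finds the items wherever they occur in the string.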
Details
^ - start of string
\[+ - one or more [ chars
[A-Z\d-]+ - one or more uppercase ASCII letters, digits or - chars
: - a colon
\s* - zero or more whitespaces
| - or
]+$ - one or more ] chars at the end of string.
Also, (.*?) is a capturing group with ID 1 that matches any zero or more chars other than line break chars, as few as possible. \1 in the replacement refers to the value stored in this group memory buffer.
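If it helps readability, the same breakdown can be kept next to the pattern itself using re.VERBOSE. This is just one possible way to write the second pattern, with the variable name `pattern` chosen for illustration:

```python
import re

# Same pattern as the capturing-group version, spelled out with re.VERBOSE.
# In verbose mode, whitespace outside character classes is ignored and
# everything from an unescaped # to the end of the line is a comment.
pattern = re.compile(r'''
    ^            # start of string
    \[+          # one or more opening square brackets
    [A-Z\d-]+    # one or more uppercase ASCII letters, digits or hyphens
    :            # a colon
    \s*          # zero or more whitespaces
    (.*?)        # group 1: any chars except line breaks, as few as possible
    \s*          # zero or more whitespaces before the closing brackets
    ]+           # one or more closing square brackets
    $            # end of string
''', re.VERBOSE)

cleaned = pattern.sub(r'\1', '[[C-DF: hello]]')
print(cleaned)  # hello
```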