
How to use Python re to remove all substrings that start with letters or numbers and end with "PM"


My text file contains some random codes left over from image files, and I want to remove them. Each code starts with letters or numbers and ends with "PM". For example, given this text:

iSD08LXjpg2021330401PM大陸不可以給60歲以上人士打香港專找60歲以上人士去打,做白老鼠

日本與美國比還是很不錯的USA死亡才多呢日媒體報道jpg2021321056PM

An ideal result would be:

大陸不可以給60歲以上人士打香港專找60歲以上人士去打,做白老鼠

日本與美國比還是很不錯的USA死亡才多呢日媒體報道

But I don't know how to use re to remove them.


Solution

  • You want to remove every contiguous run of Latin letters and Arabic digits that ends with PM. A simple regular expression does this:

    [a-zA-Z0-9]*PM

    a-z is the range of all lowercase ASCII letters; A-Z and 0-9 do the same for uppercase letters and digits. * means zero or more of those characters, since the code before PM can have arbitrary length. PM is the fixed ending.

    This assumes the codes contain only those characters and nothing like ü. If they do contain other characters, add them to the bracketed character class as needed.

    The actual Python code would then be:

        import re
        cleaned = re.sub(r'[a-zA-Z0-9]*PM', '', inputtext)  # re.sub returns a new string
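To check the pattern against the two sample lines from the question, a minimal sketch could look like the following. The names `PM_CODE`, `strip_pm_codes`, `samples`, and `cleaned` are hypothetical and only used for illustration; the pattern itself is the one given above.

```python
import re

# Pattern from the answer: a run of ASCII letters/digits followed by "PM".
# PM_CODE and strip_pm_codes are illustrative names, not from the original post.
PM_CODE = re.compile(r'[a-zA-Z0-9]*PM')

def strip_pm_codes(text):
    """Return text with every ASCII letter/digit run ending in 'PM' removed."""
    return PM_CODE.sub('', text)

# The two sample lines from the question.
samples = [
    'iSD08LXjpg2021330401PM大陸不可以給60歲以上人士打香港專找60歲以上人士去打,做白老鼠',
    '日本與美國比還是很不錯的USA死亡才多呢日媒體報道jpg2021321056PM',
]

cleaned = [strip_pm_codes(s) for s in samples]
for line in cleaned:
    print(line)
```

Two points about this choice of pattern. The explicit class `[a-zA-Z0-9]` is used rather than `\w` because in Python 3 `\w` also matches Chinese characters, which would delete the text you want to keep. Also, the pattern removes any letter/digit run ending in PM, including a bare "PM" or a legitimate time such as "3PM", so it is only safe if such strings do not occur in the real text.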