python regex etl regex-lookarounds

REGEX: Remove spaces between strings with one or two letters


Consider the original strings shown in the first column of the following table:

Original String                  Parsed String                   Desired String
'W. & J. JOHNSON LMT.COM'       #W    J  JOHNSON LIMITED        #WJ JOHNSON LIMITED
'NORTH ROOF & WORKS CO. LTD.'   #NORTH ROOF   WORKS CO  LTD     #NORTH ROOF WORKS CO LTD
'DAVID DOE & CO., LIMITED'      #DAVID DOE   CO   LIMITED       #DAVID DOE CO LIMITED
'GEORGE TV & APPLIANCE LTD.'    #GEORGE TV   APPLIANCE LTD      #GEORGE TV APPLIANCE LTD 
'LOVE BROS. & OTHERS LTD.'      #LOVE BROS    OTHERS LTD        #LOVE BROS OTHERS LTD
'A. B. & MICHAEL CLEAN CO. LTD.'#A  B    MICHAEL CLEAN CO  LTD  #AB MICHAEL CLEAN CO LTD
'C.M. & B.B. CLEANER INC.'      #C M    B B  CLEANER INC        #CMBB CLEANER INC

Punctuation needs to be removed, which I have done as follows:

import re

def transform(word):
    return re.sub(r'(?<=[A-Za-z])\'(?=[A-Za-z])[A-Z]|[^\w\s]|(.com|COM)',' ',word)

However, there is one last point I have not been able to solve. After removing the punctuation I end up with lots of extra spaces. How can I write a regular expression that joins initials together while keeping single spaces between regular words (non-initials)?

Is substituting the characters mentioned above a bad approach for getting the desired strings?

Thanks for allowing me to continue learning :)


Solution

  • I think it's simpler to do this in parts. First, remove .com and any character other than a letter, a space or &. Then, remove any run of spaces or & that sits between two single-letter words. Finally, replace each remaining run of spaces or & with a single space:

    import re
    strings = ['W. & J. JOHNSON LMT.COM',
    'NORTH ROOF & WORKS CO. LTD.',
    'DAVID DOE & CO., LIMITED',
    'GEORGE TV & APPLIANCE LTD.',
    'LOVE BROS. & OTHERS LTD.',
    'A. B. & MICHAEL CLEAN CO. LTD.',
    'C.M. & B.B. CLEANER INC.'
    ]
    
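    for s in strings:
        # 1. drop ".com" and everything that is not a letter, "&" or space
        s = re.sub(r'\.COM|[^a-zA-Z& ]+', '', s, flags=re.IGNORECASE)
        # 2. join single letters separated only by spaces and/or "&"
        s = re.sub(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', s)
        # 3. collapse each remaining run of spaces/"&" into one space
        s = re.sub(r'\s*[& ]\s*', ' ', s)
        print(s)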
    

    Output

    WJ JOHNSON LMT
    NORTH ROOF WORKS CO LTD
    DAVID DOE CO LIMITED
    GEORGE TV APPLIANCE LTD
    LOVE BROS OTHERS LTD
    AB MICHAEL CLEAN CO LTD
    CM BB CLEANER INC
    

    Demo on rextester
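
To make the three stages easier to follow, here is a minimal sketch that returns the string after each substitution. The function name `normalize_steps` is hypothetical and not part of the original answer; the patterns are the same ones used above.

```python
import re

def normalize_steps(name):
    """Return the intermediate string after each of the three stages."""
    # Stage 1: drop ".com" (any case) and every character that is not
    # a letter, "&" or a space.
    stage1 = re.sub(r'\.com|[^a-zA-Z& ]+', '', name, flags=re.IGNORECASE)
    # Stage 2: delete a run of spaces/"&" that lies between two
    # single-letter words, which glues initials together.
    stage2 = re.sub(r'(?<=\b\w)\s*[ &]\s*(?=\w\b)', '', stage1)
    # Stage 3: squeeze every remaining run of spaces/"&" into one space.
    stage3 = re.sub(r'\s*[& ]\s*', ' ', stage2)
    return stage1, stage2, stage3

for name in ['W. & J. JOHNSON LMT.COM', 'A. B. & MICHAEL CLEAN CO. LTD.']:
    print(normalize_steps(name))
```

For 'W. & J. JOHNSON LMT.COM' the stages are 'W & J JOHNSON LMT', then 'WJ JOHNSON LMT', and the last stage leaves it unchanged. For 'A. B. & MICHAEL CLEAN CO. LTD.' the second stage only joins 'A' and 'B', because the lookahead `(?=\w\b)` rejects 'M' when more letters follow it; the leftover ' & ' is collapsed in the third stage.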

    Update

    This was written before the question was edited to change the required result for the last string. Given the edit, the above code can be simplified to a single substitution:

    for s in strings:
        s = re.sub(r'\.COM|[^a-zA-Z ]+|\s(?=&)|(?<!\w\w)\s+(?!\w\w)', '', s, flags=re.IGNORECASE)
        print(s)
    

    Demo on rextester
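
As a rough check of the single-pass pattern, the sketch below splits it into commented alternatives and compares the results with the desired strings from the question. The names `SINGLE_PASS`, `clean_company_name` and `EXPECTED` are hypothetical. One assumption to note: the first expected value is 'WJ JOHNSON LMT' rather than 'WJ JOHNSON LIMITED', since none of the patterns here expand the abbreviation.

```python
import re

# Same pattern as in the update, written as adjacent raw strings so that
# each alternative can carry a comment.
SINGLE_PASS = re.compile(
    r'\.com'                   # a literal ".com" (any case, see flag)
    r'|[^a-zA-Z ]+'            # any run of non-letter, non-space characters
    r'|\s(?=&)'                # the whitespace character right before "&"
    r'|(?<!\w\w)\s+(?!\w\w)',  # whitespace with fewer than two word
                               # characters directly on either side
    re.IGNORECASE,
)

def clean_company_name(name):
    """Remove every match of SINGLE_PASS from name."""
    return SINGLE_PASS.sub('', name)

EXPECTED = {
    'W. & J. JOHNSON LMT.COM': 'WJ JOHNSON LMT',
    'NORTH ROOF & WORKS CO. LTD.': 'NORTH ROOF WORKS CO LTD',
    'DAVID DOE & CO., LIMITED': 'DAVID DOE CO LIMITED',
    'GEORGE TV & APPLIANCE LTD.': 'GEORGE TV APPLIANCE LTD',
    'LOVE BROS. & OTHERS LTD.': 'LOVE BROS OTHERS LTD',
    'A. B. & MICHAEL CLEAN CO. LTD.': 'AB MICHAEL CLEAN CO LTD',
    'C.M. & B.B. CLEANER INC.': 'CMBB CLEANER INC',
}

for original, wanted in EXPECTED.items():
    result = clean_company_name(original)
    status = 'OK' if result == wanted else 'MISMATCH'
    print(f'{original!r:34} -> {result!r} [{status}]')
```

Because everything happens in one `re.sub` call, the lookarounds are evaluated against the original string, punctuation included. That is why the space in 'B. CLEANER' survives (it is followed by two letters) while the space in '& B.' is removed (the characters around it are not pairs of word characters).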