regex pyspark street-address

How to move a word to a different position in a string in PySpark


I have the following street addresses:

- KR 71D 6 94 SUR LC 1709
- KR 24B 15 20 SUR AP 301
- KR 72F 39 42 SUR
- KR 72F SUR 39 42
- KR 72 SUR 39 42

I need to detect the word SUR only when it appears after the address plate numbers, remove it from there, and insert it right after the main address. For example:

- KR 71D 6 94 SUR LC 1709  <-- Change it to: KR 71D SUR 6 94 LC 1709
- KR 24B 15 20 SUR AP 301 <-- Change it to: KR 24B SUR 15 20 AP 301
- KR 72F 39 42 SUR <-- Change it to: KR 72F SUR 39 42
- KR 72F SUR 39 42 <-- It is ok, leave it this way
- KR 72 SUR 39 42 <-- It is ok, leave it this way

Thanks a lot, and I hope somebody could help me.


Solution

  • You can try this:

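    import re
    
    # Sample addresses from the question.
    lyst = ["KR 71D 6 94 SUR LC 1709","KR 24B 15 20 SUR AP 301","KR 72F 39 42 SUR","KR 72F SUR 39 42","KR 72 SUR 39 42"]
    
    # Groups 1-3: street type, a space, street number; 4-5: plate numbers; 6: the word after the plate; 7: the rest.
    comp = re.compile(r'([a-zA-Z]+)(\s)(\w+)\s(\d+)\s(\d+)\s([a-zA-Z]+)(.*)$')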
    

    Logic:

    Each pair of parentheses in the pattern is a capture group, and the groups pick up the space-separated tokens of the address. Group 1 is the street type (KR), group 2 is a single space, group 3 is the street number (71D), groups 4 and 5 are the two plate numbers, and group 6 is the word that follows them. SUR is the fifth word, but it ends up in \6 because group 2 captures the first space. Everything after it is captured by (.*) as \7. re.sub then rebuilds the string with \6 moved to the third position, reusing \2 as the separator. The last two strings do not match the pattern, because the token after the street number is SUR rather than digits, so re.sub returns them unchanged. Note that group 6 accepts any alphabetic word, not only SUR.
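
To see which token lands in which group, it can help to inspect the match directly. The following is a minimal sketch that reuses the pattern from this answer; the variable names `match`, `groups`, and `no_match` are only illustrative.

```python
import re

# Same pattern as in the answer.
comp = re.compile(r'([a-zA-Z]+)(\s)(\w+)\s(\d+)\s(\d+)\s([a-zA-Z]+)(.*)$')

# An address with SUR after the plate numbers: the pattern matches.
match = comp.search("KR 71D 6 94 SUR LC 1709")
groups = match.groups()
print(groups)  # ('KR', ' ', '71D', '6', '94', 'SUR', ' LC 1709')

# An address that already has SUR in the right place: no match,
# so re.sub would leave it untouched.
no_match = comp.search("KR 72F SUR 39 42")
print(no_match)  # None
```

Group 6 holds SUR and group 7 keeps the leading space of the remainder, which is why the replacement string does not add a separator before \7.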

    # Move group 6 to the third position; strings that do not match are returned unchanged.
    newlyst = [comp.sub(r'\1\2\3\2\6\2\4\2\5\7', item) for item in lyst]
    

    Printing newlyst gives this output:

    ['KR 71D SUR 6 94 LC 1709', 'KR 24B SUR 15 20 AP 301', 'KR 72F SUR 39 42', 'KR 72F SUR 39 42', 'KR 72 SUR 39 42']
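
One caveat: the pattern above moves any alphabetic word that follows the plate numbers, so an address such as "KR 24B 15 20 AP 301" would become "KR 24B AP 15 20 301". If only SUR should be moved, a stricter variant might look like the sketch below. The names `SUR_AFTER_PLATE` and `move_sur` are hypothetical, and the pattern assumes tokens are separated by single spaces, as in the sample data.

```python
import re

# Hypothetical stricter pattern: anchored at the start of the string and
# matching only the literal word SUR after the two plate numbers.
SUR_AFTER_PLATE = re.compile(r'^([A-Za-z]+) (\w+) (\d+) (\d+) SUR\b')

def move_sur(address: str) -> str:
    """Move SUR from after the plate numbers to right after the street number."""
    return SUR_AFTER_PLATE.sub(r'\1 \2 SUR \3 \4', address)

addresses = [
    "KR 71D 6 94 SUR LC 1709",
    "KR 24B 15 20 SUR AP 301",
    "KR 72F 39 42 SUR",
    "KR 72F SUR 39 42",
    "KR 72 SUR 39 42",
]
fixed = [move_sur(a) for a in addresses]
print(fixed)
```

Since the question is about PySpark, the same rewrite could probably be applied to a DataFrame column with `pyspark.sql.functions.regexp_replace`. That function uses Java regex syntax, so group references in the replacement are written `$1`, `$2` rather than `\1`, `\2`.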