Search code examples
python-3.xregexpython-re

Python reformatting strings based on contents


In a pandas dataframe I have rows with contents in the following format:

1) abc123-Target 4-ufs
2) abc123-target4-ufs
3) geo.4
4) j123T4

All of these should be simply: target 4

So far my cleaning procedure is as follows:

df["point_id"] = df["point_id"].str.lower()
df["point_id"] = df['point_id'].str.replace('^.*?(?=target)', '')

This returns:

1) target 4-ufs
2) target4-ufs
3) geo.14
4) geo.2
5) j123T4

What I believe I need is:

a. Remove anything after the last number in the string, this solves 1
b. If 'target' does not have a space after it add a space, this with the above solves 2
c. If the string ends in a point and a number of any length remove everything before the point (incl. point) and replace with 'target ', this solves 3 and 4
d. If the string ends with a 't' followed by a number of any length remove everything before 't' and replace with 'target ', this solves 5

I'm looking at regex and re but the following is not having effect (add space before the last number)

df["point_id"] = re.sub(r'\D+$', '', df["point_id"])

Solution

  • Reading the rules, you might use 2 capture groups and check for the group values:

    \btarget\s*(\d+)|.*[t.](\d+)$
    
    • \btarget\s*(\d+) Match target, optional whitespace chars and capture 1+ digits in group 1
    • | Or
    • .*[t.] Match 0+ characters followed by either t or a .
    • (\d+)$ Capture 1+ digits in group 2 at the end of the string

    Regex demo | Python demo

    Python example:

    import re
    import pandas as pd
    
    pattern = r"\btarget\s*(\d+)|.*[t.](\d+)$"
    strings = [
        "abc123-Target 4-ufs",
        "abc123-target4-ufs",
        "geo.4",
        "j123T4"
    ]
    
    df = pd.DataFrame(strings, columns=["point_id"])
    
    def change(s):
        m = re.search(pattern, s, re.IGNORECASE)
        return "target " + (m.group(2) if m.group(2) else m.group(1))
    
    df["point_id"] = df["point_id"].apply(change)
    print(df)
    

    Output

       point_id
    0  target 4
    1  target 4
    2  target 4
    3  target 4