Search code examples
pythonregexstringintegerpython-re

Regex not working properly for some cases (python)?


I have a data frame where one column has string values and the other has integers but those columns have special characters with it or the string data has integers with it. So to remove it I used regex my regex is working fine but for the integer column, if 'abc123' is then it is not removing the abc and same with string column if '123abc' is there then it is not removing it. I don't know if the pattern or is wrong or the code is wrong. Below is my code,

d = [['abc','123'],['gbc@*','23abc'],['abc123','abc200'],['124abc','1230&*'],['abcer£$%&*!','230!?*&']]
df1= pd.DataFrame(d, columns=['str','int'])
print(df1)

   str     int
0   abc    123
1   gbc@*   23abc
2   abc123  abc200
3   124abc  1230&*
4   abcer£$%&*! 230!?*&

num = r'\d+$'
alpha = r'[a-zA-Z]+$'
wrong = df1[~df1['int'].str.contains(num, na=True)]
correct_int = [re.sub(r'([^\d]+?)', '', item) for item in wrong['int']]
print(correct_int)
wrong_str = df1[~df1['str'].str.contains(alpha, na=True)]
correct_str = [re.sub(r'([^a-zA-Z ]+?)', '', item) for item in df1['str']]
print(correct_str)

    

Output:

correct_int: ['23', '1230', '230']

As you can see it removed for '23abc','1230&*','230!?*&' but not for 'abc200' as the string was coming first

correct_str: ['abc', 'gbc', 'abc', 'abc', 'abcer']

now it removed for all but sometimes it's not removing when the value is '124abc'

Is my pattern wrong? I have also tried giving different patterns but nothing worked

I am removing the integers and special characters in the column 'str' and removing string values and special characters in column 'int' Expected output: Once after cleaning and replacing with the old with the cleaned values the output would look like this.

    str     int
0   abc     123
1   gbc     23
2   abc     200
3   abc     1230
4   abcer   230

Solution

  • You can do it with

    df1['str'] = df1['str'].str.replace(r"[\d\W+]", '') # replaces numbers (\d) and non-word characters (\W) with empty strings
    
    df1['int'] = df1['int'].str.replace(r"\D+", '') # replaces any non-decimal digit character (like [^0-9])
    

    Returns:

        str     int
    0   abc     123
    1   gbc     23
    2   abc     200
    3   abc     1230
    4   abcer   230