Search code examples
pythonregexdataframepython-realphanumeric

Best way to clean up the unmatched records given by REGEX (Alphanumeric) using python?


I have a dataset where there are column with alphanumeric values. I am able to filter the unmatched records now I want to clean them, that means for example the unmatched record is 123*abc& then it should remove 123abc. I have done it but I don't think this is a correct way to do it and also the data is combined after the final results I can use for loop for getting them correctly but that will be a slow process. Hence looking for a easier way (column by column cleaning). Is it possible to do so?

data = ['abc123','abc*123&','Abc123','ABC@*&123',np.nan,'123*Abc']
df=pd.DataFrame(data, columns=['a'])
print(df)
       a
0     abc123
1   abc*123&
2     Abc123
3  ABC@*&123
4        NaN
5    123*Abc

Filtering unmatched records:

wrong=df[~df['a'].str.contains(numeric, na=True)]
print(wrong)
        a
1   abc*123&
3  ABC@*&123
5    123*Abc


wrong_index = wrong.index
result = ''.join(i for i in wrong['a'] if not i.isalpha())  
alphanumeric = [character for character in result if character.isalnum()]
alphanumeric = "".join(alphanumeric)
df['a'].loc[wrong_index]=alphanumeric
print(df)
     a
0   abc123
1   abc123ABC123123Abc
2   Abc123
3   abc123ABC123123Abc
4   NaN
5   abc123ABC123123Abc

I know why is this happening, can be resolved by using for or loop through each row but it consumes a lot of time. Is there any way where we can do column by column clean up?

Excepted output:

       a
0     abc123
1     abc123
2     Abc123
3     ABC123
4        NaN
5     123Abc

Solution

  • Using plain vanilla Python using the built-in Regex module re will do. See this demo on IDEone: using regular-expression to replace list elements

    import re
    
    data = ['abc123','abc*123&','Abc123','ABC@*&123','123*Abc']
    cleaned = [re.sub('\W', '', item) for item in data]
    print(cleaned)
    

    The script will output:

    ['abc123', 'abc123', 'Abc123', 'ABC123', '123Abc']

    Explained

    1. The re.sub function substitutes a given string (here: item) like search & replace.
    2. The search is specified by regular-expression: \W all non-word characters (i.e. non numerical digits, non alpha letters).
    3. The replace is specified by an empty string to simply remove the found: ''.
    4. The for-loop is implemented as list-comprehension, a ideomatic or pythonic way for iterating through elements of a list.

    Filter parts of above

    If you like to filter to only parts, like only alphabetic or only numeric chars, then you need to combine metacharacters of PCRE like in this demo on IDEone:

    import re
    
    data = ['abc123','abc*123&','Abc123','ABC@*&123','123*Abc', '123_abc', '123 abc']
    
    # replace non-alphas and non-digits; filter [A-Za-z0-9_]
    alphanumeric_underscore = [re.sub('\W', '', item) for item in data]
    print('alphanumeric_underscore', alphanumeric_underscore)
    
    # replace also the underscore; filter [A-Za-z0-9]
    alphanumeric = [re.sub('[\W_]', '', item) for item in data]
    print('alphanumeric', alphanumeric)
    
    # filter only digits
    numeric = [re.search(r"\d+", item).group(0) for item in data]
    print('numeric', numeric)
    
    # filter only alphas
    alpha = [re.search(r"[A-Za-z]+", item).group(0) for item in data]
    print('alpha', alpha)
    

    It will output:

    alphanumeric_underscore ['abc123', 'abc123', 'Abc123', 'ABC123', '123Abc', '123_abc', '123abc'] alphanumeric ['abc123', 'abc123', 'Abc123', 'ABC123', '123Abc', '123abc', '123abc'] numeric ['123', '123', '123', '123', '123', '123', '123'] alpha ['abc', 'abc', 'Abc', 'ABC', 'Abc', 'abc', 'abc']

    It uses the regex-search re.search with regex (prefixed as raw-string) r"\d+" to return all found occurrences .group(0), hence filter.

    See also