I have a dataset with a column of alphanumeric values. I am able to filter out the unmatched records; now I want to clean them, meaning that, for example, the unmatched record 123*abc& should become 123abc. I have done it, but I don't think this is the correct way, and the cleaned data from all rows ends up combined in the final result. I could use a for loop to get them right, but that would be a slow process. Hence I am looking for an easier way (column-by-column cleaning). Is it possible to do so?
import numpy as np
import pandas as pd

data = ['abc123','abc*123&','Abc123','ABC@*&123',np.nan,'123*Abc']
df = pd.DataFrame(data, columns=['a'])
print(df)
a
0 abc123
1 abc*123&
2 Abc123
3 ABC@*&123
4 NaN
5 123*Abc
Filtering unmatched records (numeric holds the pattern for fully alphanumeric values):
numeric = r'^[A-Za-z0-9]+$'
wrong = df[~df['a'].str.contains(numeric, na=True)]
print(wrong)
a
1 abc*123&
3 ABC@*&123
5 123*Abc
wrong_index = wrong.index
result = ''.join(i for i in wrong['a'] if not i.isalpha())
alphanumeric = [character for character in result if character.isalnum()]
alphanumeric = "".join(alphanumeric)
df.loc[wrong_index, 'a'] = alphanumeric
print(df)
a
0 abc123
1 abc123ABC123123Abc
2 Abc123
3 abc123ABC123123Abc
4 NaN
5 abc123ABC123123Abc
I know why this is happening; it can be resolved by looping through each row, but that consumes a lot of time. Is there any way to do the cleanup column by column?
Expected output:
a
0 abc123
1 abc123
2 Abc123
3 ABC123
4 NaN
5 123Abc
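For reference, the per-row loop mentioned above would look something like the following sketch. It produces the expected output, but it iterates in plain Python rather than working on the column at once, which is what I want to avoid:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['abc123', 'abc*123&', 'Abc123', 'ABC@*&123', np.nan, '123*Abc'],
                  columns=['a'])

# per-row cleanup: keep only alphanumeric characters, leave NaN untouched
for idx, value in df['a'].items():
    if isinstance(value, str):
        df.loc[idx, 'a'] = ''.join(ch for ch in value if ch.isalnum())
print(df)
```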
Plain vanilla Python with the built-in regex module re will do.
See this demo on IDEone: using a regular expression to replace list elements
import re

data = ['abc123','abc*123&','Abc123','ABC@*&123','123*Abc']
cleaned = [re.sub(r'\W', '', item) for item in data]
print(cleaned)
The script will output:
['abc123', 'abc123', 'Abc123', 'ABC123', '123Abc']
The re.sub function performs a search & replace on the given string (here: item): every substring matching \W (any non-word character, i.e. anything that is not a digit, a letter, or an underscore) is replaced by the empty string ''.
If you would like to keep only parts of the string, such as only alphabetic or only numeric characters, you need to combine regex metacharacters, as in this demo on IDEone:
import re

data = ['abc123','abc*123&','Abc123','ABC@*&123','123*Abc', '123_abc', '123 abc']

# replace non-alphas and non-digits; keep [A-Za-z0-9_]
alphanumeric_underscore = [re.sub(r'\W', '', item) for item in data]
print('alphanumeric_underscore', alphanumeric_underscore)

# also replace the underscore; keep [A-Za-z0-9]
alphanumeric = [re.sub(r'[\W_]', '', item) for item in data]
print('alphanumeric', alphanumeric)

# keep only the first run of digits
numeric = [re.search(r"\d+", item).group(0) for item in data]
print('numeric', numeric)

# keep only the first run of alphas
alpha = [re.search(r"[A-Za-z]+", item).group(0) for item in data]
print('alpha', alpha)
It will output:
alphanumeric_underscore ['abc123', 'abc123', 'Abc123', 'ABC123', '123Abc', '123_abc', '123abc']
alphanumeric ['abc123', 'abc123', 'Abc123', 'ABC123', '123Abc', '123abc', '123abc']
numeric ['123', '123', '123', '123', '123', '123', '123']
alpha ['abc', 'abc', 'Abc', 'ABC', 'Abc', 'abc', 'abc']
It uses the regex search re.search with the pattern (prefixed as a raw string) r"\d+" and extracts the match via .group(0). Note that .group(0) returns only the first (leftmost) occurrence, and re.search returns None when nothing matches, in which case calling .group(0) raises an AttributeError.
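To address the column-by-column part of the question: the same substitution can be applied vectorized over a pandas column via the Series.str accessor, which also skips NaN automatically. A minimal sketch, assuming pandas is available:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame(['abc123', 'abc*123&', 'Abc123', 'ABC@*&123', np.nan, '123*Abc'],
                  columns=['a'])

# element-wise regex substitution over the whole column; NaN stays NaN
df['a'] = df['a'].str.replace(r'[\W_]', '', regex=True)
print(df)
```

This produces the expected output from the question without any explicit Python loop.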