Search code examples
pythonregexpandasdataframedata-cleaning

Simple way to remove special characters and alpha numerical from dataframe


I have a large dataset with some x rows and y number of columns. one of the columns as words and some unwanted data. That unwanted data is has no specific pattern hence I am finding it difficult to remove that from the dataframe.

nonhashtag
['want', 'better', 'than', 'Dhabi,', 'United', 'Arab', 'Emirates']
['Just', 'posted', 'photo', 'Rasim', 'Villa']
['Dhabi', 'International', 'Airport', '(AUH)', '\xd9\x85\xd8\xb7\xd8\xa7\xd8\xb1', '\xd8\xa3\xd8\xa8\xd9\x88', '\xd8\xb8\xd8\xa8\xd9\x8a', '\xd8\xa7\xd9\x84\xd8\xaf\xd9\x88\xd9\x84\xd9\x8a', 'Dhabi']
['just', 'shrug', 'off!', 'Dubai', 'Mall', 'Burj', 'Khalifa']
['out!', 'Cowboy', 'steppin', 'Notorious', 'going', 'sleep!', 'Make', 'happen']
['Buona', 'notte', '\xd1\x81\xd0\xbf\xd0\xbe\xd0\xba\xd0\xbe\xd0\xb9\xd0\xbd\xd0\xbe\xd0\xb9', '\xd0\xbd\xd0\xbe\xd1\x87\xd0\xb8', '\xd9\x84\xd9\x8a\xd9\x84\xd8\xa9', '\xd8\xb3\xd8\xb9\xd9\x8a\xd8\xaf\xd8\xa9!', '\xd8\xa3\xd8\xa8\xd9\x88', '\xd8\xb8\xd8\xa8\xd9\x8a', 'Viceroy', 'Hotel,', 'Yas\xe2\x80\xa6']

Every character which is not a word is to be removed this is only one column in the large dataset. Column name is nonhashtag

What is the simple way to clean the column. straight away remove them or replace with NAN

Expected output

nonhashtag
    ['want', 'better', 'than', 'Dhabi,', 'United', 'Arab', 'Emirates']
    ['Just', 'posted', 'photo', 'Rasim', 'Villa']
    ['Dhabi', 'International', 'Airport', '(AUH)', 'Dhabi']
    ['just', 'shrug', 'off!', 'Dubai', 'Mall', 'Burj', 'Khalifa']
    ['out!', 'Cowboy', 'steppin', 'Notorious', 'going', 'sleep!', 'Make', 'happen']
    ['Buona', 'notte', 'Viceroy', 'Hotel,']

Every [] is one row in that particular column so removing of only the \x and remaining characters is needed the empty [] should be left in the row. Keeping the row is important as other column's that row is filled with needed information.

To write a proper code I couldn't get pass through the input read as I am not able to find a pattern in the dataset to write a regex.

Thanks in advance for the help


Solution

  • I import lot of files and many a times columns names are dirty, they get Unwanted special characters and I don't know which all characters might come. I only want Underscores in column names and no spaces

    df.columns = df.columns.str.strip()     
    df.columns = df.columns.str.replace(' ', '_')         
    df.columns = df.columns.str.replace(r"[^a-zA-Z\d\_]+", "")    
    df.columns = df.columns.str.replace(r"[^a-zA-Z\d\_]+", "")