I have a large dataset with some x rows and y number of columns. one of the columns as words and some unwanted data. That unwanted data is has no specific pattern hence I am finding it difficult to remove that from the dataframe.
nonhashtag
['want', 'better', 'than', 'Dhabi,', 'United', 'Arab', 'Emirates']
['Just', 'posted', 'photo', 'Rasim', 'Villa']
['Dhabi', 'International', 'Airport', '(AUH)', '\xd9\x85\xd8\xb7\xd8\xa7\xd8\xb1', '\xd8\xa3\xd8\xa8\xd9\x88', '\xd8\xb8\xd8\xa8\xd9\x8a', '\xd8\xa7\xd9\x84\xd8\xaf\xd9\x88\xd9\x84\xd9\x8a', 'Dhabi']
['just', 'shrug', 'off!', 'Dubai', 'Mall', 'Burj', 'Khalifa']
['out!', 'Cowboy', 'steppin', 'Notorious', 'going', 'sleep!', 'Make', 'happen']
['Buona', 'notte', '\xd1\x81\xd0\xbf\xd0\xbe\xd0\xba\xd0\xbe\xd0\xb9\xd0\xbd\xd0\xbe\xd0\xb9', '\xd0\xbd\xd0\xbe\xd1\x87\xd0\xb8', '\xd9\x84\xd9\x8a\xd9\x84\xd8\xa9', '\xd8\xb3\xd8\xb9\xd9\x8a\xd8\xaf\xd8\xa9!', '\xd8\xa3\xd8\xa8\xd9\x88', '\xd8\xb8\xd8\xa8\xd9\x8a', 'Viceroy', 'Hotel,', 'Yas\xe2\x80\xa6']
Every character which is not a word is to be removed this is only one column in the large dataset. Column name is nonhashtag
What is the simple way to clean the column. straight away remove them or replace with NAN
Expected output
nonhashtag
['want', 'better', 'than', 'Dhabi,', 'United', 'Arab', 'Emirates']
['Just', 'posted', 'photo', 'Rasim', 'Villa']
['Dhabi', 'International', 'Airport', '(AUH)', 'Dhabi']
['just', 'shrug', 'off!', 'Dubai', 'Mall', 'Burj', 'Khalifa']
['out!', 'Cowboy', 'steppin', 'Notorious', 'going', 'sleep!', 'Make', 'happen']
['Buona', 'notte', 'Viceroy', 'Hotel,']
Every []
is one row in that particular column so removing of only the \x and remaining characters
is needed the empty []
should be left in the row. Keeping the row is important as other column's that row is filled with needed information.
To write a proper code I couldn't get pass through the input read as I am not able to find a pattern in the dataset to write a regex.
Thanks in advance for the help
I import lot of files and many a times columns names are dirty, they get Unwanted special characters and I don't know which all characters might come. I only want Underscores in column names and no spaces
df.columns = df.columns.str.strip()
df.columns = df.columns.str.replace(' ', '_')
df.columns = df.columns.str.replace(r"[^a-zA-Z\d\_]+", "")
df.columns = df.columns.str.replace(r"[^a-zA-Z\d\_]+", "")