I'm trying to tokenize the words in a DataFrame that looks like this:
A B C D E F
0 Orange robot x eyes discomfort striped tee nan
1 orange robot blue beams grin vietnam jacket nan
2 aquamarine robot 3d bored cigarette nan
I converted the DataFrame to a string and removed all the special characters:
df_str = df.to_string(header=False)
import re
normalised_text = df_str.lower()
text = re.sub(r"[^\a-zA-Z0-9 ]","", normalised_text)
print(text)
1 orange robot x eyes discomfort striped tee nan
2 orange robot blue beams grin vietnam jacket nan
3 aquamarine robot 3d bored cigarette nan
When I tokenize this string with the code below:
from nltk.tokenize import word_tokenize

def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str):
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj

tokenized_text = tokenize(text)
I get the output
['orange', 'robot', 'x', 'eyes', 'discomfort', 'striped', 'tee', nan,'orange', 'robot', 'blue', 'beams', 'grin', 'vietnam', 'jacket', nan,'aquamarine', 'robot', '3d', 'bored', 'cigarette', nan, 'sea', 'captains', 'hat']
which is quite different from the output I expected
[['orange'], ['robot'], ['x', 'eyes'], ['discomfort'], ['striped', 'tee'], nan]
[['orange'], ['robot'], ['blue', 'beams'], ['grin'], ['vietnam', 'jacket'], nan]
[['aquamarine'], ['robot'], ['3d'], ['bored', 'cigarette'], nan, ['sea', 'captains', 'hat']]
Any ideas on how I can get the output I expected? Any help would be greatly appreciated!
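Don't convert the DataFrame to a string. Converting it throws away the cell boundaries, so the tokenizer sees one long text. Instead, work with every text in the DataFrame separately.
Use .applymap(function) to execute a function on every text (on every cell in the DataFrame).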
new_df = df.applymap(tokenize)
result = new_df.values.tolist()
Minimal working example:
import pandas as pd
from nltk.tokenize import word_tokenize
data = {
    'Background': ['Orange', 'Orange', 'Aqua'],
    'Fur': ['Robot', 'Robot', 'Robot'],
    'Eyes': ['X Eyes', 'Blue Beams', '3d'],
    'Mouth': ['Discomfort', 'Grin', 'Bored Cigarette'],
    'Clothes': ['Striped Tee', 'Vietman Jacket', None],
    'Hat': [None, None, "Sea Captain's Hat"],
}
df = pd.DataFrame(data)
print(df.to_string()) # `to_string()` to display full dataframe without `...`
# ----------------------------------------
def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str):
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj

new_df = df.applymap(tokenize)
result = new_df.values.tolist()
print(result)
Result:
   Background    Fur        Eyes            Mouth         Clothes                Hat
0      Orange  Robot      X Eyes       Discomfort     Striped Tee               None
1      Orange  Robot  Blue Beams             Grin  Vietman Jacket               None
2        Aqua  Robot          3d  Bored Cigarette            None  Sea Captain's Hat
[
[['Orange'], ['Robot'], ['X', 'Eyes'], ['Discomfort'], ['Striped', 'Tee'], None],
[['Orange'], ['Robot'], ['Blue', 'Beams'], ['Grin'], ['Vietman', 'Jacket'], None],
[['Aqua'], ['Robot'], ['3d'], ['Bored', 'Cigarette'], None, ['Sea', 'Captain', "'s", 'Hat']]
]
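If you also want the lowercasing and special-character removal from the question (so that you get 'captains' instead of 'Captain', "'s"), one option is to clean each cell before splitting it. This is only a sketch under a few assumptions: it swaps word_tokenize for a plain str.split(), the helper name normalize_and_tokenize is made up for illustration, and the two-column DataFrame is a cut-down stand-in for the real data. It also uses DataFrame.map when it exists, because applymap has been deprecated since pandas 2.1.

```python
import re

import pandas as pd


def normalize_and_tokenize(cell):
    """Lowercase a text cell, drop special characters, split on whitespace."""
    if not isinstance(cell, str):
        return cell  # leave None/NaN and other non-text values untouched
    cleaned = re.sub(r"[^a-z0-9 ]", "", cell.lower())
    return cleaned.split()


# Cut-down stand-in for the real DataFrame (assumption for this sketch).
df = pd.DataFrame({
    "Eyes": ["X Eyes", "3d"],
    "Hat": [None, "Sea Captain's Hat"],
})

# DataFrame.map replaced applymap in pandas 2.1; fall back on older versions.
cellwise = df.map if hasattr(df, "map") else df.applymap
result = cellwise(normalize_and_tokenize).values.tolist()
print(result)
```

Because the cleaning happens per cell, the row and column structure is kept, and missing values pass through unchanged instead of turning into the text "nan".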