
Splitting a string made out of a DataFrame row-wise


I'm trying to tokenize the words within a DataFrame which looks like this:

  A            B       C          D            E           F
0 Orange     robot   x eyes   discomfort   striped tee    nan
1 orange     robot  blue beams   grin      vietnam jacket nan
2 aquamarine robot   3d          bored        cigarette   nan   
     

I converted the DataFrame to a string and removed all the special characters like this:

df_str = df.to_string(header=False)

import re

normalised_text = df_str.lower()
text = re.sub(r"[^a-zA-Z0-9 ]", "", normalised_text)

print(text)

    1    orange   robot   x eyes   discomfort   striped tee   nan
    2    orange   robot   blue beams   grin   vietnam jacket  nan
    3    aquamarine  robot   3d       bored       cigarette    nan   

So when I tokenize this string with the code below,

from nltk.tokenize import word_tokenize

def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str):
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj

tokenized_text = tokenize(text)

I get the output

['orange', 'robot', 'x', 'eyes', 'discomfort', 'striped', 'tee', nan,'orange', 'robot', 'blue', 'beams', 'grin', 'vietnam', 'jacket', nan,'aquamarine', 'robot', '3d', 'bored', 'cigarette', nan, 'sea', 'captains', 'hat']

which is quite different from the output I expected

[['orange'], ['robot'], ['x', 'eyes'], ['discomfort'], ['striped', 'tee'], nan]
[['orange'], ['robot'], ['blue', 'beams'], ['grin'], ['vietnam', 'jacket'], nan]
[['aquamarine'], ['robot'], ['3d'], ['bored', 'cigarette'], nan, ['sea', 'captains', 'hat']]

Any ideas on how can I get the output I expected? Any help would be greatly appreciated!


Solution

  • Don't convert the DataFrame to a string; instead, work with every text in the DataFrame separately.

    Use .applymap(function) to execute the function on every text (i.e. on every cell in the DataFrame).

    new_df = df.applymap(tokenize)
    
    result = new_df.values.tolist()
    

    Minimal working example:

    import pandas as pd
    from nltk.tokenize import word_tokenize
    
    data = {
        'Background': ['Orange', 'Orange', 'Aqua'], 
        'Fur': ['Robot', 'Robot', 'Robot'], 
        'Eyes': ['X Eyes', 'Blue Beams', '3d'],
        'Mouth': ['Discomfort', 'Grin', 'Bored Cigarette'],
        'Clothes': ['Striped Tee', 'Vietman Jacket', None],
        'Hat': [None, None, "Sea Captain's Hat"],
    }
    
    df = pd.DataFrame(data)
    
    print(df.to_string())  # `to_string()` to display full dataframe without `...`
    
    # ----------------------------------------
    
    def tokenize(obj):
        if obj is None:
            return None
        elif isinstance(obj, str): 
            return word_tokenize(obj)
        elif isinstance(obj, list):
            return [tokenize(i) for i in obj]
        else:
            return obj
    
    new_df = df.applymap(tokenize)
    
    result = new_df.values.tolist()
    
    print(result)
    

    Result:

      Background    Fur        Eyes            Mouth         Clothes                Hat
    0     Orange  Robot      X Eyes       Discomfort     Striped Tee               None
    1     Orange  Robot  Blue Beams             Grin  Vietman Jacket               None
    2       Aqua  Robot          3d  Bored Cigarette            None  Sea Captain's Hat
    
    [
      [['Orange'], ['Robot'], ['X', 'Eyes'], ['Discomfort'], ['Striped', 'Tee'], None], 
      [['Orange'], ['Robot'], ['Blue', 'Beams'], ['Grin'], ['Vietman', 'Jacket'], None], 
      [['Aqua'], ['Robot'], ['3d'], ['Bored', 'Cigarette'], None, ['Sea', 'Captain', "'s", 'Hat']]
    ]
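
    As a side note, DataFrame.applymap was deprecated in pandas 2.1 in favour of the identically-behaving DataFrame.map. Below is a minimal sketch of the same per-cell approach that works on both old and new pandas; it uses plain str.split in place of word_tokenize so it runs without NLTK data (note that word_tokenize would additionally split "Captain's" into 'Captain' and "'s", as in the result above):

    ```python
    import pandas as pd

    # Two columns mirroring the example data above
    data = {
        'Eyes': ['X Eyes', 'Blue Beams', '3d'],
        'Hat': [None, None, "Sea Captain's Hat"],
    }
    df = pd.DataFrame(data)

    def tokenize(obj):
        # str.split stands in for nltk's word_tokenize here
        if obj is None:
            return None
        elif isinstance(obj, str):
            return obj.split()
        return obj  # non-string values (e.g. NaN floats) pass through unchanged

    # pandas >= 2.1 renamed applymap to DataFrame.map
    if hasattr(df, 'map'):
        new_df = df.map(tokenize)
    else:
        new_df = df.applymap(tokenize)

    result = new_df.values.tolist()
    print(result)
    ```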