
Splitting a string made out of a DataFrame row-wise


I'm trying to tokenize the words within a DataFrame which looks like this:

  A            B       C          D            E           F
0 Orange     robot   x eyes   discomfort   striped tee    nan
1 orange     robot  blue beams   grin      vietnam jacket nan
2 aquamarine robot   3d          bored        cigarette   nan   
     

I converted the DataFrame to a string and removed all the special characters like this:

df_str = df.to_string(header=False)

import re

normalised_text = df_str.lower()
text = re.sub(r"[^a-zA-Z0-9 ]", "", normalised_text)

print(text)

    1    orange   robot   x eyes   discomfort   striped tee   nan
    2    orange   robot   blue beams   grin   vietnam jacket  nan
    3    aquamarine  robot   3d       bored       cigarette    nan   

So when I tokenize this string with the code below,

from nltk.tokenize import word_tokenize

def tokenize(obj):
    if obj is None:
        return None
    elif isinstance(obj, str):
        return word_tokenize(obj)
    elif isinstance(obj, list):
        return [tokenize(i) for i in obj]
    else:
        return obj

tokenized_text = tokenize(text)

I get the output

['orange', 'robot', 'x', 'eyes', 'discomfort', 'striped', 'tee', nan,'orange', 'robot', 'blue', 'beams', 'grin', 'vietnam', 'jacket', nan,'aquamarine', 'robot', '3d', 'bored', 'cigarette', nan, 'sea', 'captains', 'hat']

which is quite different from the output I expected

[['orange'], ['robot'], ['x', 'eyes'], ['discomfort'], ['striped', 'tee'], nan]
[['orange'], ['robot'], ['blue', 'beams'], ['grin'], ['vietnam', 'jacket'], nan]
[['aquamarine'], ['robot'], ['3d'], ['bored', 'cigarette'], nan, ['sea', 'captains', 'hat']]

Any ideas on how can I get the output I expected? Any help would be greatly appreciated!


Solution

  • Don't convert the DataFrame to a string; instead, work with every text in the DataFrame separately.

    Use .applymap(function) to execute the function on every text (i.e. on every cell in the DataFrame).

    new_df = df.applymap(tokenize)
    
    result = new_df.values.tolist()
    

    Minimal working example:

    import pandas as pd
    from nltk.tokenize import word_tokenize
    
    data = {
        'Background': ['Orange', 'Orange', 'Aqua'], 
        'Fur': ['Robot', 'Robot', 'Robot'], 
        'Eyes': ['X Eyes', 'Blue Beams', '3d'],
        'Mouth': ['Discomfort', 'Grin', 'Bored Cigarette'],
        'Clothes': ['Striped Tee', 'Vietman Jacket', None],
        'Hat': [None, None, "Sea Captain's Hat"],
    }
    
    df = pd.DataFrame(data)
    
    print(df.to_string())  # `to_string()` to display full dataframe without `...`
    
    # ----------------------------------------
    
    def tokenize(obj):
        if obj is None:
            return None
        elif isinstance(obj, str): 
            return word_tokenize(obj)
        elif isinstance(obj, list):
            return [tokenize(i) for i in obj]
        else:
            return obj
    
    new_df = df.applymap(tokenize)
    
    result = new_df.values.tolist()
    
    print(result)
    

    Result:

      Background    Fur        Eyes            Mouth         Clothes                Hat
    0     Orange  Robot      X Eyes       Discomfort     Striped Tee               None
    1     Orange  Robot  Blue Beams             Grin  Vietman Jacket               None
    2       Aqua  Robot          3d  Bored Cigarette            None  Sea Captain's Hat
    
    [
      [['Orange'], ['Robot'], ['X', 'Eyes'], ['Discomfort'], ['Striped', 'Tee'], None], 
      [['Orange'], ['Robot'], ['Blue', 'Beams'], ['Grin'], ['Vietman', 'Jacket'], None], 
      [['Aqua'], ['Robot'], ['3d'], ['Bored', 'Cigarette'], None, ['Sea', 'Captain', "'s", 'Hat']]
    ]
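
    As a side note, DataFrame.applymap was deprecated in pandas 2.1 in favour of the identically-behaving DataFrame.map. Below is a minimal sketch of the same per-cell approach that works on both old and new pandas; it uses plain str.split in place of word_tokenize so it runs without NLTK data (note that word_tokenize would additionally split "Captain's" into 'Captain' and "'s", as in the result above):

    ```python
    import pandas as pd

    # Two columns mirroring the example data above
    data = {
        'Eyes': ['X Eyes', 'Blue Beams', '3d'],
        'Hat': [None, None, "Sea Captain's Hat"],
    }
    df = pd.DataFrame(data)

    def tokenize(obj):
        # str.split stands in for nltk's word_tokenize here
        if obj is None:
            return None
        elif isinstance(obj, str):
            return obj.split()
        return obj  # non-string values (e.g. NaN floats) pass through unchanged

    # pandas >= 2.1 renamed applymap to DataFrame.map
    if hasattr(df, 'map'):
        new_df = df.map(tokenize)
    else:
        new_df = df.applymap(tokenize)

    result = new_df.values.tolist()
    print(result)
    ```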