Tags: python, dataframe, nlp, tokenize, textblob

How to word_tokenize pandas dataframe


My pandas dataframe (df.tweet) consists of one column with German tweets. I already did the data cleaning and dropped the columns I don't need. Now I want to word_tokenize the tweets in the pandas dataframe. With TextBlob this only works on strings, so I'm only able to tokenize the dataframe one string at a time (see code below). I used textblob-de because it tokenizes German text.

Is there a way to tokenize the whole dataframe, e.g. with a for loop? I'm new to Python and NLP and really stuck at this point. Some help would be great!

This is what I have:

    pip install -U textblob-de

    from textblob_de import TextBlobDE as TextBlob
    TextBlob(df.tweet[1]).words

Solution

  • This should work. However, TextBlob/NLTK is not the greatest at tokenizing compared to alternatives like spaCy or (especially) stanza, so I'd recommend using one of those instead.

    from textblob_de import TextBlobDE as TextBlob
    df["tweet_tok"] = df["tweet"].apply(lambda x: " ".join(TextBlob(x).words))
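
    Note that joining the tokens back with `" ".join(...)` gives you a string again. For most downstream NLP steps it is more useful to keep the tokens as a list per row. As a minimal sketch of the same column-wise pattern, assuming a small hypothetical dataframe and a simple regex tokenizer standing in for `TextBlob(x).words`:

    ```python
    import re
    import pandas as pd

    # Hypothetical sample data standing in for the cleaned tweet column.
    df = pd.DataFrame({"tweet": ["Das ist ein Test", "Guten Morgen, Berlin!"]})

    # A minimal regex tokenizer as a stand-in for TextBlob(x).words:
    # \w+ keeps word characters (including German umlauts) and drops punctuation.
    def tokenize(text):
        return re.findall(r"\w+", text)

    # apply() runs the tokenizer on every row; each cell of tweet_tok
    # is now a list of tokens instead of a single string.
    df["tweet_tok"] = df["tweet"].apply(tokenize)
    ```

    The same `df["tweet"].apply(...)` pattern works with any tokenizer, including the TextBlob one above, so no explicit for loop is needed.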