My pandas DataFrame (df.tweet) consists of one column with German tweets. I already did the data cleaning and dropped the columns I don't need. Now I want to word-tokenize the tweets in the DataFrame.
With TextBlob this only works on strings, so I can only tokenize the DataFrame one string at a time (see the code below). I used textblob-de because it tokenizes German text.
Is there a way to get the tokenization done for the whole DataFrame, for example with a for loop? I'm new to Python and NLP and am really stuck at this point. Any help would be great!
This is what I have:
pip install -U textblob-de
from textblob_de import TextBlobDE as TextBlob
TextBlob(df.tweet[1]).words
This should work. However, TextBlob/NLTK is not the greatest at tokenizing compared to alternatives like spaCy or (especially) stanza, so I'd recommend using one of those instead.
from textblob_de import TextBlobDE as TextBlob

# tokenize each tweet and join the tokens back into one space-separated string
df["tweet_tok"] = df["tweet"].apply(lambda x: " ".join(TextBlob(x).words))