
TextBlob not returning the correct number of instances of string in Pandas dataframe


For a lab project, I'm analyzing Twitter data. All of the tweets we captured contain the word 'sex', because that is the keyword we configured the TwitterStreamer to filter on.

I loaded the CSV that holds all of the tweet data (JSON metadata fields) into a pandas DataFrame and pulled out the 'text' column to isolate the tweet text.

    import pandas as pd
    df = pd.read_csv('tweets_hiv.csv')
    saved_column4 = df.text
    print(saved_column4)

This prints the expected output:

    0                                Some example tweet text
    1                 Oh hey look more tweet text @things I hate #stuff
    ...a bunch more lines
    Name: text, Length: 8540, dtype: object

But when I try this:

    from textblob import TextBlob
    tweetstr = str(saved_column4)
    tweets = TextBlob(tweetstr).upper()
    print(tweets.words.count('sex', case_sensitive=False))

My output is 22.

There should be AT LEAST as many occurrences of the word 'sex' as there are lines in the CSV, and likely more. I can't figure out what's happening here. Is TextBlob not handling a dtype: object column correctly?


Solution

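  • The problem is str(saved_column4): calling str() on a Series returns its abbreviated printed representation (the same truncated view that print shows), not the full contents, so TextBlob only ever sees the handful of rows that are displayed. I'm not entirely sure this is methodologically sound as far as language processing goes, but using join on the actual values will give you the count you need.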

    import pandas as pd
    from textblob import TextBlob
    
    tweets = pd.Series('sex {}'.format(x) for x in range(1000))
    tweetstr = " ".join(tweets.tolist())
    tweetsb = TextBlob(tweetstr).upper()
    print(tweetsb.words.count('sex', case_sensitive=False))
    # 1000
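
    As a rough illustration of why the str() approach from the question undercounts, the sketch below compares the abbreviated display string with the joined values. The sample data is made up, plain substring counting stands in for TextBlob, and pandas' default display settings are assumed.

```python
import pandas as pd

# Hypothetical stand-in data: 1000 short "tweets", each containing the keyword once.
tweets = pd.Series("sex {}".format(i) for i in range(1000))

# str() on a long Series returns its abbreviated display form (head, "...", tail),
# so most rows never make it into the string.
display_text = str(tweets)
display_count = display_text.lower().count("sex")

# Joining the actual values keeps every row.
full_text = " ".join(tweets.tolist())
full_count = full_text.lower().count("sex")

print(display_count < full_count)  # True under default display settings
print(full_count)                  # 1000
```

    The exact value of display_count depends on the pandas version and on options such as display.max_rows, which is why it is not shown here.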
    

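    If you just need the number of tweets that contain 'sex', without necessarily using TextBlob, pandas can do it directly. Note that this counts matching rows rather than individual occurrences, and it matches substrings, so words such as 'sexy' are included: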

    import pandas as pd
    
    tweets = pd.Series('sex {}'.format(x) for x in range(1000))
    sex_tweets = tweets.str.contains('sex', case=False)
    print(sex_tweets.sum())
    # 1000
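
    Since the question expects more occurrences than rows, it may help to see the difference between counting matching rows and counting whole-word occurrences. This is only a sketch on made-up tweets; the word-boundary pattern is an assumption about what should count as the word.

```python
import re
import pandas as pd

# Hypothetical sample tweets: the second mentions the keyword three times,
# the third only contains it inside a longer word.
tweets = pd.Series([
    "Sex ed should be taught in schools",
    "sex, SEX, and Sex again",
    "Middlesex county news",
])

# Number of rows that contain the substring anywhere.
rows_with_substring = tweets.str.contains("sex", case=False).sum()

# Total whole-word occurrences across all rows (\b marks a word boundary).
word_occurrences = tweets.str.count(r"\bsex\b", flags=re.IGNORECASE).sum()

print(rows_with_substring)  # 3
print(word_occurrences)     # 4
```

    Here 'Middlesex' counts as a matching row for str.contains but contributes nothing to the whole-word total.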
    

    The first snippet raises a TypeError if any element is not a string. This is a limitation of join rather than of TextBlob. The following snippet reproduces it:

    # tweets = pd.Series('sex {}'.format(x) for x in range(1000))
    tweets = pd.Series(x for x in range(1000))
    tweetstr = " ".join(tweets.tolist())
    

    Which gives the following result:

    Traceback (most recent call last):
      File "F:\test.py", line 6, in <module>
        tweetstr = " ".join(tweets.tolist())
    TypeError: sequence item 0: expected string, numpy.int64 found
    

    A simple workaround is to convert x to a string inside the generator expression before using join, like so:

    tweets = pd.Series(str(x) for x in range(1000))
    

    Or you can be more explicit: create a list first, map the str function over it, and then use join.

    tweetlist = tweets.tolist()
    tweetstr = map(str, tweetlist)
    tweetstr = " ".join(tweetstr)
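
    With real CSV data, the non-string elements are often missing values rather than numbers. A possible pandas-only variant, sketched on made-up data, drops the missing rows and casts the rest to str before joining. Whether dropping is the right treatment for missing tweets is an assumption.

```python
import pandas as pd

# Hypothetical column with a missing value and a number mixed in,
# as can happen when a CSV has empty or numeric-looking cells.
tweets = pd.Series(["sex ed tweet", None, 12345, "another SEX tweet"])

# Drop missing rows, then cast what remains to str so join() accepts it.
clean = tweets.dropna().astype(str)
tweetstr = " ".join(clean)

print(tweetstr)
# sex ed tweet 12345 another SEX tweet
```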