TextBlob not returning the correct number of instances of string in Pandas dataframe

For a lab project, I'm analyzing Twitter data. Every tweet we captured contains the word 'sex', because that is the keyword we configured the TwitterStreamer to filter on.

I loaded the CSV that holds all of the tweet data (the JSON metadata fields) into a pandas DataFrame and pulled out the 'text' column to isolate the tweet text.

    import pandas as pd

    df = pd.read_csv('tweets_hiv.csv')
    saved_column4 = df.text
    print(saved_column4)

This prints the expected output:

    0                                Some example tweet text
    1                 Oh hey look more tweet text @things I hate #stuff
    ...a bunch more lines
    Name: text, Length: 8540, dtype: object

But when I try this:

    from textblob import TextBlob
    tweetstr = str(saved_column4)
    tweets = TextBlob(tweetstr).upper()
    print(tweets.words.count('sex', case_sensitive=False))

My output is 22.

There should be at least as many occurrences of the word 'sex' as there are rows in the CSV, and likely more. I can't figure out what's happening here. Does TextBlob not handle a Series with dtype: object correctly?

Solution

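I'm not entirely sure this is methodologically sound from a language-processing standpoint, but joining the tweets into a single string will give you the count you need. Calling str() on a Series returns its printed representation, which pandas truncates for a long Series, so TextBlob only ever saw the handful of rows that get displayed.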

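    import pandas as pd
    from textblob import TextBlob

    # Stand-in data: 1000 short "tweets", each containing the keyword once
    tweets = pd.Series('sex {}'.format(x) for x in range(1000))
    # Join the actual values instead of calling str() on the Series
    tweetstr = " ".join(tweets.tolist())
    tweetsb = TextBlob(tweetstr).upper()
    print(tweetsb.words.count('sex', case_sensitive=False))
    # 1000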
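To see why the original str() call undercounts, here is a minimal sketch comparing the two conversions on the same stand-in Series. The names as_repr and as_text are just illustrative, and the exact number of rows pandas displays depends on your pandas version and display options, so no specific count is shown for the truncated case.

```python
import pandas as pd

# Stand-in data, same shape as the example above
tweets = pd.Series("sex {}".format(x) for x in range(1000))

as_repr = str(tweets)        # printed representation; long Series are truncated with "..."
as_text = " ".join(tweets)   # every element, joined with spaces

print(as_repr.count("sex"))  # far fewer than 1000; depends on display settings
print(as_text.count("sex"))  # 1000
```

The first number only reflects the rows pandas chose to display, which would explain a small count like the 22 in the question.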

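If you just need the count and don't necessarily need TextBlob, you can do it in pandas alone. Note that this counts the tweets that contain 'sex' as a substring, not the total number of occurrences: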

    import pandas as pd

    tweets = pd.Series('sex {}'.format(x) for x in range(1000))
    # Boolean Series: True where the tweet contains 'sex' (case-insensitive)
    sex_tweets = tweets.str.contains('sex', case=False)
    print(sex_tweets.sum())
    # 1000
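If what you want is the total number of occurrences of the whole word rather than the number of matching tweets, one possible approach is Series.str.count with a word-boundary regex. This is a sketch on made-up sample tweets, not your data; the variable names are illustrative.

```python
import re
import pandas as pd

# Hypothetical sample tweets for illustration only
tweets = pd.Series([
    "Sex and more sex",
    "sexy is a different word",
    "no match here",
    "SEX.",
])

# \b restricts the match to the whole word, so "sexy" is not counted
per_tweet = tweets.str.count(r"\bsex\b", flags=re.IGNORECASE)

total_occurrences = int(per_tweet.sum())        # 3
tweets_with_word = int((per_tweet > 0).sum())   # 2

print(total_occurrences, tweets_with_word)
```

By contrast, str.contains('sex', case=False) would also flag the "sexy" tweet, so pick whichever definition matches your analysis.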

The first snippet raises a TypeError if any element is not a string. This is a limitation of join rather than of TextBlob. You can reproduce it with the following snippet:

    # tweets = pd.Series('sex {}'.format(x) for x in range(1000))
    tweets = pd.Series(x for x in range(1000))
    tweetstr = " ".join(tweets.tolist())

Which gives the following result:

    Traceback (most recent call last):
      File "F:\test.py", line 6, in <module>
        tweetstr = " ".join(tweets.tolist())
    TypeError: sequence item 0: expected string, numpy.int64 found

A simple workaround is to convert each x to a string in the generator expression before calling join, like so:

    tweets = pd.Series(str(x) for x in range(1000))

Or you can be more explicit: create a list first, map str over it, and then call join.

    tweetlist = tweets.tolist()
    tweetstrs = map(str, tweetlist)   # convert every element to a string
    tweetstr = " ".join(tweetstrs)