For a lab project, I'm analyzing Twitter data. Every tweet we captured contains the word 'sex', because that is the keyword we had the TwitterStreamer filter on.
I loaded the CSV that holds all of the tweet data (JSON metadata) into a pandas DataFrame and pulled out the 'text' column to isolate the tweet text.
import pandas as pd
# Load the captured tweets and keep only the tweet text
df = pd.read_csv('tweets_hiv.csv')
saved_column4 = df.text
print saved_column4
Out comes the correct output:
0 Some example tweet text
1 Oh hey look more tweet text @things I hate #stuff
...a bunch more lines
Name: text, Length: 8540, dtype: object
But when I try this:
from textblob import TextBlob
tweetstr = str(saved_column4)
tweets = TextBlob(tweetstr).upper()
print tweets.words.count('sex', case_sensitive=False)
My output is 22.
There should be AT LEAST as many occurrences of the word 'sex' as there are lines in the CSV, and likely more. I can't figure out what's happening here. Is TextBlob not handling a dtype: object Series correctly?
I'm not entirely sure this is methodologically sound as far as language processing goes, but using join will give you the count you need.
import pandas as pd
from textblob import TextBlob
tweets = pd.Series('sex {}'.format(x) for x in range(1000))
tweetstr = " ".join(tweets.tolist())
tweetsb = TextBlob(tweetstr).upper()
print tweetsb.words.count('sex', case_sensitive=False)
# 1000
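As for why the original attempt came up short, one likely explanation is that str() on a Series returns its printed form, which pandas abbreviates with "..." once the Series is long enough. The sketch below compares the two approaches under that assumption. It relies on default pandas display options and uses plain string splitting instead of TextBlob to stay dependency-light; the names printed_count and joined_count are just illustrative.

```python
import pandas as pd

tweets = pd.Series("sex {}".format(x) for x in range(1000))

# str() gives the abbreviated display form: a few head and tail rows,
# a "..." marker, and the "Length: ..., dtype: ..." footer
printed = str(tweets)

# join() concatenates every element, so nothing is dropped
joined = " ".join(tweets.tolist())

printed_count = printed.lower().split().count("sex")
joined_count = joined.lower().split().count("sex")

print("from str(): {}, from join(): {}".format(printed_count, joined_count))
```

The exact number coming out of str() depends on the pandas version and display settings, but it should be far below the true count of 1000.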
If you just need the number of tweets that contain the word, and don't necessarily need TextBlob, then just do:
import pandas as pd
tweets = pd.Series('sex {}'.format(x) for x in range(1000))
sex_tweets = tweets.str.contains('sex', case=False)
print sex_tweets.sum()
# 1000
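Note that str.contains gives one True per matching tweet and matches substrings, so it counts tweets rather than occurrences. If total occurrences of the whole word are what you want, something along these lines might work. This is a rough sketch on made-up sample tweets; the regex word boundary \b is my own choice, not something from the question.

```python
import re
import pandas as pd

# Hypothetical sample data, only for illustration
tweets = pd.Series([
    "Sex ed matters",
    "sex, sex and more sex",
    "Middlesex county news",
    "nothing here",
])

word_pat = r"\bsex\b"  # whole word only, so "Middlesex" is excluded

# Tweets containing the substring anywhere (case-insensitive)
substring_tweets = tweets.str.contains("sex", case=False).sum()

# Tweets containing the whole word
word_tweets = tweets.str.contains(word_pat, flags=re.IGNORECASE).sum()

# Total whole-word occurrences across all tweets
total_occurrences = tweets.str.count(word_pat, flags=re.IGNORECASE).sum()

print("{} {} {}".format(substring_tweets, word_tweets, total_occurrences))
# 3 2 4
```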
You can get a TypeError in the first snippet if one of your elements is not of type string. This is more of a join issue. A simple test can be done using the following snippet:
# tweets = pd.Series('sex {}'.format(x) for x in range(1000))
tweets = pd.Series(x for x in range(1000))
tweetstr = " ".join(tweets.tolist())
Which gives the following result:
Traceback (most recent call last):
File "F:\test.py", line 6, in <module>
tweetstr = " ".join(tweets.tolist())
TypeError: sequence item 0: expected string, numpy.int64 found
A simple workaround is to convert x into a string inside the generator expression before using join, like so:
tweets = pd.Series(str(x) for x in range(1000))
Or you can be more explicit and create a list first, map the str function to it, and then use join.
tweetlist = tweets.tolist()
tweetstr = map(str, tweetlist)
tweetstr = " ".join(tweetstr)
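With real CSV data, the non-string elements are often missing values rather than integers. A possible variant, assuming you would rather drop missing entries than turn them into the literal text "None" or "nan", is to clean the Series with pandas before joining. The sample values here are made up.

```python
import pandas as pd

# Hypothetical column mixing strings, a number, and a missing value
mixed = pd.Series(["sex ed", 42, None, "more sex talk"])

# Drop missing entries, then convert whatever remains to strings
clean = mixed.dropna().astype(str)

tweetstr = " ".join(clean)
print(tweetstr)
# sex ed 42 more sex talk
```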