Tags: python, twitter, split, tweepy, twitter-streaming-api

How to split Twitter streaming data and append the text to a CSV file?


I have a script that streams Twitter data filtered by one keyword. It streams the data into a CSV file, but each tweet has a multitude of objects attached to it, e.g. id, created_at, text, source, etc.

I only need a few of these objects appended to the CSV file, but even after splitting the data and appending only the text object, some tweets appear with all the tweet objects appended. It seems the tweets that are retweets split fine; normal tweets, however, do not.

This is my code:

import datetime
import time

import tweepy
from tweepy import OAuthHandler, Stream
from tweepy.streaming import StreamListener

ckey = 'xxxxxxxxxxxxxx'
csecret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxx'
atoken = 'xxxxxxxxxxxxxxxxx-xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'
asecret = 'xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx'

auth = OAuthHandler(ckey, csecret)
auth.set_access_token(atoken, asecret)
api = tweepy.API(auth)

def dateRange(start, end):
    current = start
    while (end - current).days >= 0:
        yield current
        current += datetime.timedelta(seconds=1)


class Tweetlistener(StreamListener):
    def on_data(self, data):
        startdate = datetime.datetime(2016,6,1)
        enddate = datetime.datetime(2016,6,7)
        for date in dateRange(startdate, enddate):
            try:
                ##This is where I split the data
                tweet = data.split(',"text":"')[1].split('","source')[0]


                saveThis = str(time.time())+'::'+tweet
                saveFile = open('test.csv', 'a')
                saveFile.write(saveThis)
                saveFile.write('\n')
                return True
            except ValueError:
                print("Something went wrong with streaming")
        saveFile.close()

    def on_error(self, status):
        print(status)


twitterStream = Stream(auth, Tweetlistener(), secure = True)
twitterStream.filter(track=['brexit'])

This is the result in the CSV file. The first cell is a retweet and it splits as I intend it to; the cell below isn't a retweet, and it appends all the tweet objects.

How would I be able to split the data and append only the text, created_at, retweet_count, location, and coordinates?

EDIT:

This is the raw data that's put into one row per tweet (not my data, just an example I found online; note it's the repr of a parsed tweet object, so some values like '_api' and 'user' have been stripped out):

{
 'contributors': None, 
 'truncated': False, 
 'text': 'My Top Followers in 2010: @tkang1 @serin23 @uhrunland @aliassculptor @kor0307 @yunki62. Find yours @ http://mytopfollowersin2010.com',
 'in_reply_to_status_id': None,
 'id': 21041793667694593,
 '_api': ,
 'author': ,
 'retweeted': False,
 'coordinates': None,
 'source': 'My Top Followers in 2010',
 'in_reply_to_screen_name': None,
 'id_str': '21041793667694593',
 'retweet_count': 0,
 'in_reply_to_user_id': None,
 'favorited': False,
 'retweeted_status': ,
 'source_url': 'http://mytopfollowersin2010.com', 
 'user': ,
 'geo': None, 
 'in_reply_to_user_id_str': None, 
 'created_at': datetime.datetime(2011, 1, 1, 3, 15, 29), 
 'in_reply_to_status_id_str': None, 
 'place': None

}

I would want my data to be one tweet per row in this kind of format:

"created_at":Wed Aug 27 13:08:45 +0000 2008::"text"::Example tweet::"retweet_count":154::"favorite_count":200::"coordinates":[-75.14310264,40.05701649]

Here, '::' separates the fields.


Solution

  • I think you're getting JSON data, so naturally, viewing such a file in Excel as a CSV is a bad idea.

    So is this line:

    tweet = data.split(',"text":"')[1].split('","source')[0]
    

    You instead need to parse the JSON and extract the fields you want by key. The structure varies between tweets (a retweet, for example, embeds the whole original tweet under retweeted_status), so slicing on fixed substrings will break unpredictably. For example:

    import json, csv

    def on_data(self, data):
        # Parse the raw payload string into a dict instead of slicing it
        tweet = json.loads(data)
        text = tweet["text"]
        source = tweet["source"]
        with open('test.csv', 'a') as f:
            writer = csv.writer(f)
            writer.writerow([text, source])
        return True  # keep the stream alive, as in the original listener
    

    The idea is that, rather than slicing the string apart based on certain substrings, you use its existing structure to your advantage and extract the necessary fields by name.
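
    Extending that to the exact fields from the question is straightforward. Below is a minimal sketch, not a definitive implementation: the key names follow the standard streaming payload, 'coordinates' and 'user' can be null (hence the .get() guards), and '::' is the separator the question asks for.

    import json

    def on_data(self, data):
        tweet = json.loads(data)
        # 'coordinates' and 'user' may be null, so guard before indexing
        coords = tweet.get('coordinates') or {}
        user = tweet.get('user') or {}
        fields = [
            str(tweet.get('created_at', '')),
            tweet.get('text', ''),
            str(tweet.get('retweet_count', 0)),
            user.get('location') or '',
            str(coords.get('coordinates', '')),
        ]
        with open('test.csv', 'a') as f:
            # '::' separates the fields, per the format in the question
            f.write('::'.join(fields) + '\n')
        return True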

    Side note: personally, I find that opening and closing a file for every message is operationally expensive, so I would suggest opening it once when the stream starts and closing it once when the stream stops, as sketched below.
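
    One way to do that, as a minimal sketch against tweepy's StreamListener (the filename and fields are illustrative; auth, Stream, and StreamListener are as set up in the question):

    import csv, json

    class Tweetlistener(StreamListener):
        def __init__(self, path='test.csv'):
            super().__init__()
            # Open the output file once, up front, rather than per message
            self.file = open(path, 'a', newline='')
            self.writer = csv.writer(self.file)

        def on_data(self, data):
            tweet = json.loads(data)
            self.writer.writerow([tweet.get('created_at', ''), tweet.get('text', '')])
            return True

        def on_error(self, status):
            print(status)

    listener = Tweetlistener()
    try:
        Stream(auth, listener).filter(track=['brexit'])
    finally:
        listener.file.close()  # close exactly once, when the stream ends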