I have 400 million tweets (actually I think it's closer to 450 million, but never mind), in the form:
T "timestamp"
U "username"
W "actual tweet"
I want to write them to a file, initially in the form "username \t tweet", and then load them into a DB. Before loading into the DB, there are a few things I need to do:
1. Preprocess the tweet to remove RT@[names] and URLs
2. Take the username out of "http://twitter.com/username"
I am using Python, and this is the code. Please let me know how it can be made faster :)
'''The aim is to take all the tweets of a user and store them in a table.
Do this for all the users, and then let's see what we can do with it.
What you want is enough information about a user to profile them better.
So, let's get started.
'''
def regexSub(line):
    line = re.sub(regRT, '', line)
    line = re.sub(regAt, '', line)
    line = line.lstrip(' ')
    line = re.sub(regHttp, '', line)
    return line

def userName(line):
    return line.split('http://twitter.com/')[1]

import sys, os, itertools, re

data = open(sys.argv[1], 'r')
processed = open(sys.argv[2], 'w')

regRT = 'RT'
regHttp = re.compile('(http://)[a-zA-Z0-9]*.[a-zA-Z0-9/]*(.[a-zA-Z0-9]*)?')
regAt = re.compile('@([a-zA-Z0-9]*[*_/&%#@$]*)*[a-zA-Z0-9]*')

for line1, line2, line3 in itertools.izip_longest(*[data] * 3):
    line1 = line1.split('\t')[1]
    line2 = line2.split('\t')[1]
    line3 = line3.split('\t')[1]
    try:
        tweet = regexSub(line3)
        user = userName(line2)
    except:
        print 'Line2 is ', line2
        print 'Line3 is', line3
    processed.write(user.strip("\n") + "\t" + tweet)
I ran the code in the following manner:
python -m cProfile -o profile_dump TwitterScripts/Preprocessing.py DATA/Twitter/t082.txt DATA/Twitter/preprocessed083.txt
This is the output I get (warning: it's pretty big, and I did not filter out the small values, thinking they may also hold some significance):
Sat Jan 7 03:28:51 2012 profile_dump
3040835560 function calls (3040835523 primitive calls) in 2500.613 CPU seconds
Ordered by: call count
ncalls tottime percall cumtime percall filename:lineno(function)
528840744 166.402 0.000 166.402 0.000 {method 'split' of 'str' objects}
396630560 81.300 0.000 81.300 0.000 {method 'get' of 'dict' objects}
396630560 326.349 0.000 439.737 0.000 /usr/lib64/python2.7/re.py:229(_compile)
396630558 255.662 0.000 1297.705 0.000 /usr/lib64/python2.7/re.py:144(sub)
396630558 602.307 0.000 602.307 0.000 {built-in method sub}
264420442 32.087 0.000 32.087 0.000 {isinstance}
132210186 34.700 0.000 34.700 0.000 {method 'lstrip' of 'str' objects}
132210186 27.296 0.000 27.296 0.000 {method 'strip' of 'str' objects}
132210186 181.287 0.000 1513.691 0.000 TwitterScripts/Preprocessing.py:4(regexSub)
132210186 79.950 0.000 79.950 0.000 {method 'write' of 'file' objects}
132210186 55.900 0.000 113.960 0.000 TwitterScripts/Preprocessing.py:10(userName)
313/304 0.000 0.000 0.000 0.000 {len}
(I removed the entries whose call counts were really low, like 1 or 3.)
Please tell me what other changes can be made. Thanks!
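Update: one thing the dump above already hints at: `re.py:229(_compile)` shows over 300 seconds of its own time, and that function is just the per-call pattern lookup that the module-level `re.sub()` performs even when you hand it an already-compiled pattern. Calling `.sub()` on the pattern object directly skips that layer. A rough sketch (the `strip_urls` name and the URL pattern are illustrative, not the exact ones above):

```python
import re

# Compile once at module level and keep the pattern object.
reg_http = re.compile(r'http://\S+')

def strip_urls(line):
    # pattern.sub() goes straight to the matching engine; the
    # module-level re.sub() would first look the pattern up in
    # its cache on every single call.
    return reg_http.sub('', line)

print(strip_urls('see http://example.com/page now'))
```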
This is what multiprocessing is for.
You have a pipeline that can be broken into a large number of small steps. Each step is a Process that gets an item from a pipe, does a small transformation, and puts an intermediate result into the next pipe.
You'll have a Process which reads the raw file three lines at a time and puts the three lines into a Pipe. That's all.
You'll have a Process which gets a (T, U, W) triple from the pipe, cleans up the user line, and puts it into the next pipe.
Etc., etc.
Don't build too many steps to start with. Read - transform - write is a good beginning, to be sure you understand the multiprocessing module. After that, it's an empirical study to find the optimum mix of processing steps.
When you fire this thing up, it will spawn a number of communicating sequential processes that will consume all of your CPU resources but process the file relatively quickly.
Generally, more processes working concurrently is faster. You eventually reach a limit because of OS overheads and memory limitations.
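Sketched out, with multiprocessing.Queue standing in for the pipes (a Queue is safe to share among several worker processes), and with illustrative clean-up logic and file names rather than the question's exact regexes:

```python
import multiprocessing
import re

# Illustrative URL pattern, not the exact regex from the question.
reg_http = re.compile(r'http://\S+')

def reader(path, out_q, nworkers):
    # Step 1: read the raw file three lines at a time, put (T, U, W)
    # triples into the first queue, then send one sentinel per worker.
    with open(path) as f:
        while True:
            triple = [f.readline() for _ in range(3)]
            if not triple[0]:
                break
            out_q.put(triple)
    for _ in range(nworkers):
        out_q.put(None)

def transformer(in_q, out_q):
    # Step 2: clean up one triple at a time and pass the result along.
    while True:
        triple = in_q.get()
        if triple is None:
            out_q.put(None)
            break
        _, user_line, tweet_line = triple
        user = user_line.split('http://twitter.com/')[-1].strip()
        tweet = reg_http.sub('', tweet_line.split('\t')[-1])
        out_q.put('%s\t%s' % (user, tweet))

def writer(in_q, path, nworkers):
    # Step 3: write results until every worker has sent its sentinel.
    done = 0
    with open(path, 'w') as f:
        while done < nworkers:
            item = in_q.get()
            if item is None:
                done += 1
            else:
                f.write(item)

if __name__ == '__main__':
    # Tiny demo input in the question's T/U/W format.
    with open('raw.txt', 'w') as f:
        f.write('T\t2012-01-07 03:28:51\n'
                'U\thttp://twitter.com/alice\n'
                'W\thello http://example.com/x world\n')
    nworkers = 2
    q1, q2 = multiprocessing.Queue(), multiprocessing.Queue()
    procs = [multiprocessing.Process(target=reader, args=('raw.txt', q1, nworkers)),
             multiprocessing.Process(target=writer, args=(q2, 'out.txt', nworkers))]
    procs += [multiprocessing.Process(target=transformer, args=(q1, q2))
              for _ in range(nworkers)]
    for p in procs:
        p.start()
    for p in procs:
        p.join()
```

Each stage only ever touches one queue on each side, so you can add or remove transformer processes without changing the reader or the writer.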