I have a large CSV file with more than 1 million rows. Each row has two features: a callsite (the location of an API invocation) and the sequence of tokens leading to that callsite. They are written as:
callsite 1, token 1, token 2, token 3, ...
callsite 1, token 3, token 4, token 4, token 6, ...
callsite 2, token 3, token 1, token 6, token 7, ...
I want to shuffle the rows and split them into two files (for training and testing). The problem is that I want to split them according to the callsites instead of the rows, and there may be more than one row belonging to the same callsite. So I first read all the callsites, then shuffle and split them as follows:
import csv
import random

with open(file, 'r') as csv_file:
    reader = csv.reader(csv_file)
    callsites = [row[0] for row in reader]

random.shuffle(callsites)
test_callsites = callsites[0:n_test]  # n_test is the number of test cases
Then I read each row from the CSV file and check its callsite to decide whether it goes into train.csv or test.csv, as follows:
with open(file, 'r') as csv_file, open('train.csv', 'w') as train_file, open('test.csv', 'w') as test_file:
    reader = csv.reader(csv_file)
    train_writer = csv.writer(train_file)
    test_writer = csv.writer(test_file)
    for row in reader:
        if row[0] in test_callsites:
            test_writer.writerow(row)
        else:
            train_writer.writerow(row)
The problem is that this code is far too slow; it takes more than a day to finish. The membership check for each row makes the overall complexity O(n^2), and reading and writing row by row may also be inefficient. But I am afraid that loading all the data into memory would cause a memory error. Is there a better way to deal with large files like this?
Would it be quicker if I used a DataFrame to read and write it? The sequence length varies from row to row, though. I tried writing the data with all tokens as a list in one column:
callsite, sequence
callsite 1, [token 1||token 2||token 3]
However, it does not seem convenient to restore [token 1||token 2||token 3] back into a sequence. Is there a better practice for storing and restoring this kind of variable-length data?
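For reference, this is roughly what I mean by the || format, using the plain csv module (the file name and tokens here are just placeholders):

import csv

# Write one row per callsite, with the tokens joined into a single column.
with open('joined.csv', 'w', newline='') as out_file:
    writer = csv.writer(out_file)
    writer.writerow(['callsite 1', '||'.join(['token 1', 'token 2', 'token 3'])])

# Restore the variable-length sequence by splitting the joined column again.
with open('joined.csv', newline='') as in_file:
    for callsite, joined in csv.reader(in_file):
        tokens = joined.split('||')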
The simplest fix is to change:
test_callsites = callsites[0:n_test]
to
test_callsites = frozenset(callsites[:n_test]) # set also works; frozenset just reduces chance of mistakenly modifying it
This would reduce the work for each test of if row[0] in test_callsites: from O(n_test) to O(1), which would likely make a huge improvement if n_test is on the order of four digits or more (likely, when we're talking about millions of rows).
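If you want to see the difference for yourself, here is a rough micro-benchmark sketch (the sizes are made up; the actual numbers will vary by machine):

import timeit

callsites = [f"callsite {i}" for i in range(1_000_000)]
as_list = callsites[:100_000]   # membership test scans the list
as_set = frozenset(as_list)     # membership test is a hash lookup

probe = "callsite 999999"       # not in the selection, so the list scan is worst-case
print(timeit.timeit(lambda: probe in as_list, number=1_000))
print(timeit.timeit(lambda: probe in as_set, number=1_000))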
You could also slightly reduce the work of creating it in the first place (mostly by improving memory locality, since there is a smaller bin of things being selected) by changing:
random.shuffle(callsites)
test_callsites = callsites[0:n_test]
to:
test_callsites = frozenset(random.sample(callsites, n_test))
which avoids reshuffling the whole of callsites in favor of selecting n_test values from it (which you then convert to a frozenset, or just set, for cheap lookup). Bonus: it's a one-liner. :-)
Side note: your code is potentially wrong as written. You must pass newline='' to your various calls to open to ensure that the chosen CSV dialect's newline preferences are honored.
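For example:

with open('train.csv', 'w', newline='') as train_file:
    train_writer = csv.writer(train_file)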