
Easy way to process a large data file in Python


I have to process the data from a large file. The file has around 100,000 rows and 3 columns. The program below works well with a small test file, but with the large file it takes ages to produce even one result. Any suggestions for speeding up the loading and processing of a large data file?

Code (the computation is correct with a small test file; the input format is given below):

from collections import defaultdict
paircount = defaultdict(int)
pairtime = defaultdict(float)
pairper = defaultdict(float)

# get number of pair occurrences and total time
with open('input.txt', 'r') as f:
  with open('output.txt', 'w') as o: 
    numline = 0
    for line in f:
        numline += 1
        line = line.split()
        pair = line[0], line[1]
        paircount[pair] += 1
        pairtime[pair] += float(line[2])
        pairper = dict((pair, c * 100.0 / numline) for (pair, c) in paircount.iteritems())

    for pair, c in paircount.iteritems():
        #print pair[0], pair[1], c, pairper[pair], pairtime[pair]
        o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, pairper[pair], pairtime[pair]))

Input file:

5372 2684 460.0
1885 1158 351.0
1349 1174 6375.0
1980 1174 650.0
1980 1349 650.0
4821 2684 469.0
4821 937  459.0
2684 937  318.0
1980 606  390.0
1349 606  750.0
1174 606  750.0

Solution

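  • The pairper calculation is killing you and is not needed. It rebuilds the entire dictionary on every input line, so the work grows with the number of lines times the number of distinct pairs, and only the final version is ever used. You can use enumerate to count the input lines and just use that value at the end. This is similar to martineau's answer, except that it doesn't pull the entire input into memory (a bad idea) or even calculate pairper at all.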

    from collections import defaultdict
    paircount = defaultdict(int)
    pairtime = defaultdict(float)
    
    # get number of pair occurrences and total time
    with open('input.txt', 'r') as f:
      with open('output.txt', 'w') as o: 
        for numline, line in enumerate(f, 1):
            line = line.split()
            pair = line[0], line[1]
            paircount[pair] += 1
            pairtime[pair] += float(line[2])
    
        for pair, c in paircount.iteritems():
            #print pair[0], pair[1], c, c * 100.0 / numline, pairtime[pair]
            o.write("%s, %s, %s, %s, %s\n" % (pair[0], pair[1], c, c * 100.0 / numline, pairtime[pair]))
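    For anyone on Python 3, here is a minimal sketch of the same single-pass idea, wrapped in a function so it can be tried without touching any files. The function name summarize_pairs, the skipping of blank or short lines, and the sample lines are assumptions made for illustration (the third sample line is a made-up repeat so that one pair is counted twice). They are not part of the original program.

```python
from collections import defaultdict


def summarize_pairs(lines):
    """Return (first, second, count, percent, total_time) for each pair.

    `lines` is any iterable of text lines, e.g. an open file object,
    so the input is streamed rather than loaded into memory.
    """
    paircount = defaultdict(int)
    pairtime = defaultdict(float)
    numline = 0  # stays 0 if the input is empty

    for line in lines:
        fields = line.split()
        if len(fields) < 3:
            continue  # assumption: skip blank or short lines
        numline += 1
        pair = fields[0], fields[1]
        paircount[pair] += 1
        pairtime[pair] += float(fields[2])

    # The percentage is computed once per pair, after the single pass,
    # instead of rebuilding a dictionary on every input line.
    return [
        (a, b, c, c * 100.0 / numline, pairtime[(a, b)])
        for (a, b), c in paircount.items()  # items() replaces iteritems()
    ]


# Illustrative input: the third line is a made-up repeat of the first pair.
sample = [
    "5372 2684 460.0",
    "1885 1158 351.0",
    "5372 2684 40.0",
    "4821 937  459.0",
]
rows = summarize_pairs(sample)
for row in rows:
    print("%s, %s, %s, %s, %s" % row)
```

    To run it on the real data, pass the open input file as `lines` and write each returned row to the output file with the same format string.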