Tags: python, arrays, loops, iteration, memory-mapped-files

Python: Iterating Large Files


I am a beginner and would appreciate any alternatives for handling my problem. Simply put, I have two files, each containing one vector. The aim is to subtract every element of file 2 from every element of file 1, for all possible combinations. Everything works fine for small vectors, but the processing time becomes huge for larger files, such as ones with a million elements each.

Given below is a minimal working example. I have heard about memory mapping and would appreciate it if you could share a modified version or any relevant pointers for handling this issue.

import numpy as np

file1 = np.loadtxt('File_1.txt', delimiter=',')
file2 = np.loadtxt('File_2.txt', delimiter=',')
event1 = file1[:, 1]
event2 = file2[:, 1]

r1 = len(event1)
r2 = len(event2)

diff = []

# Brute force: compare every pair, O(r1 * r2)
for i in range(r1):
    for j in range(r2):
        delta = float(event1[i] - event2[j])

        if -4000 <= delta <= 4000:
            # note: np.append copies the whole array on every call,
            # which is itself a big slowdown; list.append is cheaper
            diff.append(delta)

np.savetxt('export.txt', diff)
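Since the question asks about memory mapping: a minimal sketch of what that could look like with `np.memmap`. It maps a binary file, not a text file, so the text column would first be converted once (the filenames here just mirror the question). Note that memory mapping only reduces RAM pressure; it does not change the O(r1·r2) cost of the double loop.

```python
import numpy as np

# Tiny stand-in for File_1.txt (two comma-separated columns).
with open('File_1.txt', 'w') as f:
    f.write('0,10.5\n1,2000.0\n2,-300.25\n')

# np.memmap needs a flat binary file, so convert the text column once.
events = np.loadtxt('File_1.txt', delimiter=',')[:, 1]
events.tofile('File_1.bin')

# Later runs can map the binary file instead of loading it into RAM;
# slices are read lazily from disk.
mapped = np.memmap('File_1.bin', dtype=events.dtype, mode='r')
print(mapped[:2])
```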

Solution

  • Here is a solution that runs in O(n log n + m log m + number_of_matches) time: sort both vectors, then make one pass with two pointers. In the worst case it can still degrade to O(n*m), since the number of possible outputs is that big (if all elements are close to each other in value). If there are few matches, this is much faster:

    # two-pointer approach over sorted inputs
    event1 = sorted(event1)
    event2 = sorted(event2)

    diff = []

    # iterate over the smaller vector
    # (note: swapping flips the sign of each reported difference)
    if len(event1) > len(event2):
        event1, event2 = event2, event1

    left = 0
    right = 0
    for e in event1:
        # advance left until event2[left] is within -4000 of e
        while left < len(event2) and event2[left] - e < -4000:
            left += 1
        # advance right until event2[right] is beyond +4000 of e
        while right < len(event2) and event2[right] - e <= 4000:
            right += 1

        # everything in [left, right) is inside the window
        for i in range(left, right):
            diff.append(e - event2[i])


    Testing on my machine, this is about 6 times faster on the example files. It will be much faster if delta is small relative to the numbers (few hits), and roughly the same (or even slower) if delta is very large (many hits).
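    The two-pointer logic above can be wrapped in a function and sanity-checked against the brute-force double loop on small random inputs. A sketch: `window_diffs` is the two-pointer version, and `window_diffs_np` is a vectorized variant of the same idea using `np.searchsorted` to locate each window (the function names and the searchsorted variant are additions here, not part of the original answer).

    ```python
    import numpy as np

    def window_diffs(a, b, delta=4000):
        """All pairwise differences a[i] - b[j] with |a[i] - b[j]| <= delta."""
        a, b = sorted(a), sorted(b)
        out = []
        left = right = 0
        for e in a:
            # first index of b within -delta of e
            while left < len(b) and b[left] - e < -delta:
                left += 1
            # first index of b beyond +delta of e
            while right < len(b) and b[right] - e <= delta:
                right += 1
            out.extend(e - b[i] for i in range(left, right))
        return out

    def window_diffs_np(a, b, delta=4000):
        """Vectorized variant: searchsorted finds each window in sorted b."""
        a = np.sort(np.asarray(a, dtype=float))
        b = np.sort(np.asarray(b, dtype=float))
        lo = np.searchsorted(b, a - delta, side='left')   # first b >= e - delta
        hi = np.searchsorted(b, a + delta, side='right')  # first b >  e + delta
        parts = [e - b[l:h] for e, l, h in zip(a, lo, hi)]
        return np.concatenate(parts) if parts else np.empty(0)

    # Sanity check both against the brute-force double loop.
    rng = np.random.default_rng(0)
    a = rng.integers(0, 20000, 60).tolist()
    b = rng.integers(0, 20000, 60).tolist()
    brute = sorted(x - y for x in a for y in b if -4000 <= x - y <= 4000)
    assert sorted(window_diffs(a, b)) == brute
    assert sorted(window_diffs_np(a, b).tolist()) == brute
    ```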