Tags: python, python-3.x, file, optimization, refactoring

Optimizing a file search method in Python


I am trying to optimize the method below. It is the core of my project (close to 95% of the run time is spent in it). It reads the file line by line and, if the tid is in the line, collects the first number on that line, which is the document id (did). A few lines of the file, for example:

5168  268:0.0482384162801528 297:0.0437108092315354 352:0.194373864228161
5169  268:0.0444310314892627 271:0.114435072663748 523:0.0452228057908503

The current implementation calls tid_add_colon_in_front(tid) because the tid is passed in as a plain string; did_tids_file is the already-opened file that holds the data.

Any ideas as to how I can improve it any further will be welcome!

def dids_via_tid(tid) -> set:
    did_tids_file.seek(0)
    dids = set()

    # Format the tid so it can be found in the file
    tid = tid_add_colon_in_front(tid)
    did_str = ""

    # Read the did character by character to avoid line.split
    for line in did_tids_file:
      did_str = ""

      if tid in line:
        for char in line:
          if char == " ":
            break
          did_str += char

        dids.add(did_str)
    return dids

My previous implementation used line.split, which returns a list and, as far as I know, costs more memory and time when dealing with very large amounts of data.

I have also tried reading the file with readline, as below, but it didn't improve the performance

line = myFile.readline()
while line:
    # Do work
    line = myFile.readline()

Solution

  • If you are looking for multiple tids, you should search for all of them in a single pass through the file. That will be much faster than one pass per tid, especially if the file has 100K or more lines.

    # tid line finder
    
    import re
    from collections import defaultdict
    
    
    def tid_searcher(filename, tids_of_interest):
        res = defaultdict(list)
        with open(filename, 'r') as src:
            for line in src:
                line_tids = set(re.findall(r'(\d+):', line)) # re:  group of one or more digits followed by colon
                hits = tids_of_interest & line_tids  # set intersection
                if hits:
                    line_no = re.search(r'\A\d+', line).group(0) # re: one or more digits at start of string
                    for hit in hits:
                        res[hit].append(line_no)
    
        return res
    
    tids_of_interest = {'268', '271'}
    filename = 'data.txt'
    
    print(tid_searcher(filename, tids_of_interest))
    
    # defaultdict(<class 'list'>, {'268': ['5168', '5169'], '271': ['5169']})
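
If the same file is queried many times, the single-pass idea can be taken one step further: read the file once, build a tid-to-dids mapping, and answer every later lookup from memory. The following is only a minimal sketch under assumptions: the name build_tid_index is hypothetical, the file is assumed to fit in memory as an index, and every line is assumed to look like the samples in the question (a did, then whitespace-separated tid:weight pairs).

```python
from collections import defaultdict


def build_tid_index(lines):
    """Map each term id to the set of document ids whose line contains it.

    `lines` is any iterable of strings, e.g. an open file object.
    Hypothetical helper, not part of the original code.
    """
    index = defaultdict(set)
    for line in lines:
        # Split off the document id; the rest holds the "tid:weight" pairs.
        did, _, rest = line.partition(" ")
        for pair in rest.split():
            tid, sep, _ = pair.partition(":")
            if sep:  # skip tokens that are not of the form "tid:weight"
                index[tid].add(did)
    return index


# Sample lines copied from the question; with a real file, pass the open
# file object instead of a list.
sample = [
    "5168  268:0.0482384162801528 297:0.0437108092315354 352:0.194373864228161\n",
    "5169  268:0.0444310314892627 271:0.114435072663748 523:0.0452228057908503\n",
]
index = build_tid_index(sample)
print(sorted(index["268"]))  # ['5168', '5169']
print(sorted(index["271"]))  # ['5169']
```

Building the index costs one full pass plus the memory for the mapping, so it only pays off when there are many lookups against the same file. Matching on the exact tid token also avoids a substring hit such as tid 68 matching inside 268.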