
Python: processing lines of a large document on the fly


I have a document that looks a bit like this:

key1 value_1_1 value_1_2 value_1_3 etc
key2 value_2_1 value_2_2 value_2_3 etc
key3 value_3_1 value_3_2 value_3_3 etc
etc

Each key is a string and each value is a float, all separated by spaces. Each line has hundreds of values associated with it, and there are hundreds of thousands of lines. Each line needs to be processed in a particular way, but because my program will only ever need the information from a small fraction of the lines, it seems like a giant waste of time to process every line immediately.

Currently, I just keep a list of the unprocessed lines and maintain a separate list containing each key. When I need to access a line, I use the key list to find the index of the line I need, then process the line at that index in the lines list. My program may look up the same line multiple times, which would mean redundantly processing the same line over and over again, but that still seems better than processing every single line right from the start.
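
For concreteness, my current approach looks roughly like this (the filename and the processing step are just placeholders):

keys = []
raw_lines = []
with open('data.txt') as f:  # placeholder filename
    for line in f:
        keys.append(line.split(' ', 1)[0])  # remember the key
        raw_lines.append(line)              # keep the line unprocessed

def process_line(line):
    # Placeholder for the real, expensive per-line processing:
    # parse the floats that follow the key.
    return [float(x) for x in line.split()[1:]]

# Only when a line is needed: find its index via the key, then process it.
index = keys.index('key2')
values = process_line(raw_lines[index])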

My question is, is there a more efficient way to do what I'm doing?

(and please let me know if I need to make any clarifications)

Thanks!


Solution

  • First, I would store your lines in a dict. That makes lookups by key much faster than scanning a separate key list for an index. Building this dict can be as simple as d = dict(line.split(' ', 1) for line in file_obj). If the keys have a fixed width, for example, you could speed this up a bit more by slicing the lines instead of splitting them.
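
    For instance, if every key happened to be exactly 4 characters wide (an assumption purely for illustration), the split could be replaced by slices:

    # Hypothetical fixed-width keys: 4 characters, then a single space.
    d = {line[:4]: line[5:] for line in file_obj}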

    Next, if the line processing is computationally heavy, you could cache the results so each line is processed at most once. I once worked this out by subclassing dict:

    class BufferedDict(dict):
        def __init__(self, file_obj):
            # Map each key to the raw, unprocessed remainder of its line.
            self.file_dict = dict(line.split(' ', 1) for line in file_obj)
    
        def __getitem__(self, key):
            # Process the raw line the first time the key is requested and
            # store the result, so repeated lookups return the cached value.
            if key not in self:
                self[key] = process_line(self.file_dict[key])
            return super(BufferedDict, self).__getitem__(key)
    
    def process_line(line):
        """Your computationally heavy line processing function"""
    

    This way, if you call my_buffered_dict[key], the line is only processed if a processed version isn't already stored.
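
    A minimal usage sketch, assuming process_line is filled in and treating the filename and key as placeholders:

    with open('data.txt') as file_obj:  # placeholder filename
        my_buffered_dict = BufferedDict(file_obj)

    values = my_buffered_dict['key2']   # processed now, on first access
    values = my_buffered_dict['key2']   # second lookup reuses the stored result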