data processing pipeline python

python, pipeline, data-processing


I am working on the following problem. Let's say I have data (for example RGB image values as integers), one value per line in a file. I want to read 10000 of these lines, build a frame object (an image frame containing 10000 RGB values), and send it to the downstream function in the processing pipeline, then read the next 10000 lines, build another frame object, and send that down the pipeline as well.

How can I set up this function so that it keeps making frame objects until the end of the file is reached? Is the following the right way to do it? Are there other neat approaches?

class frame_object(object):
    def __init__(self):
        self.line_cnt = 0
        self.buffer = []

    def make_frame(self, line):
        # collect lines until the frame is full
        if self.line_cnt < 9999:
            self.buffer.append(line)
            self.line_cnt += 1
        return self.buffer

Solution

  • You could use generators to create a data pipeline like in the following example:

    FRAME_SIZE = 10000
    
    
    def gen_lines(filename):
        with open(filename, "r") as fp:
            for line in fp:
                yield line.rstrip("\n")  # strip the trailing newline
    
    
    def gen_frames(lines):
        count = 0
        frame = []
    
        for line in lines:
            if count < FRAME_SIZE:
                frame.append(line)
                count += 1
    
            if count == FRAME_SIZE:
                # a full frame is ready; hand it downstream
                yield frame
                frame = []
                count = 0
    
        # emit the final, possibly partial, frame
        if count > 0:
            yield frame
    
    
    def process_frames(frames):
        for frame in frames:
            # do stuff with frame
            print(len(frame))
    
    
    lines = gen_lines("/path/to/input.file")
    frames = gen_frames(lines)
    process_frames(frames)
    

    In this way it's easier to see the data pipeline and to hook in different processing or filtering logic, for example by adding another generator stage as sketched below. You can learn more about generators and their use in data-processing pipelines here.
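
    As an illustration, a hypothetical gen_valid_frames stage (the name and the min_size check are assumptions for the sake of example, not part of the original answer) could be inserted between gen_frames and process_frames to filter frames before they are processed:

    def gen_valid_frames(frames, min_size=1):
        # pass through only frames that meet a minimum-size condition,
        # e.g. drop a trailing partial frame by using min_size=FRAME_SIZE
        for frame in frames:
            if len(frame) >= min_size:
                yield frame


    lines = gen_lines("/path/to/input.file")
    frames = gen_frames(lines)
    valid_frames = gen_valid_frames(frames, min_size=FRAME_SIZE)
    process_frames(valid_frames)

    Each stage only consumes the iterable produced by the previous stage, so stages can be added, removed, or reordered without touching the file-reading code.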