
Python: filter a large text file by quantile


Assume I am processing a very large text file; I have the following pseudocode:

xx_valueList = []
lines = []
for line in file:
    xx_value = calc_xxValue(line)
    xx_valueList.append(xx_value)
    lines.append(line)

# get_quantile_value is a function that returns the cutoff value at a specific quantile percent
cut_offvalue = get_quantile_value(xx_valueList, percent=0.05)
for line in lines:
    if calc_xxValue(line) > cut_offvalue:
        # do something here
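
The helper `get_quantile_value` is not defined in the question; a minimal stand-in (nearest-rank quantile, with the name and signature taken from the pseudocode, so treat it as an assumption) could look like:

```python
import math

def get_quantile_value(values, percent=0.05):
    """Return the value below which roughly `percent` of the data falls,
    using the nearest-rank method. Hypothetical stand-in for the helper
    named in the pseudocode above."""
    if not values:
        raise ValueError("values must be non-empty")
    ordered = sorted(values)
    # nearest-rank index: the ceil(percent * n)-th smallest value (1-based)
    k = max(0, math.ceil(percent * len(ordered)) - 1)
    return ordered[k]
```

Note this sorts the full list on every call, so calling it repeatedly on a growing list costs O(n log n) each time.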

Note that the file is very large and may come from a pipe, so I don't want to read it twice.

We must read the entire file before we can compute the cutoff used to filter it.

The above method works, but it consumes too much memory. Is there some algorithmic optimization that can improve efficiency and reduce memory consumption?


Solution

  • xx_value_list = []
    cut_offvalue = 0
    with open(file, 'r') as f:
        for line in f:
            xx_value = calc_xxValue(line)
            # keep only the float values, not the lines themselves
            xx_value_list.append(xx_value)
            # refresh the approximate cutoff every 100 lines
            if len(xx_value_list) % 100 == 0:
                cut_offvalue = get_quantile_value(xx_value_list, percent=0.05)
            if xx_value > cut_offvalue:
                # do something here
                pass

    This stores one float per line instead of the whole line and filters in a single pass using a running estimate of the cutoff. The trade-off is that early lines are judged against a stale cutoff (zero for the first 100 lines), so the filtering is approximate.
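
If even the list of per-line float values grows too large, the cutoff can be estimated in O(k) memory from a reservoir sample of the stream. This is a sketch of an alternative technique (reservoir sampling, Algorithm R), not the answer's method; the function name, `k`, and `seed` are assumptions:

```python
import random

def reservoir_sample_quantile(value_iter, percent=0.05, k=10_000, seed=0):
    """Estimate the `percent` quantile of a stream of numbers using a
    fixed-size reservoir sample (Algorithm R). Memory is O(k) regardless
    of stream length; accuracy improves as k grows."""
    rng = random.Random(seed)
    reservoir = []
    for i, v in enumerate(value_iter):
        if i < k:
            reservoir.append(v)
        else:
            # replace a random reservoir slot with probability k / (i + 1)
            j = rng.randrange(i + 1)
            if j < k:
                reservoir[j] = v
    reservoir.sort()
    idx = max(0, int(percent * len(reservoir)) - 1)
    return reservoir[idx]
```

This still needs a second pass (or stored lines) to apply the cutoff exactly, but it bounds memory when the values alone no longer fit.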