python-3.x text awk large-data

How can I find the largest number in a very large text file (~150 GB)?


I have a text file that has around 100000000 lines, each of the following type:

string num1 num2 num3 ... num500
string num1 num2 num3 ... num40

I want to find the largest number present in this file.

My current code reads each line, splits it by space, and stores the largest number in the current line. Then, I compare it with the largest number of the next line, and retain the larger of the two.

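with open(filename, 'r') as f:
    prev_max = None  # None rather than -1, so all-negative files work too
    for line in f:
        # Skip the leading string; split() copes with repeated spaces and the newline
        nums = [int(n) for n in line.split()[1:]]
        if not nums:
            continue  # line has no numbers
        line_max = max(nums)  # built-in max; avoid shadowing it with a variable
        if prev_max is None or line_max > prev_max:
            prev_max = line_max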

But this takes forever. Is there a better way to do this?

I am open to solutions with awk or other shell commands as well.

Edit: Added how I am reading the file.


Solution

  • It's a trivial task for awk.

    awk 'NR==1{m=$2} {for(i=2;i<=NF;++i) if(m<$i) m=$i} END{print m}' file
    

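    If your file is guaranteed to contain at least one positive number (that is, it is not all zeroes or negative numbers), you can drop the NR==1{m=$2} part: an uninitialized m compares as 0, so any positive value replaces it.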
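
For anyone who would rather stay in Python, the same single-pass scan can be sketched as below. This is only an illustration, not a measured improvement: the function name largest_number is made up for this example, and it assumes every field after the first is a plain integer. It opens the file in binary mode so no text decoding is done, and lets the built-in max pick the largest value on each line.

```python
def largest_number(path):
    """Return the largest integer after the first field of any line, or None."""
    largest = None
    # Binary mode avoids decoding each line; int() accepts ASCII digit bytes.
    with open(path, "rb") as f:
        for line in f:
            fields = line.split()  # splits on any whitespace, drops the newline
            if len(fields) < 2:
                continue  # blank line, or a label with no numbers after it
            line_max = max(map(int, fields[1:]))
            if largest is None or line_max > largest:
                largest = line_max
    return largest
```

Calling largest_number("file") returns the maximum as an int, or None if no line contains a number. Starting from None instead of -1 plays the same role as NR==1{m=$2} in the awk version: it keeps the result correct when every number is zero or negative.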