python file-io filesize disk

Estimating the number of lines in a file: mismatch between file size and total size of all lines


I have a few hundred files, each between tens of MB and a few GB in size, and I'd like to estimate the number of lines (i.e. an exact count is not needed). Each line is very regular, for example something like 4 long ints and 5 double floats.

I tried to find the average size of the first AVE_OVER lines in a file, then use that to estimate the total number of lines:

import os
import sys

nums = sum(1 for line in open(files[0]))
print "Number of lines = ", nums

AVE_OVER = 10
lineSize = 0.0
count = 0
for line in open(files[0]):
    lineSize += sys.getsizeof(line)
    count += 1
    if( count >= AVE_OVER ): break

lineSize /= count
fileSize = os.path.getsize(files[0])
numLines = fileSize/lineSize
print "Estimated number of lines = ", numLines

The estimate was way off:

> Number of lines =  505235
> Estimated number of lines =  324604.165863

So I tried summing the sizes of all lines in the file and comparing the total to the file size reported by os.path.getsize:

fileSize = os.path.getsize(files[0])
totalLineSize = 0.0
for line in open(files[0]):
    totalLineSize += sys.getsizeof(line)

print "File size = %.3e" % (fileSize)
print "Total Line Size = %.3e" % (totalLineSize)

But again the two numbers disagree:

> File size = 3.366e+07
> Total Line Size = 5.236e+07

Why is the sum of the sizes of all lines so much larger than the actual file size? How can I correct for this?


Edit: Algorithm I ended up with (ver 2.0); Thanks to @J.F.Sebastian

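import os

def estimateLines(files):
    """ Estimate the number of lines in the given file(s) """

    # A single filename is itself iterable, so test for a string explicitly
    if isinstance(files, str): files = [files]
    LEARN_SIZE = 8192

    # Get total size of all files
    totalSize = sum( os.path.getsize(fil) for fil in files )

    # Average line length in bytes over the first LEARN_SIZE bytes of the first file
    # (assumes that sample contains at least one newline)
    with open(files[0], 'rb') as fil:
        buf = fil.read(LEARN_SIZE)
        lineSize = len(buf) // buf.count(b'\n')

    return totalSize // lineSize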

Solution
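
On the "why" part of the question: sys.getsizeof reports the in-memory size of the Python object holding the line, which includes a fixed per-object header, while os.path.getsize reports bytes on disk. The numbers in the question fit this: 5.236e+07 / 505235 is about 103.6 bytes per line as measured by sys.getsizeof, against 3.366e+07 / 505235, or about 66.6 bytes per line on disk, a gap of roughly 37 bytes per line. A minimal sketch of the difference, assuming Python 3 and a made-up sample line (the exact overhead depends on the Python version and build):

```python
import sys

# Hypothetical sample line: 4 integers and 5 floats, as described in the question.
line = b"1 2 3 4 0.5 1.5 2.5 3.5 4.5\n"

on_disk = len(line)              # bytes this line occupies in the file
in_memory = sys.getsizeof(line)  # size of the Python bytes object, header included
overhead = in_memory - on_disk   # per-object cost added on top of the line's bytes

print(on_disk, in_memory, overhead)
```

Summing len(line) over lines read in binary mode, rather than sys.getsizeof(line), gives a total that matches the file size.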

  • To estimate the number of lines in a file:

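    import os

    def line_size_hint(filename, learn_size=1<<13):
        # Average line length in bytes over the first learn_size bytes (8 KiB by default);
        # assumes that sample contains at least one newline
        with open(filename, 'rb') as file:
            buf = file.read(learn_size)
            return len(buf) // buf.count(b'\n')

    number_of_lines_approx = os.path.getsize(filename) // line_size_hint(filename)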
    

    To find the exact number of lines, you could use a wc-l.py script that counts newlines in chunks read from standard input:

    #!/usr/bin/env python
    import sys
    from functools import partial
    
    print(sum(chunk.count('\n') for chunk in iter(partial(sys.stdin.read, 1 << 15), '')))
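
To see how close the estimate gets, the two approaches above can be compared on a file with regular lines. This is only a sketch, assuming Python 3: estimate_line_count and count_lines_exact are hypothetical names, the temporary file holds made-up data, and the guard for a sample without any newline is an addition that the answer does not include. The estimator also scales the sample's newline density up to the whole file instead of truncating the average line length to an integer, which is a small deviation from the formula above.

```python
import os
import tempfile

def estimate_line_count(filename, learn_size=1 << 13):
    """Estimate the line count from the newline density of the first learn_size bytes."""
    with open(filename, 'rb') as f:
        buf = f.read(learn_size)
    newlines = buf.count(b'\n')
    if newlines == 0:
        # Assumption: an empty file has 0 lines; a sample with no newline counts as 1 line.
        return 1 if buf else 0
    # Scale the sample's newline count by (file size / sample size).
    return os.path.getsize(filename) * newlines // len(buf)

def count_lines_exact(filename, chunk_size=1 << 15):
    """Count newline bytes by reading the file in fixed-size binary chunks."""
    with open(filename, 'rb') as f:
        return sum(chunk.count(b'\n') for chunk in iter(lambda: f.read(chunk_size), b''))

# Demo on a temporary file with very regular lines (made-up data).
with tempfile.NamedTemporaryFile('wb', suffix='.txt', delete=False) as tmp:
    for i in range(10000, 15000):
        tmp.write(b'%d %d %d %d 0.5 1.5 2.5 3.5 4.5\n' % (i, i, i, i))
    path = tmp.name

exact = count_lines_exact(path)
approx = estimate_line_count(path)
os.remove(path)
print(exact, approx)
```

Because every line in this demo has the same length, the estimate lands within a fraction of a percent of the exact count; files whose first few kilobytes are not representative of the rest (a long header, for instance) would do worse.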