
How to count the number of rows in an rpt file in Python without reading the entire file?


I have quite a lot of data; more precisely, an 8 GB rpt file.

Before processing it, I want to know how many rows it contains, which will help me estimate how long the processing will take. Reading an rpt file of that size into memory all at once obviously does not work, so I need to read it line by line. To find the number of lines, I wrote this simple Python script:

import pandas as pd

counter=0

for line in pd.read_fwf("test.rpt", chunksize=1):
    counter=counter+1
print(counter)

This seems to work, but it is quite slow, and actually parsing every line seems unnecessary.

Is there a way to get the number of rows without reading each line?

Many thanks


Solution

  • I'm not familiar with the .rpt file format, but if it can be read in as a text file (which I'm assuming it can if you're using pd.read_fwf) then you can probably just use Python's builtins for input/output.

    with open('test.rpt', 'r') as testfile:
        count = 0  # stays 0 if the file is empty
        # start=1 makes the last index equal to the number of lines read
        for count, line in enumerate(testfile, start=1):
            pass
        print(count)
    

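    This iterates over the file object one line at a time, so the whole file is never held in memory and no line is parsed into a DataFrame. The file is still read from start to end, but the builtin enumerate function only counts each line as it passes.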
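
If the loop above is still too slow, two variations may be worth trying. The sketch below is only an illustration: the function names `count_lines` and `estimate_lines`, the 1 MiB chunk size and the 1000-line sample are my own choices, not anything specific to rpt files. It assumes the file uses `\n` or `\r\n` line endings in a single-byte or UTF-8 encoding. `count_lines` gives an exact count by counting newline bytes in large binary chunks. `estimate_lines` does not read the whole file at all; it divides the file size by the average length of the first lines, which should be close for fixed-width data where every row has roughly the same length.

```python
import os


def count_lines(path, chunk_size=1024 * 1024):
    """Exact count: read binary chunks and count newline bytes."""
    count = 0
    last_byte = b""
    with open(path, "rb") as f:
        while True:
            chunk = f.read(chunk_size)
            if not chunk:
                break
            count += chunk.count(b"\n")
            last_byte = chunk[-1:]
    # A final line without a trailing newline still counts as a line.
    if last_byte and last_byte != b"\n":
        count += 1
    return count


def estimate_lines(path, sample_lines=1000):
    """Rough estimate: file size divided by the average sampled line length."""
    size = os.path.getsize(path)
    sampled_bytes = 0
    sampled = 0
    with open(path, "rb") as f:
        for line in f:
            sampled_bytes += len(line)
            sampled += 1
            if sampled >= sample_lines:
                break
    if sampled == 0:
        return 0  # empty file
    return round(size * sampled / sampled_bytes)
```

Calling `count_lines("test.rpt")` or `estimate_lines("test.rpt")` returns the number of lines in the file, including any header or footer lines it contains, so subtract those if you only want data rows.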