Search code examples
pythonpython-3.xpathlib

Fastest Method to read many files line by line in Python


I have a conceptual question. I am new to Python and I am looking to do task that involves processing bigger log files. Some of these can get up to 5 and 6GB

I need to parse through many files in a location. These are text files.

I know of the with open() method, and recently just ran into pathlib. So I need to not only read the file line by line to extract values to upload into a DB, i also need to get file properties that Pathlib gives you and upload them as well.

Is it faster to use with open and underneath it, call a path object from which to read files... something like this:

for filename in glob('**/*.*', recursive=False):
    fpath = Path(filename)
    with open(filename, 'rb', buffering=102400) as logfile:
        for line in logfile:
            #regex operation
            print(line)

Or would it be better to use Pathlib:

with Path("src/module.py") as f:
    contents = open(f, "r")
    for line in contents:
        #regex operation
        print(line)

Also since I've never used Pathlib to open files for reading. When it comes to this: Path.open(mode=’r’, buffering=-1, encoding=None, errors=None, newline=None)

What does newline and errors mean? I assume buffering here is the same as buffering in the with open function?

I also saw this contraption that uses with open in conjuction with Path object though how it works, I have no idea:

path = Path('.editorconfig')
with open(path, mode='wt') as config:
    config.write('# config goes here')

Solution

  • pathlib is intended to be a more elegant solution to interacting with the file system, but it's not necessary. It'll add a small amount of fixed overhead (since it wraps other lower level APIs), but shouldn't change how performance scales in any meaningful way.

    Since, as noted, pathlib is largely a wrapper around lower level APIs, you should know Path.open is implemented in terms of open, and the arguments all mean the same thing for both; reading the docs for the built-in open will describe the arguments.

    As for the last bit of your question (passing a Path object to the built-in open), that works because most file-related APIs were updated to support any object that implements the os.PathLike ABC.