
How to improve Python C Extensions file line reading?


This was originally asked as Are there alternative and portable algorithm implementation for reading lines from a file on Windows (Visual Studio Compiler) and Linux?, but it was closed as too broad, so here I am trying to reduce its scope to a more concise use case.

My goal is to implement my own file reading module for Python using Python C Extensions, with a line caching policy. The pure Python implementation, without any line caching policy, is this:

# This takes 1 second to parse 100MB of log data
with open('myfile', 'r', errors='replace') as myfile:
    for line in myfile:
        if 'word' in line: 
            pass

Summarizing the Python C Extensions implementation (see here for the full code, with the line caching policy):

// other code to open the file on the std::ifstream object and create the iterator
...

static PyObject * PyFastFile_iternext(PyFastFile* self, PyObject* args)
{
    std::string newline;

    if( std::getline( self->fileifstream, newline ) ) {
        return PyUnicode_DecodeUTF8( newline.c_str(), newline.size(), "replace" );
    }

    PyErr_SetNone( PyExc_StopIteration );
    return NULL;
}

static PyTypeObject PyFastFileType =
{
    PyVarObject_HEAD_INIT( NULL, 0 )
    "fastfilepackage.FastFile" /* tp_name */
};

// create the module
PyMODINIT_FUNC PyInit_fastfilepackage(void)
{
    PyFastFileType.tp_iternext = (iternextfunc) PyFastFile_iternext;
    Py_INCREF( &PyFastFileType );

    PyObject* thismodule;
    // other module code creating the iterator and context manager
    ...

    PyModule_AddObject( thismodule, "FastFile", (PyObject *) &PyFastFileType );
    return thismodule;
}

And this is the Python code which uses the Python C Extensions code to open a file and read its lines one by one:

import fastfilepackage

# This takes 3 seconds to parse 100MB of log data
iterable = fastfilepackage.FastFile( 'myfile' )
for item in iterable:
    if 'word' in iterable():
        pass

Right now, the Python C Extensions code fastfilepackage.FastFile, using C++11 std::ifstream, takes 3 seconds to parse 100MB of log data, while the Python implementation presented takes 1 second.

The content of the file myfile is just log lines with around 100~300 characters each. The characters are plain ASCII (modulo % 256), but due to bugs in the logger engine, it can emit invalid ASCII or Unicode characters. This is why I use the errors='replace' policy when opening the file.
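
For illustration, this is the effect of the errors='replace' policy on an invalid byte (a minimal example):

# 0xFF can never appear in valid UTF-8; errors='replace' maps it to U+FFFD instead of raising
print( b'valid text \xff end'.decode('utf-8', errors='replace') )
# prints: valid text � end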

I just wonder whether I can replace or improve this Python C Extension implementation, to reduce the 3 seconds it takes to run the Python program.

I used this to do the benchmark:

import time
import datetime
import fastfilepackage

# usually a file with 100MB
testfile = './myfile.log'

timenow = time.time()
with open( testfile, 'r', errors='replace' ) as myfile:
    for item in myfile:
        if None:
            var = item

python_time = time.time() - timenow
timedifference = datetime.timedelta( seconds=python_time )
print( 'Python   timedifference', timedifference, flush=True )
# prints about 1 second

timenow = time.time()
iterable = fastfilepackage.FastFile( testfile )
for item in iterable:
    if None:
        var = iterable()

fastfile_time = time.time() - timenow
timedifference = datetime.timedelta( seconds=fastfile_time )
print( 'FastFile timedifference', timedifference, flush=True )
# prints about 3 seconds

print( 'fastfile_time %.2f%%, python_time %.2f%%' % ( 
        fastfile_time/python_time*100, python_time/fastfile_time*100 ), flush=True )

Related questions:

  1. Reading file Line By Line in C
  2. Improving C++'s reading file line by line?

Solution

  • Reading line by line is going to cause unavoidable slowdowns here. Python's built-in text-oriented read-only file objects are actually three layers:

    1. io.FileIO - Raw, unbuffered access to the file
    2. io.BufferedReader - Buffers the underlying FileIO
    3. io.TextIOWrapper - Wraps the BufferedReader to implement buffered decode to str

    While iostream does perform buffering, it's only doing the job of io.BufferedReader, not io.TextIOWrapper. io.TextIOWrapper adds an extra layer of buffering, reading 8 KB chunks out of the BufferedReader and decoding them in bulk to str (when a chunk ends in an incomplete character, it saves off the remaining bytes to prepend to the next chunk), then yielding individual lines from the decoded chunk on request until it's exhausted (when a decoded chunk ends in a partial line, the remainder is prepended to the next decoded chunk).
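
    You can see these three layers on any text-mode file object returned by the built-in open() (a minimal sketch; 'myfile.log' is a placeholder path):

    with open('myfile.log', 'r', errors='replace') as myfile:
        print( type( myfile ) )             # <class '_io.TextIOWrapper'>
        print( type( myfile.buffer ) )      # <class '_io.BufferedReader'>
        print( type( myfile.buffer.raw ) )  # <class '_io.FileIO'>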

    By contrast, you're consuming a line at a time with std::getline, then decoding a line at a time with PyUnicode_DecodeUTF8, then yielding back to the caller; by the time the caller requests the next line, odds are at least some of the code associated with your tp_iternext implementation has left the CPU cache (or at least, left the fastest parts of the cache). A tight loop decoding 8 KB of UTF-8 text is going to go extremely fast; repeatedly leaving the loop and decoding only 100-300 bytes at a time is going to be slower.
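
    A rough way to see this effect from Python (a minimal sketch using timeit; the 200-byte lines and the ~8 KB chunk are assumptions mirroring the log lines and TextIOWrapper's chunking described above):

    import timeit

    # One ~8 KB chunk built from forty 200-byte "lines", and the same data pre-split per line.
    chunk = ('x' * 199 + '\n').encode('utf-8') * 40
    lines = chunk.splitlines(keepends=True)

    bulk = timeit.timeit(lambda: chunk.decode('utf-8', 'replace'), number=10000)
    per_line = timeit.timeit(lambda: [line.decode('utf-8', 'replace') for line in lines], number=10000)

    print('bulk decode    ', bulk)
    print('per-line decode', per_line)  # usually noticeably slower, from per-call overhead alone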

    The solution is to do roughly what io.TextIOWrapper does: Read in chunks, not lines, and decode them in bulk (preserving incomplete UTF-8 encoded characters for the next chunk), then search for newlines to fish out substrings from the decoded buffer until it's exhausted (don't trim the buffer each time, just track indices). When no more complete lines remain in the decoded buffer, trim the stuff you've already yielded, and read, decode, and append a new chunk.
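
    As a starting point, here is a rough pure Python sketch of that scheme (the function name read_lines and the path 'myfile.log' are made up, the 8192-byte chunk size mirrors io.TextIOWrapper's default, it splits the decoded chunk instead of tracking indices, and it only handles '\n' newlines). A C extension version would do the same thing with a char buffer and one PyUnicode_DecodeUTF8 call per chunk instead of per line:

    import codecs

    def read_lines(path, chunk_size=8192):
        # The incremental decoder keeps incomplete UTF-8 sequences and replays them
        # at the start of the next chunk, so characters split across chunks survive.
        decoder = codecs.getincrementaldecoder('utf-8')(errors='replace')
        pending = ''                      # decoded text not yet emitted as a complete line
        with open(path, 'rb') as stream:  # binary mode: decoding is done here, in bulk
            while True:
                chunk = stream.read(chunk_size)
                if not chunk:
                    break
                pending += decoder.decode(chunk)
                lines = pending.split('\n')
                pending = lines.pop()     # the last piece is an incomplete line (or '')
                for line in lines:
                    yield line + '\n'
            pending += decoder.decode(b'', final=True)
            if pending:
                yield pending             # trailing line with no final newline

    for line in read_lines('myfile.log'):
        if 'word' in line:
            pass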

    There is some room for improvement on Python's underlying implementation of io.TextIOWrapper.readline (e.g. they have to construct a Python level int each time they read a chunk and call indirectly since they can't guarantee they're wrapping a BufferedReader), but it's a solid basis for reimplementing your own scheme.

    Update: On checking your full code (which is wildly different from what you've posted), you've got other issues. Your tp_iternext just repeatedly yields None, requiring you to call your object to retrieve the string. That's... unfortunate. It more than doubles the Python interpreter overhead per item: tp_iternext is cheap to call, being quite specialized, while tp_call is not nearly so cheap, going through convoluted general purpose code paths and requiring the interpreter to pass an empty tuple of args you never use. Side-note: PyFastFile_tp_call should accept a third argument for the kwds, which you ignore but must still accept; casting to ternaryfunc is silencing the error, but this will break on some platforms.
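
    From the Python side the difference looks like this (a sketch; 'myfile.log' is a placeholder, and the second loop assumes tp_iternext is changed to return the decoded line directly):

    import fastfilepackage

    # Current scheme: tp_iternext yields None, and a separate call retrieves the line,
    # so every item pays the extra tp_call overhead.
    iterable = fastfilepackage.FastFile('myfile.log')
    for item in iterable:
        line = iterable()

    # Intended scheme: tp_iternext returns the decoded line itself; iteration alone suffices.
    iterable = fastfilepackage.FastFile('myfile.log')
    for line in iterable:
        if 'word' in line:
            pass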

    Final note (not really relevant to performance for all but the smallest files): the contract for tp_iternext does not require you to set an exception when the iterator is exhausted, just that you return NULL. You can remove the call to PyErr_SetNone( PyExc_StopIteration ); as long as no other exception is set, returning NULL alone indicates end of iteration, so you can save some work by not setting it at all.