Tags: python, regex, python-3.x, utf-8, large-files

Multiline UTF8 regex replace on a large file


I have large UTF-8 text files that contain unwanted line breaks. Not all line breaks are bad, only those that break a sentence. I want to go through the files one by one, joining lines and writing the result to new output files. I want to use Python 3 (educational reasons!).

I tested the code below on small files. It almost works. But I have three problems.

  1. I get rubbish characters in the outfile near the substitutions.

  2. Is there a better practice for larger files (approx. 100 MB)? I presumably can't use line-by-line solutions such as this. Using mmap helps with the reading part, I think, but what about the writing part?

  3. Is there a more convenient way of dealing with UTF-8 than the one I have used? Appending .encode() and .decode() is tiresome, and I seem to be adding them at random. Can't I somehow tell the code: "everything is in UTF-8"?

(I am locating the unwanted line-breaks by a naive regex. I know it could be better, but that's not my concern at the moment.)

#!/opt/local/bin/python3.6
from mmap import mmap, ACCESS_READ
from re import compile,MULTILINE

q=compile('([a-z])\n+([a-z])'.encode(),MULTILINE)
with open("infile", 'rb', 0) as file, open("outfile", "wb") as outfile, \
mmap(file.fileno(), 0, access=ACCESS_READ) as s:
    outfile.write(q.sub("\1 \2".encode(),s))

Solution

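  • The following works for me (on Windows), although the read() call may negate whatever benefit you hoped to get from mmap, since it copies the whole file into memory. The "rubbish characters" almost certainly came from the missing r prefix on the replacement string: without it, Python reads "\1" and "\2" as octal escapes, so the control characters \x01 and \x02 were written instead of the captured groups. Your search pattern works even without the prefix, because "\n" simply becomes a literal newline, but using raw strings for all regex patterns is the safer habit.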

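    from mmap import mmap, ACCESS_READ
    from re import compile, MULTILINE
    
    # (?:\r?\n)+ matches one or more line breaks, Windows (\r\n) or Unix (\n).
    q = compile(r'([a-z])(?:\r?\n)+([a-z])', MULTILINE)
    
    # mmap only needs the file descriptor, so the input is opened in binary mode.
    # newline='' stops Python from translating the remaining line breaks on write.
    with open("regex_subst.txt", 'rb') as file, \
         open("outfile.txt", "w", encoding='utf-8', newline='') as outfile, \
         mmap(file.fileno(), 0, access=ACCESS_READ) as s:
    
        # read() copies the whole mapping into memory before it is decoded.
        outfile.write( q.sub(r"\1 \2", s.read().decode('utf-8')) )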
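
To see the effect of the missing r prefix in isolation, here is a minimal sketch on an in-memory string. The sample text and the variable names are made up for illustration; only the two replacement strings matter.

```python
import re

pattern = re.compile(r'([a-z])\n+([a-z])')
text = 'first\nsecond'

# Without the r prefix, Python itself turns "\1" and "\2" into the control
# characters U+0001 and U+0002 before re.sub ever sees them.
broken = pattern.sub("\1 \2", text)

# With the r prefix the backslashes survive, so re.sub treats them as
# references to capture groups 1 and 2.
fixed = pattern.sub(r"\1 \2", text)

print(repr(broken))  # 'firs\x01 \x02econd'
print(repr(fixed))   # 'first second'
```

The matched letters on either side of the line break are lost in the first case because nothing puts the captured groups back.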
    

    Here's a variant that doesn't call read() but still works. The re module can be used on either str or bytes, as long as the pattern, the replacement, and the text being searched are all of the same type.

    In the code below, both the pattern and the replacement string carry the prefixes r and b, making them bytes patterns instead of str patterns. This lets q.sub(br"\1 \2", s) operate directly on the mmap without raising TypeError: cannot use a string pattern on a bytes-like object. However, the result of the substitution is a bytes object, so it must be explicitly decoded before it is written to the UTF-8 text-mode output file, as shown.

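    from mmap import mmap, ACCESS_READ
    from re import compile, MULTILINE
    
    # Bytes pattern; (?:\r?\n)+ matches one or more Windows or Unix line breaks.
    q = compile(br'([a-z])(?:\r?\n)+([a-z])', MULTILINE)
    
    # mmap only needs the file descriptor, so the input is opened in binary mode.
    # newline='' stops Python from translating the remaining line breaks on write.
    with open("regex_subst.txt", 'rb') as file, \
         open("outfile.txt", "w", encoding='utf-8', newline='') as outfile, \
         mmap(file.fileno(), 0, access=ACCESS_READ) as s:
    
        # The bytes pattern runs directly on the mmap; only the result is decoded.
        outfile.write( q.sub(br"\1 \2", s).decode('utf-8') )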
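
For the larger files mentioned in the question, a line-by-line pass is possible after all if each undecided line break is held back until the next non-empty line has been seen. The following is only a sketch under that assumption, not a drop-in replacement for the regex: the names join_broken_lines and fix_file are made up for illustration, and unlike re.sub it also joins consecutive one-letter lines. Opening both files in text mode with encoding='utf-8' means no .encode() or .decode() calls are needed.

```python
def _is_lower(ch):
    # Same character class as the regex in the question: ASCII a-z only.
    return 'a' <= ch <= 'z'


def join_broken_lines(lines):
    """Yield pieces of text, turning line breaks that sit between two
    lowercase letters into a single space. `lines` must yield lines that
    still carry their line endings."""
    pending = ''  # line endings held back until the next non-empty line
    for line in lines:
        body = line.rstrip('\r\n')
        ending = line[len(body):]
        if not body:
            # Blank line: stays undecided if a break is pending, else passes through.
            if pending:
                pending += ending
            else:
                yield ending
            continue
        if pending:
            # The start of the next line is known now, so decide the held-back break.
            yield ' ' if _is_lower(body[0]) else pending
            pending = ''
        if ending and _is_lower(body[-1]):
            yield body
            pending = ending
        else:
            yield body + ending
    if pending:
        yield pending


def fix_file(src, dst):
    # newline='' disables newline translation in both directions, so line
    # endings that are not replaced are written back exactly as they were read.
    with open(src, 'r', encoding='utf-8', newline='') as fin, \
         open(dst, 'w', encoding='utf-8', newline='') as fout:
        fout.writelines(join_broken_lines(fin))
```

A call such as fix_file("infile", "outfile") then reads and writes one line at a time, so memory use depends on the longest line rather than on the file size.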