Tags: python, pandas, mmap, large-files, python-re

MemoryError in Python when searching a large file using mmap and re.findall


I'm looking to implement a few lines of Python, using re, to first manipulate a string and then use that string for a regex search. I have strings with *'s in the middle of them, i.e. ab***cd, with the run of *'s being any length. The aim is to do a regex search in a document and extract any lines that match the starting and finishing characters, with any number of characters in between. So ab12345cd, abbbcd, and ab_fghfghfghcd would all be positive matches. Examples of negative matches: 1abcd, agcd, bb111cd.

I have come up with the regex [\s\S]*? to substitute for the *'s. So I want to get from an example string of ab***cd to ^ab[\s\S]*?cd, which I will then use for a regex search of the document.
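For example, here is that transformation on its own (the doubled backslashes in the replacement are needed so re.sub emits literal \s and \S):

import re

raw = "ab***cd"
# anchor to the line start, then swap each run of *'s for a
# non-greedy match-anything class
pattern = re.sub(r'\*+', r'[\\s\\S]*?', "^" + raw)
print(pattern)  # ^ab[\s\S]*?cd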

I then wanted to open the file with mmap, search through it using the regex, and save the matches to a file.

import re
import mmap 

def file_len(fname):
    # count lines by iterating the file (note: assumes the file is non-empty)
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def searchFile(list_txt, raw_str):
    search="^"+raw_str #add regex ^ newline operator
    search_rgx=re.sub(r'\*+',r'[\\s\\S]*?',search) #replace * with regex function

    #search file
    with open(list_txt, 'r+') as f: 
        data = mmap.mmap(f.fileno(), 0)
        results = re.findall(bytes(search_rgx,encoding="utf-8"),data, re.MULTILINE)

    #save results
    with open('results.txt', 'w+b') as f1:
        results_bin = b'\n'.join(results)
        f1.write(results_bin)

    print("Found "+str(file_len("results.txt"))+" results")

searchFile("largelist.txt","ab**cd")

Now this works fine with a small file. However, when the file gets larger (1 GB of text) I get this error:

Traceback (most recent call last):
  File "c:\Programming\test.py", line 27, in <module>
    searchFile("largelist.txt","ab**cd")
  File "c:\Programming\test.py", line 21, in searchFile
    results_bin = b'\n'.join(results)
MemoryError

Firstly, can anyone help optimize this code? Am I doing something seriously wrong? I used mmap because I knew I'd be looking at large files and I wanted to read the file line by line rather than all at once (hence someone suggested mmap).
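For illustration, a minimal line-by-line sketch of what I mean (search_lines is just an illustrative name; same file and pattern as above). It writes each match as it is found, so nothing accumulates in memory:

import re

def search_lines(list_txt, search_rgx):
    # search_rgx is the pattern built above, e.g. ^ab[\s\S]*?cd
    rgx = re.compile(search_rgx)
    count = 0
    with open(list_txt) as f, open('results.txt', 'w') as out:
        for line in f:
            if rgx.match(line):  # match() checks from the start of each line
                out.write(line)
                count += 1
    print("Found " + str(count) + " results")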

I've also been told to have a look at the pandas library for more data manipulation. Would pandas replace mmap?

Thanks for any help. I'm pretty new to Python, as you can tell, so I appreciate any pointers.


Solution

  • How about this? In this situation, what you want is a list of all of your lines represented as strings. The following emulates that:

    import io
    
    longstring = """ab12345cd
    abbbcd
    ab_fghfghfghcd
    1abcd
    agcd
    bb111cd
    """
    
    # StringIO stands in for an open file here; .read().splitlines()
    # mimics reading a file and splitting it into lines
    list_of_strings = io.StringIO(longstring).read().splitlines()
    list_of_strings
    

    Outputs

    ['ab12345cd', 'abbbcd', 'ab_fghfghfghcd', '1abcd', 'agcd', 'bb111cd']
    

    This is the part that matters (note that str.match already anchors at the start of each string, so the leading ^ is redundant but harmless):

    import pandas as pd

    s = pd.Series(list_of_strings)
    s[s.str.match(r'^ab[\s\S]*?cd')]
    

    Outputs

    0         ab12345cd
    1            abbbcd
    2    ab_fghfghfghcd
    dtype: object
    

    Edit 2: Try this. (I don't see a reason for you to want it as a function, but I've done it like that since that's what you did in the comments.)

    import pandas as pd

    def newsearch(filename):
        with open(filename, 'r', encoding="utf-8") as f:
            list_of_strings = f.read().splitlines()
        s = pd.Series(list_of_strings)
        s = s[s.str.match(r'^ab[\s\S]*?cd')]
        s.to_csv('output.txt', header=False, index=False)
    
    newsearch('list.txt')
    

    A chunk-based approach. Here pd.read_csv reads 10**6 lines at a time, so memory stays bounded; sep='|' is chosen as a character that should not occur in the data, so each whole line lands in a single column:

    import os
    import pandas as pd

    def newsearch(filename):
        outpath = 'output.txt'
        if os.path.exists(outpath):
            os.remove(outpath)  # start fresh, since chunks are appended below
        for chunk in pd.read_csv(filename, sep='|', header=None, chunksize=10**6):
            chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
            chunk[0].to_csv(outpath, index=False, header=False, mode='a')
    
    newsearch('list.txt')
    

    A dask approach, which runs the same filter in parallel over ~25 MB blocks of the file and writes a single output file:

    import dask.dataframe as dd

    def newsearch(filename):
        # same separator trick as the chunked version above: '|' should
        # not appear in the data, so each line stays one field
        chunk = dd.read_csv(filename, sep='|', header=None, blocksize=25e6)
        chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
        chunk[0].to_csv('output.txt', index=False, header=False, single_file=True)
    
    newsearch('list.txt')