Tags: python, pandas, mmap, large-files, python-re

MemoryError in Python when searching a large file using mmap and re.findall


I'm looking to implement a few lines of Python, using re, to first manipulate a string and then use that string for a regex search. I have strings with *'s in the middle of them, i.e. ab***cd, with the run of *'s being any length. The aim is to do a regex search in a document and extract any lines that match the starting and finishing characters, with any number of characters in between. So ab12345cd, abbbcd, and ab_fghfghfghcd would all be positive matches. Examples of negative matches: 1abcd, agcd, bb111cd.

I have come up with the regex [\s\S]*? to substitute for the *'s. So I want to get from an example string of ab***cd to ^ab[\s\S]*?cd, which I will then use for a regex search of the document.
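For example, here is that transformation on its own (the doubled backslashes in the replacement are needed so re.sub emits literal \s and \S):

import re

raw = "ab***cd"
# anchor to the line start, then swap each run of *'s for a
# non-greedy match-anything class
pattern = re.sub(r'\*+', r'[\\s\\S]*?', "^" + raw)
print(pattern)  # ^ab[\s\S]*?cd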

I then wanted to open the file with mmap, search through it using the regex, and save the matches to a file.

import re
import mmap 

def file_len(fname):
    # count lines by iterating the file (note: assumes the file is non-empty)
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def searchFile(list_txt, raw_str):
    search="^"+raw_str #add regex ^ newline operator
    search_rgx=re.sub(r'\*+',r'[\\s\\S]*?',search) #replace * with regex function

    #search file
    with open(list_txt, 'r+') as f: 
        data = mmap.mmap(f.fileno(), 0)
        results = re.findall(bytes(search_rgx,encoding="utf-8"),data, re.MULTILINE)

    #save results
    with open('results.txt', 'w+b') as f1:
        results_bin = b'\n'.join(results)
        f1.write(results_bin)

    print("Found "+str(file_len("results.txt"))+" results")

searchFile("largelist.txt","ab**cd")

Now this works fine with a small file. However, when the file gets larger (1 GB of text) I get this error:

Traceback (most recent call last):
  File "c:\Programming\test.py", line 27, in <module>
    searchFile("largelist.txt","ab**cd")
  File "c:\Programming\test.py", line 21, in searchFile
    results_bin = b'\n'.join(results)
MemoryError

Firstly, can anyone help optimize this code? Am I doing something seriously wrong? I used mmap because I knew I'd be looking at large files and I wanted to read the file line by line rather than all at once (hence someone suggested mmap).
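For illustration, a minimal line-by-line sketch of what I mean (search_lines is just an illustrative name; same file and pattern as above). It writes each match as it is found, so nothing accumulates in memory:

import re

def search_lines(list_txt, search_rgx):
    # search_rgx is the pattern built above, e.g. ^ab[\s\S]*?cd
    rgx = re.compile(search_rgx)
    count = 0
    with open(list_txt) as f, open('results.txt', 'w') as out:
        for line in f:
            if rgx.match(line):  # match() checks from the start of each line
                out.write(line)
                count += 1
    print("Found " + str(count) + " results")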

I've also been told to have a look at the pandas library for more data manipulation. Would pandas replace mmap?

Thanks for any help. I'm pretty new to Python, as you can tell, so I appreciate any pointers.


Solution

  • How about this? In this situation, what you want is a list of all of your lines represented as strings. The following emulates that:

    import io
    
    longstring = """ab12345cd
    abbbcd
    ab_fghfghfghcd
    1abcd
    agcd
    bb111cd
    """
    
    # StringIO stands in for an open file here; .read().splitlines()
    # mimics reading a file and splitting it into lines
    list_of_strings = io.StringIO(longstring).read().splitlines()
    list_of_strings
    

    Outputs

    ['ab12345cd', 'abbbcd', 'ab_fghfghfghcd', '1abcd', 'agcd', 'bb111cd']
    

    This is the part that matters (note that str.match already anchors at the start of each string, so the leading ^ is redundant but harmless):

    import pandas as pd

    s = pd.Series(list_of_strings)
    s[s.str.match(r'^ab[\s\S]*?cd')]
    

    Outputs

    0         ab12345cd
    1            abbbcd
    2    ab_fghfghfghcd
    dtype: object
    

    Edit 2: Try this. (I don't see a reason for you to want it as a function, but I've done it like that since that's what you did in the comments.)

    import pandas as pd

    def newsearch(filename):
        with open(filename, 'r', encoding="utf-8") as f:
            list_of_strings = f.read().splitlines()
        s = pd.Series(list_of_strings)
        s = s[s.str.match(r'^ab[\s\S]*?cd')]
        s.to_csv('output.txt', header=False, index=False)
    
    newsearch('list.txt')
    

    A chunk-based approach. Here pd.read_csv reads 10**6 lines at a time, so memory stays bounded; sep='|' is chosen as a character that should not occur in the data, so each whole line lands in a single column:

    import os
    import pandas as pd

    def newsearch(filename):
        outpath = 'output.txt'
        if os.path.exists(outpath):
            os.remove(outpath)  # start fresh, since chunks are appended below
        for chunk in pd.read_csv(filename, sep='|', header=None, chunksize=10**6):
            chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
            chunk[0].to_csv(outpath, index=False, header=False, mode='a')
    
    newsearch('list.txt')
    

    A dask approach, which runs the same filter in parallel over ~25 MB blocks of the file and writes a single output file:

    import dask.dataframe as dd

    def newsearch(filename):
        # same separator trick as the chunked version above: '|' should
        # not appear in the data, so each line stays one field
        chunk = dd.read_csv(filename, sep='|', header=None, blocksize=25e6)
        chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
        chunk[0].to_csv('output.txt', index=False, header=False, single_file=True)
    
    newsearch('list.txt')