I'm looking to implement a few lines of Python, using re, to first manipulate a string and then use that string as a regex search. I have strings with runs of *'s in the middle of them, e.g. ab***cd, where the run of *'s can be any length. The aim is to run a regex search over a document and extract any lines that match the starting and finishing characters, with any number of characters in between. For example, ab12345cd, abbbcd and ab_fghfghfghcd would all be positive matches, while 1abcd, agcd and bb111cd would be negative matches.
I have come up with [\s\S]*? as the regex to substitute in place of the *'s. So from an example string of ab***cd I want to build ^ab[\s\S]*?cd, which I will then use for a regex search of the document.
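For example, the substitution step on its own would look roughly like this (just building the pattern from an example string, using the same approach as in the code below):
import re

raw_str = "ab***cd"
# collapse each run of *'s into the non-greedy "match anything" pattern
search_rgx = "^" + re.sub(r'\*+', r'[\\s\\S]*?', raw_str)
print(search_rgx)  # ^ab[\s\S]*?cd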
I then wanted to open the file with mmap, search through it using the regex, and save the matches to a file.
import re
import mmap

def file_len(fname):
    with open(fname) as f:
        for i, l in enumerate(f):
            pass
    return i + 1

def searchFile(list_txt, raw_str):
    search = "^" + raw_str  # add the regex ^ line-start anchor
    search_rgx = re.sub(r'\*+', r'[\\s\\S]*?', search)  # replace runs of * with the regex pattern
    # search file
    with open(list_txt, 'r+') as f:
        data = mmap.mmap(f.fileno(), 0)
        results = re.findall(bytes(search_rgx, encoding="utf-8"), data, re.MULTILINE)
    # save results
    f1 = open('results.txt', 'w+b')
    results_bin = b'\n'.join(results)
    f1.write(results_bin)
    f1.close()
    print("Found " + str(file_len("results.txt")) + " results")

searchFile("largelist.txt", "ab**cd")
Now this works fine with a small file. However, when the file gets larger (1 GB of text) I get this error:
Traceback (most recent call last):
  File "c:\Programming\test.py", line 27, in <module>
    searchFile("largelist.txt","ab**cd")
  File "c:\Programming\test.py", line 21, in searchFile
    results_bin = b'\n'.join(results)
MemoryError
Firstly, can anyone help optimize the code slightly? Am I doing something seriously wrong? I used mmap because I knew I would be looking at large files and I wanted to read the file line by line rather than all at once (hence someone suggested mmap).
I've also been told to have a look at the pandas library for more data manipulation. Would pandas replace mmap?
Thanks for any help. I'm pretty new to Python, as you can tell, so I appreciate any help.
How about this? In this situation, what you want is all of your lines represented as a list of strings. The following emulates that:
import io
longstring = """ab12345cd
abbbcd
ab_fghfghfghcd
1abcd
agcd
bb111cd
"""
list_of_strings = io.StringIO(longstring).read().splitlines()
list_of_strings
Outputs
['ab12345cd', 'abbbcd', 'ab_fghfghfghcd', '1abcd', 'agcd', 'bb111cd']
This is the part that matters:
import pandas as pd

s = pd.Series(list_of_strings)
s[s.str.match(r'^ab[\s\S]*?cd')]
Outputs
0         ab12345cd
1            abbbcd
2    ab_fghfghfghcd
dtype: object
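(Note that Series.str.match already anchors the pattern at the start of each string, so the leading ^ is redundant here, though harmless.)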
Edit2: Try this (I don't see a reason for you to want it as a function, but I've done it like that since that's what you did in the comments):
import pandas as pd

def newsearch(filename):
    with open(filename, 'r', encoding="utf-8") as f:
        list_of_strings = f.read().splitlines()
    s = pd.Series(list_of_strings)
    s = s[s.str.match(r'^ab[\s\S]*?cd')]
    s.to_csv('output.txt', header=False, index=False)

newsearch('list.txt')
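Note that this still reads the whole file into memory at once, so for something like your 1 GB file the chunked version below should keep memory use bounded.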
A chunk-based approach
import os
import pandas as pd

def newsearch(filename):
    outpath = 'output.txt'
    if os.path.exists(outpath):
        os.remove(outpath)
    for chunk in pd.read_csv(filename, sep='|', header=None, chunksize=10**6):
        chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
        chunk[0].to_csv(outpath, index=False, header=False, mode='a')

newsearch('list.txt')
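The sep='|' is just a separator that (presumably) never appears in your data, so each whole line is read as a single column, and chunksize=10**6 means only about a million lines are held in memory at a time.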
A dask approach
import dask.dataframe as dd

def newsearch(filename):
    chunk = dd.read_csv(filename, header=None, blocksize=25e6)
    chunk = chunk[chunk[0].str.match(r'^ab[\s\S]*?cd')]
    chunk[0].to_csv('output.txt', index=False, header=False, single_file=True)

newsearch('list.txt')
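With blocksize=25e6, dask splits the file into roughly 25 MB partitions and filters each partition independently (potentially in parallel), so memory use stays bounded regardless of file size.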