I'm a new Python programmer (with more experience in R) using PyCharm Community Edition 2019.2.4 on a laptop running Windows 10. I'm attempting to extract a block of text between two delimiters, which usually looks like the following (the text sits between the delimiters, on separate lines):
Item 7.
text, text, text, text
text, text, text, text
Item 7A.
The problem I'm experiencing is that Item 7 and Item 7A can come in many different formats due to the initial pre-processing of the text files, for example:
Item 7.
text
Item 7A.
or
ITEM 7
text
ITEM 7A.
or
ITEM 7
text
ITEM 7A:
or
Item
7
text
Item
7A.
Item 7 and Item 7A can also appear inside larger blocks of text. This is an issue beyond my control.
I've examined 100 text files so far and have written the following code.
import glob
import os
import re
from os.path import isfile

path = filepath  # placeholder: directory containing the .txt files

for filename in glob.glob(os.path.join(path, '*.txt')):
    with open(filename) as f:
        data = f.read()
    # Capture everything between the literal strings "Item 7" and "Item 7A"
    x = re.findall(r'Item 7(.*?)Item 7A', data, re.DOTALL)
    "".join(x).replace('\n', ' ')
    print(x)
    file = open('C:/R_Practice/dale1.txt', 'w')
    file.write(str(x))
    file.close()
This deals with some, but not all of the cases, and even then it's not detecting everything. It won't be possible to analyse the full set of text files as there will be close to 250,000 for the full study. My questions are as follows.
Any help would be appreciated.
Instead of a literal space, use \s+ (which matches any run of whitespace, including line breaks) between Item and 7, and add the re.IGNORECASE flag so that ITEM matches as well:
import glob
import os
import re
from os.path import isfile

path = filepath  # placeholder: directory containing the .txt files

for filename in glob.glob(os.path.join(path, '*.txt')):
    with open(filename) as f:
        data = f.read()
    # \s+ replaces the literal space in both delimiters; IGNORECASE also matches "ITEM"
    x = re.findall(r'Item\s+7(.*?)Item\s+7A', data, re.DOTALL | re.IGNORECASE)
    "".join(x).replace('\n', ' ')
    print(x)
    file = open('C:/R_Practice/dale1.txt', 'w')
    file.write(str(x))
    file.close()
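If it helps, here is a slightly more defensive sketch of the same idea. It is only an illustration under a few assumptions: the names ITEM7_RE, extract_item7 and extract_folder are made up for this example, the files are assumed to be readable as UTF-8, and writing the results to a CSV file is my own choice rather than anything from the question. The word boundary after the 7 is there so that the opening delimiter cannot match "Item 7A", and the optional [.:] swallows the trailing punctuation seen in the sample formats.

```python
import csv
import glob
import os
import re

# Hypothetical pattern: case-insensitive, allows any whitespace (including
# line breaks) between "Item" and the number, and an optional "." or ":".
# The \b after 7 keeps the opening delimiter from matching "Item 7A".
ITEM7_RE = re.compile(
    r'\bItem\s+7\b[.:]?(.*?)\bItem\s+7A\b[.:]?',
    re.DOTALL | re.IGNORECASE,
)


def extract_item7(text):
    """Return every non-empty block found between 'Item 7' and 'Item 7A'."""
    blocks = []
    for match in ITEM7_RE.finditer(text):
        # Collapse newlines and runs of spaces into single spaces.
        block = " ".join(match.group(1).split())
        if block:
            blocks.append(block)
    return blocks


def extract_folder(folder, out_csv):
    """Write one CSV row per (file, block) so earlier results are not overwritten."""
    with open(out_csv, "w", newline="", encoding="utf-8") as out:
        writer = csv.writer(out)
        writer.writerow(["filename", "item7_text"])
        for filename in sorted(glob.glob(os.path.join(folder, "*.txt"))):
            # errors="replace" is an assumption: it avoids crashing on odd bytes.
            with open(filename, encoding="utf-8", errors="replace") as f:
                for block in extract_item7(f.read()):
                    writer.writerow([os.path.basename(filename), block])
```

Calling extract_item7 on each of the sample layouts in the question should give back just the text between the two delimiters. Files where it returns an empty list are the ones worth inspecting by hand, which may be a more practical way to find the remaining formats than reading all 250,000 files.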