I am struggling to extract text from a file. The text is in the following format with [] signifying a delimiter.
File Text:
[Dataset 1] "text" [Filename 1] "text" [Filename 2] "text" [Key Data Delimiter] !key data! [Key Data Delimiter] "text" [Filename 3] "text" [Dataset 2] "text" [Filename 1] [Key Data Delimiter] key data [Key Data Delimiter] "text" [Filename 2] [Dataset 3]...
Desired Output:
[Dataset 1], [Filename 2], !key data!.
[Dataset 2], [Filename 1], !key data!.
The filename to report is the one that immediately precedes the key data delimiter within the same dataset, that is, before the next Dataset marker. Only one file per dataset contains key data.
import re
with open('file.txt', 'r') as f:
    TextBetween_KeyDataDelimeter = re.findall('KeyDataDelimiter(.+?)KeyDataDelimiter', f.read(), re.DOTALL)
I'm thinking of nested for loops with if/else statements, but that seems quite messy. Can someone please point me to the docs I should read to help me out?
Here's an option without regex, just some string and list manipulations. Somewhat convoluted, but it works:
kds = """[Dataset 1] "text1" [Filename 1] "text2" [Filename 2] "text3" [Key Data Delimiter] !key data1![Key Data Delimiter] "text4" [Filename 3] "text5" [Dataset 2] "text6" [Filename 1] [Key Data Delimiter] key data2 [Key Data Delimiter] "text7" [Filename 2]"""

# split the text into datasets
nkds = kds.replace('[Dataset', 'xxx[Dataset').split('xxx')
for k in nkds[1:]:
    entry = ''
    # split each dataset into components, one per '[...]' marker
    nk = k.replace('[', 'xxx[').split('xxx')[1:]
    # get the name of the dataset
    entry += nk[0].replace(']', ']xxx').split('xxx')[0]
    for part in nk:
        # find the first component that starts with the key data delimiter
        if '[Key Data Delimiter]' in part:
            # the component just before it holds the file name
            file_ind = nk.index(part) - 1
            entry += nk[file_ind].replace(']', ']xxx').split('xxx')[0]
            # the key data is the text that follows the opening delimiter
            entry += part.split(']')[1].strip()
            break
    print(entry)
Output:
[Dataset 1][Filename 2]!key data1!
[Dataset 2][Filename 1]key data2
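For comparison, here is a rough regex-based sketch of the same idea, since the question started from re.findall. It scans the markers in order and remembers the most recent dataset and filename, so no nested loops are needed. The names extract_key_data and MARKER are made up for this example, and it assumes the markers appear literally as in the sample text and that the key data delimiter is spelled the same way on both sides of the key data.

```python
import re

# Hypothetical sample, mirroring the text above with consistent delimiter spelling.
text = ('[Dataset 1] "text1" [Filename 1] "text2" [Filename 2] "text3" '
        '[Key Data Delimiter] !key data1! [Key Data Delimiter] "text4" '
        '[Filename 3] "text5" [Dataset 2] "text6" [Filename 1] '
        '[Key Data Delimiter] key data2 [Key Data Delimiter] "text7" [Filename 2]')

# Matches any of the three marker kinds, e.g. "[Dataset 1]" or "[Key Data Delimiter]".
MARKER = re.compile(r'\[(Dataset|Filename|Key Data Delimiter)[^\]]*\]')

def extract_key_data(text):
    """Return (dataset, filename, key data) tuples, one per delimiter pair."""
    rows = []
    dataset = filename = None
    open_at = None  # offset just past an opening key data delimiter, if any
    for m in MARKER.finditer(text):
        kind = m.group(1)
        if kind == 'Dataset':
            # a new dataset resets everything remembered so far
            dataset, filename, open_at = m.group(0), None, None
        elif kind == 'Filename':
            # ignore filenames that appear inside the key data itself
            if open_at is None:
                filename = m.group(0)
        elif open_at is None:
            # opening delimiter: key data starts right after it
            open_at = m.end()
        else:
            # closing delimiter: key data is everything in between
            rows.append((dataset, filename, text[open_at:m.start()].strip()))
            open_at = None
    return rows

for row in extract_key_data(text):
    print(', '.join(row))
```

Output:
[Dataset 1], [Filename 2], !key data1!
[Dataset 2], [Filename 1], key data2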