
Extracting text from text file with recurring nested pattern


I am struggling to extract text from a file. The text has the following format, where square brackets [] mark a delimiter.

File Text:

[Dataset 1] "text" [Filename 1] "text" [Filename 2] "text" [Key Data Delimiter] !key data! [Key Data Delimiter] "text" [Filename 3] "text" [Dataset 2] "text" [Filename 1] [Key Data Delimiter] key data [Key Data Delimiter] "text" [Filename 2] [Dataset 3]...

Desired Output:

[Dataset 1], [Filename 2], !key data!.
[Dataset 2], [Filename 1], !key data!.

The filename I want is the last [Filename] tag that appears before the key data delimiter within the same dataset, i.e. after the [Dataset] tag and before the next one. Only one file per dataset contains key data.

import re
with open('file.txt', 'r') as f:
    # Capture the text between each pair of [Key Data Delimiter] tags
    TextBetween_KeyDataDelimiter = re.findall(r'\[Key Data Delimiter\](.+?)\[Key Data Delimiter\]', f.read(), re.DOTALL)

I'm thinking of nested for loops with if/else statements, but that seems quite messy. Can someone please point me to the docs I should read?


Solution

  • Here's an option without regex, just some string and list manipulations. Somewhat convoluted, but it works:

    kds = """[Dataset 1] "text1" [Filename 1] "text2" [Filename 2] "text3" [Key Data Delimiter] !key data1![Key Data Delimiter] "text4" [Filename 3] "text5" [Dataset 2] "text6" [Filename 1] [Key Data Delimiter] key data2 [Key Data Delimeter] "text7" [Filename 2]"""
    
    # split the text file into datasets
    nkds = kds.replace('[Dataset','xxx[Dataset').split('xxx')
    
    for k in nkds[1:]:
        entry = ''
        # split each dataset into its bracketed components
        nk = k.replace('[','xxx[').split('xxx')[1:]
        # get the name of the dataset
        entry += nk[0].replace(']',']xxx').split('xxx')[0]
        # use a separate loop variable so the outer `k` is not shadowed
        for i, part in enumerate(nk):
            # find the position of the opening delimiter in the dataset list
            if '[Key Data Delimiter]' in part:
                # the component just before it holds the file name
                entry += nk[i-1].replace(']',']xxx').split('xxx')[0]
                # the key data is the text after the delimiter tag
                entry += part.split(']')[1].strip()
                break
        print(entry)
    

    Output:

    [Dataset 1][Filename 2]!key data1!
    [Dataset 2][Filename 1]key data2
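
Since the question asked about regex, here is a rough sketch of the same idea using the re module. It is only one possible approach, not the method from the answer above. It assumes the delimiter is spelled "[Key Data Delimiter]" consistently and that tags look like "[Dataset N]" and "[Filename N]"; the function name extract_key_data and the sample string are made up for illustration.

```python
import re

# Sample text modelled on the question. Assumption: the delimiter is
# spelled "[Key Data Delimiter]" everywhere in the file.
text = (
    '[Dataset 1] "text1" [Filename 1] "text2" [Filename 2] "text3" '
    '[Key Data Delimiter] !key data1! [Key Data Delimiter] "text4" '
    '[Filename 3] "text5" [Dataset 2] "text6" [Filename 1] '
    '[Key Data Delimiter] key data2 [Key Data Delimiter] "text7" [Filename 2]'
)

KEY = re.escape('[Key Data Delimiter]')

# One match per dataset: the tag itself, then everything up to the next
# "[Dataset " tag or the end of the text.
DATASET = re.compile(r'(\[Dataset [^\]]*\])(.*?)(?=\[Dataset |\Z)', re.DOTALL)


def extract_key_data(text):
    """Return (dataset, filename, key_data) tuples (hypothetical helper)."""
    results = []
    for m in DATASET.finditer(text):
        dataset, body = m.group(1), m.group(2)
        # Text between the first pair of key data delimiters in this dataset.
        key = re.search(KEY + r'(.+?)' + KEY, body, re.DOTALL)
        if key is None:
            continue  # this dataset has no key data
        # The owning file is the last [Filename ...] tag before the delimiter.
        filenames = re.findall(r'\[Filename [^\]]*\]', body[:key.start()])
        if filenames:
            results.append((dataset, filenames[-1], key.group(1).strip()))
    return results


for row in extract_key_data(text):
    print(', '.join(row))
```

The lookahead `(?=\[Dataset |\Z)` keeps each match inside a single dataset, so a filename from one dataset cannot be paired with key data from another. If the real file spells the delimiter inconsistently, the KEY pattern would need to be loosened accordingly.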