
How to parse text file using unique delimiters?


Python 3.5.2 on Spyder 2.x

I have thousands of text files in the following semi-structured format.

Below is one file, one.txt:

Goodsign:       Klisti upto 15:57         Bad Omen:     Gated zone      
 
 
Dusk Attack:        Uptime      Dusk Rest:      Winters

Below is the second file, second.txt:

Goodsign:       Kukul upto 12:60          Bad Omen:     Open zone       
 
 
Dusk Attack:        Downtime        Dusk Rest:      Summers Daring Tribe: Mojars of Moana

I want to parse both files and get the value for the label Goodsign:, which is "Klisti upto 15:57" in one.txt and "Kukul upto 12:60" in second.txt.

Likewise for the label Bad Omen:, the value should be "Gated zone" in the first file and "Open zone" in the second.

For the next set of labels I again want to ignore the &nbsp; entities and get the value for the label "Dusk Attack:", and then the same for the label "Dusk Rest:".

The problem is that, apart from the : delimiter, there seems to be a tab delimiter between the values. For example, between Downtime and Dusk Rest: there is a gap. Is this a tab, and how should I parse this kind of text?

I tried the code below, searching for a single label such as "Dusk Rest:", but it prints the whole line rather than one value. I need only the value "Downtime", whereas it gives me "Downtime Dusk Rest: Summers Daring Tribe: Mojars of Moana":

f = open('one.txt', 'r')
lines = f.readlines()
f.close()
searchtxt = "Dusk Rest:"
for i, line in enumerate(lines):
    if searchtxt in line and i+1 < len(lines):
        #print(lines[i+1])
        print(line)
        break

Many thanks in advance for your valuable answers!


Solution

  • Another way to work with these files is to split them on a regex, perhaps like this.

    The useful bits of information seem to be separated by at least two consecutive whitespace characters, so we can split on those. At the same time we can arrange to eliminate the leading non-breaking-space HTML entities, if we can assume that they are always of the form &nbsp;\s. Otherwise they would have to be treated separately. Having split the fields, we can use the list type's index method to find the field names among the split items and take the items between two names as the value. (This allows for the possibility that we have split the file's contents somewhere inappropriate; joining the items glues such a field back together.)
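
    As a quick check of the splitting idea, here is a minimal sketch that runs the same pattern on an in-memory string instead of a file. The sample text is an assumption: it mimics one.txt with each &nbsp; entity followed directly by a newline, which is the layout the pattern expects.

```python
import re

# Hypothetical stand-in for the contents of one.txt; the exact whitespace
# in the real files is an assumption.
sample = (
    "Goodsign:       Klisti upto 15:57         Bad Omen:     Gated zone      \n"
    "&nbsp;\n"
    "&nbsp;\n"
    "Dusk Attack:        Uptime      Dusk Rest:      Winters\n"
)

# Two or more whitespace characters end a field; any "&nbsp;" entities that
# immediately follow (each with one trailing whitespace character) are
# swallowed by the same separator.
items = re.split(r'\s{2,}(?:&nbsp;\s)*', sample.strip())
print(items)
```

    Labels and values come out as alternating list items. If an entity is followed by more than one whitespace character, the pattern can leave a stray '&nbsp;' item behind, which is the case that would have to be treated separately.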

    import re
    
    for file_name in ['one.txt', 'second.txt']:
        print(file_name)
        with open(file_name) as f:
            content = f.read()
        # Split on runs of two or more whitespace characters, also swallowing
        # any '&nbsp;' entities that directly follow the run.
        items = re.split(r'\s{2,}(?:&nbsp;\s)*', content)
        print(items)
        # Each value is whatever lies between its label and the next label.
        results = {}
        results['Goodsign:'] = ' '.join(items[1: items.index('Bad Omen:')])
        results['Bad Omen:'] = ' '.join(items[1+items.index('Bad Omen:'): items.index('Dusk Attack:')])
        results['Dusk Attack:'] = ' '.join(items[1+items.index('Dusk Attack:'): items.index('Dusk Rest:')])
        results['Dusk Rest:'] = ' '.join(items[1+items.index('Dusk Rest:'):])
        for result in results:
            print(result, results[result])
    

    And here's the output:

    one.txt
    ['Goodsign:', 'Klisti upto 15:57', 'Bad Omen:', 'Gated zone', 'Dusk Attack:', 'Uptime', 'Dusk Rest:', 'Winters']
    Bad Omen: Gated zone
    Goodsign: Klisti upto 15:57
    Dusk Attack: Uptime
    Dusk Rest: Winters
    second.txt
    ['Goodsign:', 'Kukul upto 12:60', 'Bad Omen:', 'Open zone', 'Dusk Attack:', 'Downtime', 'Dusk Rest:', 'Summers']
    Bad Omen: Open zone
    Goodsign: Kukul upto 12:60
    Dusk Attack: Downtime
    Dusk Rest: Summers
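
    If the set of labels is known in advance, the per-label slicing can be generalised. The following is a sketch of an alternative technique, not the code above: it searches for the labels directly and lets each value run up to the next known label, so it also copes with labels separated by a single space. The function name parse_fields, the label list and the sample text are all assumptions made for illustration.

```python
import re

def parse_fields(text, labels):
    """Return {label: value} for every known label found in text."""
    # Assumption: '&nbsp;' carries no data, so treat it as plain whitespace.
    text = text.replace('&nbsp;', ' ')
    # Alternation of the escaped labels, longest first.
    label_re = '|'.join(
        re.escape(label) for label in sorted(labels, key=len, reverse=True))
    # A value runs lazily until the next known label or the end of the text.
    pattern = re.compile(
        r'(%s)\s*(.*?)\s*(?=%s|\Z)' % (label_re, label_re), re.DOTALL)
    # Collapse whitespace runs inside each value to single spaces.
    return {m.group(1): ' '.join(m.group(2).split())
            for m in pattern.finditer(text)}

# Hypothetical stand-in for the contents of second.txt.
sample = (
    "Goodsign:       Kukul upto 12:60          Bad Omen:     Open zone       \n"
    "&nbsp;\n"
    "&nbsp;\n"
    "Dusk Attack:        Downtime        Dusk Rest:      Summers Daring Tribe: Mojars of Moana\n"
)
labels = ['Goodsign:', 'Bad Omen:', 'Dusk Attack:', 'Dusk Rest:', 'Daring Tribe:']
fields = parse_fields(sample, labels)
print(fields['Dusk Attack:'])
```

    The trade-off is that every label has to be listed up front. Any label missing from the list is simply absorbed into the preceding value.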