python, python-2.7, text, data-structures, file-format

Parsing a semi-unstructured text file with Python: keeping some rows and pivoting others


I have a file with decent structure, but the date of each group of events (one or more) is printed only once. I can't figure out how to read the file, recognize the dates, and then map each date to every game result that follows it, until the next date appears.

The data looks like this:

Sa 19.11.2016 
FC Tuggen
FC Basel 1893 II 
1
3

SC Cham 
FC Zürich II 
0
1

SC Kriens
FC Köniz  
3
1

Sa 26.11.2016 
FC Bavois
SC Brühl  
1
4

Mi 30.11.2016 
FC Zürich II
FC Basel 1893 II 
2
2

Each date can apply to one or more game results. I've tried reading through the file and picking out the dates:

keys = []
for line in d:
    if line[0:2] in ('Sa','So','Mo','Di','Mi','Do','Fr'):
        keys.append(line[2:-1].strip())

But then I don't know how to assign the same date to the games that follow, until the next date arrives. For this I've tried various combinations of enumerate(), xrange(), etc. enumerate() didn't work the way I tried it, because I could only add the first game after each date.

My desired output looks as follows, or a defaultdict(list) with keys as the date and array elements as small dictionaries:

Sa 19.11.2016,FC Tuggen,FC Basel 1893 II,1,3
Sa 19.11.2016,SC Cham,FC Zürich II,0,1
Sa 19.11.2016,SC Kriens,FC Köniz,3,1
Sa 26.11.2016,FC Bavois,SC Brühl,1,4
Mi 30.11.2016,FC Zürich II,FC Basel 1893 II,2,2

Solution

  • Something as simple as the following might work, assuming that the input file has a format similar to what you have shown. Keep track of the last seen date in a variable.

    lastseendate = None
    gameinfo = []
    
    for line in f:
        if line[0:2] in ('Sa','So','Mo','Di','Mi','Do','Fr'):  # date row
            lastseendate = line.strip()
        elif len(line.strip()) == 0:  # empty line
            print(lastseendate + ',' + ','.join(gameinfo))  # print out the row for game just read before
            gameinfo = []  # ready to read the next game info
        else:
            gameinfo.append(line.strip())
    

    If there are too many possible two-letter prefixes before the date to hardcode, you could use a regular expression like the one below.

    import re
    pat = re.compile(r"[A-Za-z]{2} \d{2}\.\d{2}\.\d{4}")  # two letters, a space, then DD.MM.YYYY
    

    Then replace the line marked # date row with

    if pat.match(line):
    

    EDIT

    1. This piece of code does not print the info of the last game in the file unless there is an empty line at the end of the file. To fix this, either add an empty line at the end of the file or repeat the print statement after the loop ends.
    2. Removed \n in the print statement (unnecessary, as print already appends a newline).
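
Putting the pieces above together, here is a minimal sketch that combines the regular-expression date check with a final flush for the last game, and also builds the defaultdict(list) mentioned in the question. The names parse_games, DATE_ROW, sample and the dictionary keys are made up for illustration. It assumes every game is followed by a blank line (or the end of the file) and that each game has exactly four lines: two teams and two scores.

```python
# -*- coding: utf-8 -*-
from __future__ import print_function  # keeps print() working on Python 2.7

import re
from collections import defaultdict

# Two-letter weekday abbreviation, a space, then DD.MM.YYYY.
DATE_ROW = re.compile(r"[A-Za-z]{2} \d{2}\.\d{2}\.\d{4}")


def parse_games(lines):
    """Return one list per game: [date, home, away, home_goals, away_goals]."""
    last_seen_date = None
    game_info = []
    rows = []
    for raw in lines:
        line = raw.strip()
        if DATE_ROW.match(line):      # date row: remember it for later games
            last_seen_date = line
        elif line:                    # team name or score
            game_info.append(line)
        elif game_info:               # blank line closes the current game
            rows.append([last_seen_date] + game_info)
            game_info = []
    if game_info:                     # last game, if no trailing blank line
        rows.append([last_seen_date] + game_info)
    return rows


# Shortened copy of the data from the question (hypothetical variable name).
sample = """Sa 19.11.2016
FC Tuggen
FC Basel 1893 II
1
3

SC Cham
FC Zürich II
0
1

Mi 30.11.2016
FC Zürich II
FC Basel 1893 II
2
2"""

rows = parse_games(sample.splitlines())
for row in rows:
    print(",".join(row))

# Alternative shape: date -> list of small dictionaries.
by_date = defaultdict(list)
for date, home, away, home_goals, away_goals in rows:
    by_date[date].append({"home": home, "away": away,
                          "home_goals": home_goals, "away_goals": away_goals})
```

To read from a real file, pass the open file object instead of sample.splitlines(). Note that a date row arriving directly after a game, with no blank line in between, would not be handled by this sketch.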