Parsing input text in a strange format


I have an input document with data in the following format. The three example target words are 'overlook', 'lettered', and 'resignation'. Each is followed by a list of synonyms or, if none were found, just the word None. Because the target word is not included in the list of synonyms, I've prepended "tgws_" to it for identification purposes. Input document:

tgws_overlook
['omit', 'forget', 'ignore', 'discount']

tgws_overlook

['verb']
tgws_lettered

None
tgws_lettered

['adj.']
tgws_resignation

[ 'dejection', 'surrender', 'pessimism', 'defeatism', 'acceptance', 'abdication']
tgws_resignation

['noun']

Note that each target word appears twice; I only want it to appear once in the output. I need to read each line in and then write a new document in the format shown below (here, though, I'm just printing the output). If a string beginning with tgws_ is new, i.e. it hasn't been seen before, save it in a variable called target_word; if it has been seen before, ignore it. When the synonyms are None, print the target word, the part of speech (pos), and the word None on one line, separated by hyphens. Otherwise, print the target word on one line, the pos on the next line, and the synonyms on a third line. Here's what I'm looking for:

tgws_overlook
POS: verb
['omit', 'forget', 'ignore', 'discount']

tgws_lettered - adj. - None

tgws_resignation
POS: noun
[ 'dejection', 'surrender', 'pessimism', 'defeatism', 'acceptance', 'abdication']

Here's the code I wrote, which isn't quite working. It repeats each target word about six times, so something is wrong with the loop. Perhaps there is also a better way to do this.

def main():
    wordlist = []
    current_word = ""
    target_word = ""

    with open(input_filename, "r") as infile:
        counter = 0
        pos = ""

        for line in infile:
          
          if line.startswith("tgws_"):
            target_word = line
          if line.startswith(("['adv.']", "['pronoun']", "['conjunction']", "['noun']", "['verb']", "['adj.']" )):
              pos = line.strip("['']")
          elif line.startswith("['"):
              wordlist = line
          elif line.startswith("None"):
            wordlist = "[None]"
          print(target_word, pos, wordlist)
          current_word = target_word

if __name__ == "__main__":
    main()

Solution

  • Don't do the printing in the loop that reads the file. Create a dictionary that maps the target word to the wordlist and part of speech. At the end, print the dictionary in the format you want.

    import ast, collections
    
    def main():
        # Maps each target word to {'pos': ..., 'wordlist': ...}.
        words = collections.defaultdict(dict)
        target_word = None
        parts_of_speech = {"adv.", "pronoun", "conjunction", "noun", "verb", "adj."}
    
        # input_filename is assumed to be defined elsewhere, as in the question.
        with open(input_filename, "r") as infile:
            for line in infile:
                line = line.strip()  # drop the trailing newline and stray spaces
                if line.startswith("tgws_"):
                    target_word = line
                elif line.startswith("["):
                    data = ast.literal_eval(line)
                    # A one-element list holding a known tag is the part of speech.
                    if len(data) == 1 and data[0] in parts_of_speech:
                        words[target_word]['pos'] = data[0]
                    else:
                        words[target_word]['wordlist'] = data
                elif line == 'None':
                    words[target_word]['wordlist'] = None
    
        # Print once per target word, after the whole file has been read.
        for word, features in words.items():
            if features.get('wordlist') is None:
                print(f'{word} - {features["pos"]} - None')
            else:
                print(word)
                print(f'POS: {features["pos"]}')
                print(features['wordlist'])
            print()
    
    if __name__ == "__main__":
        main()
    

    Note that the code uses ast.literal_eval() to parse the lists, rather than ad hoc string processing.
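
    To try the same idea without a file, here is a minimal sketch that runs on an in-memory sample taken from the question. The helper names parse_entries and format_entry are hypothetical, introduced only for illustration, and the "?" fallback for a missing part of speech is an assumption rather than something the question asks for.

```python
import ast

# Part-of-speech tags taken from the question's own list.
PARTS_OF_SPEECH = {"adv.", "pronoun", "conjunction", "noun", "verb", "adj."}


def parse_entries(lines):
    """Hypothetical helper: map each target word to its pos and wordlist."""
    words = {}
    target_word = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("tgws_"):
            target_word = line
            # setdefault keeps the data already stored for a repeated word.
            words.setdefault(target_word, {})
        elif line.startswith("[") and target_word is not None:
            # literal_eval turns the text "['a', 'b']" into a real list.
            data = ast.literal_eval(line)
            if len(data) == 1 and data[0] in PARTS_OF_SPEECH:
                words[target_word]["pos"] = data[0]
            else:
                words[target_word]["wordlist"] = data
        elif line == "None" and target_word is not None:
            words[target_word]["wordlist"] = None
    return words


def format_entry(word, features):
    """Hypothetical helper: render one entry in the requested layout."""
    pos = features.get("pos", "?")  # assumption: "?" when no pos was seen
    wordlist = features.get("wordlist")
    if wordlist is None:
        return f"{word} - {pos} - None"
    return f"{word}\nPOS: {pos}\n{wordlist}"


sample = """tgws_overlook
['omit', 'forget', 'ignore', 'discount']

tgws_overlook

['verb']
tgws_lettered

None
tgws_lettered

['adj.']
"""

entries = parse_entries(sample.splitlines())
for word, features in entries.items():
    print(format_entry(word, features))
    print()
```

    Because the dictionary is keyed by the target word, the second occurrence of each word just adds to the existing entry, so every word is printed exactly once.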