I have an input document with data in the following format. The three example target words are 'overlook', 'lettered', and 'resignation'. Each is followed by a list of synonyms or, if none were found, just the word None. Because the target word is not included in the list of synonyms, I've prepended "tgws_" to it for identification purposes. Input document:
tgws_overlook
['omit', 'forget', 'ignore', 'discount']
tgws_overlook
['verb']
tgws_lettered
None
tgws_lettered
['adj.']
tgws_resignation
[ 'dejection', 'surrender', 'pessimism', 'defeatism', 'acceptance', 'abdication']
tgws_resignation
['noun']
Note that each target word appears twice; I only want it to appear once in the output. I need to read each line in and then write a new document with the data laid out as shown below (here, though, I'm just printing the output). If the string beginning with tgws_ is new, i.e. it hasn't been seen before, save it in a variable called target_word. If it has been seen before, ignore it. In the case of None, print the target word followed by a hyphen and the part of speech (pos), followed by a dash and the word None, all on one line. Otherwise, print the target word on one line, the pos on the next line, and the synonyms on a third line. Here's what I'm looking for:
tgws_overlook
POS: verb
['omit', 'forget', 'ignore', 'discount']
tgws_lettered – adj. - None
tgws_resignation
POS: noun
[ 'dejection', 'surrender', 'pessimism', 'defeatism', 'acceptance', 'abdication']
Here's the code I wrote, which isn't quite doing it. It repeats target words about 6 times... Something is wrong with the loop. And perhaps there is a better way to do this?
def main():
    wordlist = []
    current_word = ""
    target_word = ""
    with open(input_filename, "r") as infile:
        counter = 0
        pos = ""
        for line in infile:
            if line.startswith("tgws_"):
                target_word = line
            if line.startswith(("['adv.']", "['pronoun']", "['conjunction']", "['noun']", "['verb']", "['adj.']" )):
                pos = line.strip("['']")
            elif line.startswith("['"):
                wordlist = line
            elif line.startswith("None"):
                wordlist = "[None]"
            print(target_word, pos, wordlist)
            current_word = target_word

if __name__ == "__main__":
    main()
Don't do the printing in the loop that reads the file. Create a dictionary that maps the target word to the wordlist and part of speech. At the end, print the dictionary in the format you want.
import ast, collections

def main():
    # Maps each target word to a dict holding its 'pos' and 'wordlist'.
    words = collections.defaultdict(dict)
    target_word = None
    parts_of_speech = {"adv.", "pronoun", "conjunction", "noun", "verb", "adj."}
    # input_filename is assumed to be defined elsewhere, as in the question.
    with open(input_filename, "r") as infile:
        for line in infile:
            line = line.strip()  # drop the trailing newline before comparing
            if line.startswith("tgws_"):
                target_word = line
            elif line.startswith("["):
                data = ast.literal_eval(line)
                # A one-element list naming a part of speech is the POS line.
                if len(data) == 1 and data[0] in parts_of_speech:
                    words[target_word]['pos'] = data[0]
                else:
                    words[target_word]['wordlist'] = data
            elif line == 'None':
                words[target_word]['wordlist'] = None
    # Print only after the whole file has been read, one entry per target word.
    for word, features in words.items():
        if features.get('wordlist') is None:
            print(f'{word} - {features["pos"]} - None')
        else:
            print(word)
            print(f'POS: {features["pos"]}')
            print(features['wordlist'])
        print()
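To check the dictionary approach without a file on disk, the parsing step can be pulled into a function that accepts any iterable of lines. This is only a sketch: the names parse_entries, SAMPLE, and PARTS_OF_SPEECH are made up for illustration, and the sample is a shortened copy of the input from the question.

```python
import ast

# Shortened copy of the sample input from the question.
SAMPLE = """\
tgws_overlook
['omit', 'forget', 'ignore', 'discount']
tgws_overlook
['verb']
tgws_lettered
None
tgws_lettered
['adj.']
"""

PARTS_OF_SPEECH = {"adv.", "pronoun", "conjunction", "noun", "verb", "adj."}


def parse_entries(lines):
    """Hypothetical helper: map each target word to its 'pos' and 'wordlist'."""
    words = {}
    target_word = None
    for raw in lines:
        line = raw.strip()
        if line.startswith("tgws_"):
            target_word = line
            # setdefault keeps the existing entry when the word repeats.
            words.setdefault(target_word, {})
        elif line.startswith("["):
            data = ast.literal_eval(line)
            if len(data) == 1 and data[0] in PARTS_OF_SPEECH:
                words[target_word]["pos"] = data[0]
            else:
                words[target_word]["wordlist"] = data
        elif line == "None":
            words[target_word]["wordlist"] = None
    return words


entries = parse_entries(SAMPLE.splitlines())
for word, features in entries.items():
    print(word, features)
```

Because the function takes an iterable of lines, an open file object can be passed in the same way as SAMPLE.splitlines(). One caveat of the heuristic: a synonym list that consists of a single word such as 'noun' would be mistaken for the part-of-speech line.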
You can also use ast.literal_eval() to parse the lists, rather than ad hoc string processing.
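As a small illustration of what ast.literal_eval() does with lines like the ones in the question (the variable names here are just for the example):

```python
import ast

# A synonym line as it would be read from the file, trailing newline included.
line = "[ 'dejection', 'surrender', 'pessimism']\n"
synonyms = ast.literal_eval(line.strip())
print(synonyms)      # ['dejection', 'surrender', 'pessimism']
print(synonyms[0])   # dejection

# The part-of-speech line parses to a one-element list.
pos = ast.literal_eval("['adj.']")
print(pos[0])        # adj.

# The literal text None parses to the None object, not the string 'None'.
nothing = ast.literal_eval("None")
print(nothing is None)  # True
```

The result is a real Python list, so there is no need to strip brackets and quotes by hand. Lines that are not Python literals, such as the tgws_ lines, should still be handled separately, since literal_eval() raises an exception on them.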