I need help organizing text. I have a list of thousands of vocabulary words in a CSV file. Each word has a term, a definition, and a sample sentence. The term and definition are separated by a tab, and the sample sentence is separated from them by an empty line.
For example:
exacerbate worsen
This attack will exacerbate the already tense relations between the two communities
exasperate irritate, vex
he often exasperates his mother with pranks
execrable very bad, abominable, utterly detestable
an execrable performance
I want to reorganize this so that the sample sentence is enclosed in double quotation marks, has no empty line before or after it, and the term in the sentence is replaced by a hyphen. All of this should keep the tab after the term, the newline before each term, and only a single space between the definition and the example sentence. I need this format to import the list into a flashcard web application.
Desired outcome using the above example:
exacerbate worsen "This attack will – the already tense relations between the two communities"
exasperate irritate, vex "he often – his mother with pranks"
execrable very bad, abominable, utterly detestable "an – performance"
I am using a Mac. I know basic command-line usage (including regex) and Python, but not enough to figure this out by myself. I would be very grateful for any help.
Open the terminal in the directory that contains the input file.
Save the following code in a .py file:
    import sys
    import string
    import difflib

    # Term/definition lines and example sentences are separated by empty
    # lines; strip() avoids a stray empty chunk if the file ends with blank lines.
    with open(sys.argv[1]) as fobj:
        lines = fobj.read().strip().split('\n\n')

    with open(sys.argv[2], 'w') as out:
        # chunks alternate: term/definition line, then its example sentence
        for i in range(0, len(lines), 2):
            line1, example = lines[i:i + 2]
            words = [w.strip(string.punctuation).lower()
                     for w in example.split()]

            # if the target word is not in the example sentence,
            # we will find the most similar one
            target = line1.split('\t')[0]
            if target in words:
                most_similar = target
            else:
                most_similar = difflib.get_close_matches(target, words, 1)[0]
            new_example = example.replace(most_similar, '-')
            out.write('{} "{}"\n'.format(line1.strip(), new_example.strip()))
The program takes the input file name and the output file name as command-line arguments. That is, run the following command from the terminal:
$ python program.py input.txt output.txt
where program.py is the above program, input.txt is your input file, and output.txt is the file that will be created with the format you need.
I ran the program against the example you provided. I had to add the tabs manually because the question shows only spaces. This is the output produced by the program:
exacerbate worsen "This attack will - the already tense relations between the two communities"
exasperate irritate, vex "he often - his mother with pranks"
execrable very bad, abominable, utterly detestable "an - performance"
The program correctly substitutes exasperates with a dash in the second example, even though the term is exasperate. Without the full file, I cannot guarantee that this technique will work for every word in it.
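Two cases that might come up in a larger file: difflib.get_close_matches returns an empty list when nothing is similar enough, so indexing it with [0] raises IndexError, and str.replace is case-sensitive, so a term capitalized at the start of a sentence would not be replaced. Below is a minimal sketch of a more defensive replacement step. The helper name blank_out_term and its placeholder parameter are hypothetical, not part of the program above, and leaving the sentence unchanged when no match is found is only one possible choice.

```python
import difflib
import re
import string


def blank_out_term(term, example, placeholder='-'):
    """Replace the word in `example` closest to `term` with `placeholder`.

    Hypothetical helper: returns `example` unchanged when no word is
    similar enough, so such entries can be reviewed by hand.
    """
    # Strip surrounding punctuation so "pranks." is compared as "pranks".
    words = [w.strip(string.punctuation).lower() for w in example.split()]
    words = [w for w in words if w]

    if term.lower() in words:
        match = term
    else:
        close = difflib.get_close_matches(term.lower(), words, 1)
        if not close:
            return example  # nothing similar enough; leave as-is
        match = close[0]

    # \b restricts the replacement to whole words; IGNORECASE also
    # catches a capitalized form at the start of a sentence.
    pattern = r'\b' + re.escape(match) + r'\b'
    return re.sub(pattern, placeholder, example, flags=re.IGNORECASE)


print(blank_out_term('exasperate',
                     'he often exasperates his mother with pranks'))
```

In the program above, the call would take the place of the target lookup and the example.replace line, for instance new_example = blank_out_term(target, example).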