python regex macos textedit

Add quotation marks at the start and end of every other line, ignoring empty lines


I need help organizing text. I have a list of thousands of vocabulary words in a CSV file. Each word has a term, a definition, and a sample sentence. The term and definition are separated by a tab, and the sample sentence is separated from them by an empty line.

For example:

exacerbate  worsen

This attack will exacerbate the already tense relations between the two communities

exasperate  irritate, vex

he often exasperates his mother with pranks

execrable   very bad, abominable, utterly detestable

an execrable performance

I want to reorganize this so that the sample sentence is enclosed in double quotation marks, has no empty line before or after it, and has the term replaced by a hyphen. All of this should happen while keeping the tab after the term, a new line at the beginning of each term, and only a single space between the definition and the example sentence. I need this format to import the list into a flashcard web application.

Desired outcome using above example:

exacerbate  worsen "This attack will – the already tense relations between the two communities"
exasperate  irritate, vex "he often – his mother with pranks"
execrable   very bad, abominable, utterly detestable "an – performance"

I am using a Mac. I know basic command-line usage (including regex) and Python, but not enough to figure this out by myself. I would be very grateful for any help.


Solution

  • Open a terminal in the directory where you have the input file. Save the following code in a .py file:

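    import sys
    import string
    import difflib


    with open(sys.argv[1]) as fobj:
        # entries and example sentences alternate, separated by empty lines
        lines = [block for block in fobj.read().split('\n\n') if block.strip()]

    with open(sys.argv[2], 'w') as out:
        for line1, example in zip(lines[::2], lines[1::2]):
            # map each lowercased, punctuation-free word to its original form
            words = {w.strip(string.punctuation).lower(): w.strip(string.punctuation)
                     for w in example.split()}

            # if the target word is not in the example sentence,
            # we will find the most similar one
            target = line1.split('\t')[0].strip().lower()
            if target in words:
                most_similar = words[target]
            else:
                matches = difflib.get_close_matches(target, list(words), 1)
                # leave the sentence unchanged when nothing is close enough
                most_similar = words[matches[0]] if matches else None
            new_example = example.replace(most_similar, '-') if most_similar else example
            out.write('{} "{}"\n'.format(line1.strip(), new_example.strip()))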
    

    The program needs the input file name and the output file name as command line arguments. That is, run the following command from the terminal:

    $ python program.py input.txt output.txt
    

    where program.py is the above program, input.txt is your input file, and output.txt is the file that will be created with the format you need.


    I ran the program against the example you provided. I had to add the tabs manually because the question contains only spaces. This is the output produced by the program:

    exacerbate  worsen "This attack will - the already tense relations between the two communities"
    exasperate  irritate, vex "he often - his mother with pranks"
    execrable   very bad, abominable, utterly detestable "an - performance"
    

    The program correctly substitutes exasperates with a dash in the second example, even though the term is exasperate. Without the actual file, I cannot guarantee that this technique will work for every word in it.
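Since the fuzzy match cannot be guaranteed for every entry, it may help to see in isolation how `difflib.get_close_matches` behaves on a few sentences. The sketch below is only an illustration: the helper name `find_word_to_replace` and the third sample sentence are made up for this example and are not part of the program above. It assumes the same approach, comparing the term against each punctuation-stripped word of the sentence.

```python
import difflib
import string


def find_word_to_replace(term, sentence, cutoff=0.6):
    """Return the word in `sentence` closest to `term`, or None.

    Hypothetical helper for illustration only.
    """
    # Strip surrounding punctuation but keep the original casing,
    # so the returned word can be used directly with str.replace.
    words = [w.strip(string.punctuation) for w in sentence.split()]
    lowered = [w.lower() for w in words]
    matches = difflib.get_close_matches(term.lower(), lowered, n=1, cutoff=cutoff)
    if not matches:
        return None
    return words[lowered.index(matches[0])]


samples = [
    ("exasperate", "he often exasperates his mother with pranks"),
    ("exacerbate", "This attack will exacerbate the already tense relations"),
    ("go", "She went home early"),  # made-up irregular form
]
results = [(term, find_word_to_replace(term, sentence)) for term, sentence in samples]
for term, word in results:
    print(term, "->", word if word else "no close match, check by hand")
```

With the default cutoff of 0.6, the first two samples resolve to exasperates and exacerbate, while the irregular form in the third sample finds no match. Entries like that one are the ones worth reviewing by hand after running the program on the full file.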