Re-formatting a text file


I am fairly new to Python. I have a text file full of common misspellings. The correct spelling of each word is prefixed with a $ character, and all misspelled versions of that word follow it, one per line.

mispelling.txt:

$year
eyar
yera
$years
eyars
eyasr
yeasr
yeras
yersa

I want to create a new text file based on mispelling.txt, in the following format.

new_mispelling.txt:

eyar->year
yera->year
eyars->years
eyasr->years
yeasr->years
yeras->years
yersa->years

Each misspelling and its correct spelling are on the same line, separated by ->, with the correct spelling on the right-hand side.


Question:

How do I read in the file, treat each line starting with $ as a new correct word, populate an output file, and save it to disk?

The purpose of this is to put my collected data in the same format as this open-source Wikipedia dataset of "all" commonly misspelled words, which doesn't contain my own words and misspellings.


Solution

  • As you process the file line-by-line, if you find a word that starts with $, set that as the "currently active correct spelling". Then each subsequent line is a misspelling for that word, so format that into a string and write it to the output file.

    current_word = ""
    with open("mispelling.txt") as f_in, open("new_mispelling.txt", "w") as f_out:
        for line in f_in:
            line = line.strip()  # Remove whitespace at start and end
            if not line:
                continue  # Skip blank lines so they don't produce "->word"
            if line.startswith("$"):
                # A correct spelling: drop the leading "$" and remember it
                current_word = line[1:]
            else:
                # A misspelling: pair it with the current correct spelling
                f_out.write(f"{line}->{current_word}\n")
    

    With your input file, this gives:

    eyar->year
    yera->year
    eyars->years
    eyasr->years
    yeasr->years
    yeras->years
    yersa->years
    

    The f"{line}->{current_word}\n" construct is called an f-string and is used for string interpolation in Python 3.6+.
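
If you want to try the same logic without touching files, the loop above can be sketched as a small function that works on any iterable of lines. The name convert_lines is hypothetical, and two behaviours are assumptions not stated in the question: blank lines are skipped, and a misspelling that appears before any $ line raises an error instead of being written with an empty correction.

```python
def convert_lines(lines):
    """Turn '$word' / misspelling lines into 'misspelling->word' strings.

    Hypothetical helper: `lines` can be an open file or a list of strings.
    """
    current_word = None  # No correct spelling seen yet
    result = []
    for raw in lines:
        line = raw.strip()
        if not line:
            continue  # Assumption: blank lines carry no data, so skip them
        if line.startswith("$"):
            # A correct spelling: drop the leading "$" and remember it
            current_word = line[1:]
        elif current_word is None:
            # Assumption: a misspelling with no preceding $ line is an input error
            raise ValueError(f"misspelling {line!r} appears before any $ line")
        else:
            result.append(f"{line}->{current_word}")
    return result


sample = ["$year", "eyar", "yera", "", "$years", "eyars"]
print(convert_lines(sample))
# ['eyar->year', 'yera->year', 'eyars->years']
```

To write the result to disk, you could pass an open file as lines and then write "\n".join(result) plus a final newline to the output file.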