I have made my own corpus of misspelled words.
misspellings_corpus.txt:
English, enlist->Enlish
Hallowe'en, Halloween->Hallowean
I'm having an issue with my format. Thankfully, it is at least consistent.
Current format:
correct, wrong1, wrong2->wrong3
Desired format:
wrong1,wrong2,wrong3->correct
The order of the wrong<N> words isn't of concern.
There can be any number of wrong<N> words per line (separated by a comma: ,).
There is only one correct word per line (which should be to the right of ->).
Failed Attempt:
with open('misspellings_corpus.txt') as oldfile, open('new.txt', 'w') as newfile:
    for line in oldfile:
        correct = line.split(', ')[0].strip()
        print(correct)
        W = line.split(', ')[1].strip()
        print(W)
        wrong_1 = W.split('->')[0] # however, there might be loads of wrong words
        wrong_2 = W.split('->')[1]
        newfile.write(wrong_1 + ', ' + wrong_2 + '->' + correct)
Output new.txt (isn't working):
enlist, Enlish->EnglishHalloween, Hallowean->Hallowe'en
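Two things seem to go wrong in the attempt above, and the short sketch below reproduces both without touching any files. First, nothing writes a newline, so every rebuilt line ends up on one row. Second, only the field at index 1 is ever read, so a line with more than one comma has no -> in that field. The variable names rebuilt, longer, second_field and pieces are just for illustration.

```python
# Sample line from the corpus: exactly one comma, so the attempt "works" here.
line = "English, enlist->Enlish\n"
correct = line.split(', ')[0].strip()
W = line.split(', ')[1].strip()
rebuilt = W.split('->')[0] + ', ' + W.split('->')[1] + '->' + correct
# rebuilt has no trailing '\n', so consecutive writes land on the same row.

# A line with two commas: the field at index 1 is only 'wrong1'.
longer = "correct, wrong1, wrong2->wrong3\n"
second_field = longer.split(', ')[1].strip()
pieces = second_field.split('->')
# pieces has a single element, so pieces[1] would raise IndexError.
```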
Solution: (Inspired by @alexis)
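import re

with open('misspellings_corpus.txt') as oldfile, open('new.txt', 'w') as newfile:
    for line in oldfile:
        #line = 'correct, wrong1, wrong2->wrong3'
        line = line.strip()
        # split on commas (plus any following spaces) and on the arrow
        terms = re.split(r", *|->", line)
        # terms[0] is the correct word; everything after it is a misspelling
        newfile.write(",".join(terms[1:]) + "->" + terms[0] + '\n')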
Output new.txt:
enlist,Enlish->English
Halloween,Hallowean->Hallowe'en
Let's assume all the commas are word separators. I'll break each line on commas and arrows, for convenience:
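import re
line = 'correct, wrong1, wrong2->wrong3'
# split on commas (plus any following spaces) and on the arrow
terms = re.split(r", *|->", line)
# join with a bare comma to match the desired format: wrong1,wrong2,wrong3->correct
new_line = ",".join(terms[1:]) + "->" + terms[0]
print(new_line)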
You can put that back in a file-reading loop, right?
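In case it helps, here is a minimal sketch of that loop, assuming the same split pattern as above and that blank lines should simply be skipped (the original file may not have any). The helper names reformat_line and reformat_corpus are made up for this example, and io.StringIO stands in for the real files so the snippet runs on its own.

```python
import io
import re

def reformat_line(line):
    """Turn 'correct, wrong1, wrong2->wrong3' into 'wrong1,wrong2,wrong3->correct'."""
    terms = re.split(r", *|->", line.strip())
    return ",".join(terms[1:]) + "->" + terms[0]

def reformat_corpus(infile, outfile):
    """Rewrite every non-blank line of infile into outfile, one entry per line."""
    for line in infile:
        if line.strip():  # assumption: blank lines carry no entry and are dropped
            outfile.write(reformat_line(line) + "\n")

# In-memory stand-ins for misspellings_corpus.txt and new.txt.
sample = io.StringIO("English, enlist->Enlish\nHallowe'en, Halloween->Hallowean\n")
result = io.StringIO()
reformat_corpus(sample, result)
print(result.getvalue(), end="")
```

With real files, the same function can be called with the two handles from a with open(...) statement instead of the StringIO objects.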