I am performing the following operations on lists of words. I read lines in from a Project Gutenberg text file, split each line on spaces, perform general punctuation substitution, and then print each word and punctuation tag on its own line for further processing later. I am unsure how to replace every single quote with a tag while leaving apostrophes untouched. My current method is to use a compiled regex:
apo = re.compile("[A-Za-z]'[A-Za-z]")
and perform the following operation:
if "'" in word and not apo.search(word):
    word = word.replace("'", "\n<singlequote>")
but this ignores cases where a single quote is used around a word with an apostrophe. It also does not tell me whether the single quote abuts the start of a word or the end of a word.
Example input:
don't
'George
ma'am
end.'
didn't.'
'Won't
Example output (after processing and printing to file):
don't
<opensingle>
George
ma'am
end
<period>
<closesingle>
didn't
<period>
<closesingle>
<opensingle>
Won't
I do have a further question in relation to this task: since distinguishing <opensingle> from <closesingle> seems rather difficult, would it be wiser to perform substitutions like
word = word.replace('.','\n<period>')
word = word.replace(',','\n<comma>')
after performing the replacement operation?
What you really need to properly replace the starting and ending ' is a regex.
To match them, use ^' for a starting ' (opensingle) and '$ for an ending ' (closesingle).
Unfortunately, the replace method does not support regexes, so you should use re.sub instead.
Below you have an example program, printing your desired output (in Python 3):
import re
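text = "don't 'George ma'am end.' didn't.' 'Won't"
words = text.split(" ")
for word in words:
    # A quote at the very start of a word is an opening quote.
    word = re.sub(r"^'", '<opensingle>\n', word)
    # A quote at the very end of a word is a closing quote.
    word = re.sub(r"'$", '\n<closesingle>', word)
    # Mid-word apostrophes match neither anchor and are left alone.
    word = word.replace('.', '\n<period>')
    word = word.replace(',', '\n<comma>')
    print(word)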
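If you would rather collect the tokens than print them, the same substitutions can be wrapped in a small helper. This is only a sketch: the name tag_word is made up for illustration, and it assumes each word has already been split on spaces and carries no trailing newline.

```python
import re

def tag_word(word):
    """Split one space-delimited word into word and tag tokens.

    tag_word is a hypothetical helper, not part of any library.
    """
    # Quote anchored at the start of the word: opening quote.
    word = re.sub(r"^'", "<opensingle>\n", word)
    # Quote anchored at the end of the word: closing quote.
    word = re.sub(r"'$", "\n<closesingle>", word)
    # Remaining punctuation; apostrophes inside the word are untouched.
    word = word.replace(".", "\n<period>")
    word = word.replace(",", "\n<comma>")
    # Each token was separated by a newline above.
    return word.split("\n")

tokens = tag_word("'Won't.'")
```

Note that this handles a quoted word that also contains an apostrophe, but it shares the limits of the anchors: a closing quote followed by other punctuation (for example 'end',) does not sit at the end of the word, so '$ will not match it.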