I am looking for a way to go through a sentence to see if an apostrophe is a quote or a contraction so I can remove punctuation from the string, and then normalize all words.
My test sentence is: don't frazzel the horses. 'she said wow'.
In my attempts I have split the sentence into parts, tokenizing on words and non-words, like so:
contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"]
sentence = "don't frazzel the horses. 'she said wow'.".split(/(\w+)|(\W+)/).reject { |word| word.empty? }
This returns ["don", "'", "t", " ", "frazzel", " ", "the", " ", "horses", ". '", "she", " ", "said", " ", "wow", "'."]
Next I want to iterate over sentence looking for apostrophes (') and, when one is found, compare the next element to see if it is included in the contractionEndings array. If it is, I want to join the prefix, the apostrophe, and the suffix into one element; otherwise I want to remove the apostrophe.
In this example, "don", "'", and "t" would be joined into "don't" as a single element, but ". '" and "'." would be removed.
Afterwards I can run a regex to remove other punctuation from the sentence so that I can pass it into my stemmer to normalize the input.
The final output I am after is "don't frazzel the horses she said wow", in which all punctuation is removed except the apostrophes in contractions.
If anyone has suggestions to make this work, or has a better idea of how to solve this problem, I would like to know.
Overall I want to remove all punctuation from the sentence except for contractions.
Thanks
How about this?
irb:0> s = "don't frazzel the horses. 'she said wow'."
irb:0> contractionEndings = ["d", "l", "ll", "m", "re", "s", "t", "ve"]
irb:0> s.scan(/\w+(?:'(?:#{contractionEndings.join('|')}))?/)
=> ["don't", "frazzel", "the", "horses", "she", "said", "wow"]
The regex scans for one or more "word" characters, optionally followed (that is the trailing ?) by an apostrophe plus a contraction ending. You can interpolate Ruby expressions into a regex literal with #{...}, just as in double-quoted strings, so we can get our contraction endings in by joining them with the regex alternation operator |. The last thing is to mark the groups (the sections in parentheses) as non-capturing with ?: so that scan returns just the whole match for each iteration, rather than arrays of captured groups padded with nils.
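One caveat, as far as I can tell: because "l" comes before "ll" in the list and nothing follows the optional group, the pattern above would split "she'll" into "she'l" and "l". Below is a minimal sketch of one way around that, assuming you are happy to wrap the scan in a method. The names CONTRACTION_ENDINGS and contraction_words are made up for illustration; Regexp.union and the \b word-boundary anchor are standard Ruby.

```ruby
# Illustrative constant; the endings are the ones from the question.
CONTRACTION_ENDINGS = %w[d l ll m re s t ve].freeze

# Hypothetical helper: returns the words of +text+, keeping an apostrophe
# only when it introduces one of the known contraction endings.
def contraction_words(text, endings = CONTRACTION_ENDINGS)
  # Regexp.union escapes each ending and joins them with "|".
  # The \b after the alternation requires the ending to finish the word,
  # so the regex backtracks from "l" to "ll" when matching "she'll".
  pattern = /\w+(?:'(?:#{Regexp.union(endings).source})\b)?/
  text.scan(pattern)
end

words = contraction_words("don't frazzel the horses. 'she said wow'.")
puts words.join(" ")              # don't frazzel the horses she said wow
p contraction_words("she'll go")  # ["she'll", "go"]
```

Joining the result with a single space gives the normalized string the question asks for. Sorting the endings longest-first would be another way to avoid the "l"/"ll" problem.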
Or maybe you don't need the explicit list of contraction endings at all. This version also handles some other problematic constructions (thanks to Cary):
irb:0> "don't -frazzel's the jack-o'-lantern's handle, ma'am- 'she said hey-ho'.".scan(/\w[-'\w]*\w(?:'\w+)?/)
=> ["don't", "frazzel's", "the", "jack-o'-lantern's", "handle", "ma'am", "she", "said", "hey-ho"]
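Note that this second pattern needs at least two word characters per match, so one-letter words such as "I" or "a" are skipped. If that matters for your input, here is a rough sketch of a variant that makes the middle part optional. WORD and normalize are hypothetical names, and I have only checked it against the sentences shown here.

```ruby
# Hypothetical variant of the pattern above: the "middle characters plus
# final word character" part is optional, so one-letter words match too.
WORD = /\w(?:[-'\w]*\w)?(?:'\w+)?/

# Hypothetical helper: strips punctuation except apostrophes and hyphens
# inside words, and returns the words joined by single spaces.
def normalize(text)
  text.scan(WORD).join(" ")
end

puts normalize("I said 'don't frazzel a horse', ma'am.")
# I said don't frazzel a horse ma'am
```

Since \w also matches digits and underscores, those count as word characters in both versions.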