I want to parse words from a text file. Apostrophes should be preserved, but single quotes should be removed. Here is some test data:
john's apostrophe is a 'challenge'
I am experimenting with grep as follows:
grep -o "[a-z'A-Z]*" file.txt
and it produces:
john's
apostrophe
is
a
'challenge'
I need to get rid of those quotes around the word challenge.
The correct/desired output should be:
john's
apostrophe
is
a
challenge
EDIT: As the consensus seems to be that apostrophes are problematic to recognize, I am now seeking a way to strip any kind of apostrophe (leading, trailing, embedded) out of all words. The words are to be added to a vocabulary index. The phrase searching should also strip out apostrophes. This may need another question.
Here's a simpler grep-only approach:
grep -E -o "[a-zA-Z]([a-z'A-Z]*[a-zA-Z])?" file.txt
which in Java is:
Pattern.compile("[a-zA-Z]([a-z'A-Z]*[a-zA-Z])?")
(Both of those mean "an ASCII letter, optionally followed by any mixture of ASCII letters and apostrophes that ends in an ASCII letter". The idea is that the matched substring has to start with a letter and end with a letter, so a leading or trailing quote is never part of the match; if the match is three or more characters long, it can contain apostrophes in between.)
To accept non-ASCII letters, the Java could be written as:
Pattern.compile("\\p{L}([\\p{L}']*\\p{L})?")
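As a rough illustration of how that pattern behaves on the test data from the question, here is a minimal sketch. The class name WordExtractor and the method extractWords are made up for this example and are not part of any library.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class WordExtractor {
    // A letter, optionally followed by letters/apostrophes that end in a letter.
    private static final Pattern WORD =
            Pattern.compile("\\p{L}([\\p{L}']*\\p{L})?");

    // Collects every non-overlapping match of WORD, in order of appearance.
    static List<String> extractWords(String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = WORD.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        // prints [john's, apostrophe, is, a, challenge]
        System.out.println(extractWords("john's apostrophe is a 'challenge'"));
    }
}
```

Because a match must begin and end with a letter, the quotes around 'challenge' fall outside the match, while the apostrophe inside john's stays.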
Edit for updated question (stripping out apostrophes): I don't think you can do that with just grep; but expanding our repertoire a bit, you can write:
tr -d "'" < file.txt | grep -E -o "[a-zA-Z]+"
or in Java:
String apostrippedStr = str.replace("'", "");
Pattern wordPattern = Pattern.compile("[a-zA-Z]+"); // or "\\p{L}+" for non-ASCII support
// ... apply wordPattern to apostrippedStr
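Put together, the strip-then-match idea might look like the sketch below. The names ApostropheStripper and indexTerms are hypothetical, and the code assumes only the ASCII apostrophe (') needs removing; typographic apostrophes such as U+2019 would need an additional replace call.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class, named here only for illustration.
class ApostropheStripper {
    // One or more letters; \p{L} also accepts non-ASCII letters.
    private static final Pattern LETTERS = Pattern.compile("\\p{L}+");

    // Removes ASCII apostrophes first, then collects runs of letters.
    static List<String> indexTerms(String text) {
        String stripped = text.replace("'", "");
        List<String> terms = new ArrayList<>();
        Matcher matcher = LETTERS.matcher(stripped);
        while (matcher.find()) {
            terms.add(matcher.group());
        }
        return terms;
    }

    public static void main(String[] args) {
        // prints [johns, apostrophe, is, a, challenge]
        System.out.println(indexTerms("john's apostrophe is a 'challenge'"));
    }
}
```

Running search phrases through the same method before looking them up would keep the index and the queries consistent, so that john's and johns resolve to the same term.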