java regex grep apostrophe

Regular Expression to strip away single quotes and preserve apostrophes


I want to parse words from a text file. Apostrophes should be preserved, but single quotes should be removed. Here is some test data:

john's apostrophe is a 'challenge'

I am experimenting with grep as follows:

grep -o "[a-z'A-Z]*" file.txt

and it produces:

john's
apostrophe
is
a
'challenge'

I need to get rid of the quotes around the word challenge.

The correct/desired output should be:

john's
apostrophe
is
a
challenge

EDIT: As the consensus seems to be that apostrophes are problematic to recognize, I am now seeking a way to strip any kind of apostrophe (leading, trailing, embedded) out of all words. The words are to be added to a vocabulary index. The phrase searching should also strip out apostrophes. This may need another question.


Solution

  • Here's a simpler grep-only approach:

    grep -E -o "[a-zA-Z]([a-z'A-Z]*[a-zA-Z])?" file.txt
    

    which in Java is:

    Pattern.compile("[a-zA-Z]([a-z'A-Z]*[a-zA-Z])?")
    

    (Both of those mean "an ASCII letter, optionally followed by any mixture of ASCII letters and apostrophes and then a final ASCII letter". The matched substring has to start with a letter and end with a letter, but if it is at least three characters long, it can contain apostrophes in between.)

    To accept non-ASCII letters, the Java could be written as:

    Pattern.compile("\\p{L}([\\p{L}']*\\p{L})?")
    

    Edit for updated question (stripping out apostrophes): I don't think you can do that with just grep; but expanding our repertoire a bit, you can write:

    tr -d "'" < file.txt | grep -E -o "[a-zA-Z]+"
    

    or in Java:

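    // requires java.util.regex.Matcher and java.util.regex.Pattern
    String apostrippedStr = str.replace("'", "");

    // or "\\p{L}+" for non-ASCII support
    Matcher matcher = Pattern.compile("[a-zA-Z]+").matcher(apostrippedStr);
    while (matcher.find()) {
        String word = matcher.group(); // e.g. add word to the vocabulary index
    }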
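
To see both variants side by side, here is a minimal, self-contained sketch that applies the patterns from this answer to the test line from the question. The class name `WordExtractor` and its method names are hypothetical, chosen only for illustration, and the sketch assumes the Unicode-aware `\\p{L}` form of the patterns rather than the ASCII-only one.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper class; the names are illustrative, not part of the answer above.
class WordExtractor {

    // A letter, optionally followed by letters/apostrophes and a final letter,
    // so apostrophes survive only in the interior of a word.
    private static final Pattern WORD_KEEPING_APOSTROPHES =
            Pattern.compile("\\p{L}([\\p{L}']*\\p{L})?");

    // One or more letters; applied after every apostrophe has been deleted.
    private static final Pattern LETTERS_ONLY = Pattern.compile("\\p{L}+");

    // Original question: drop surrounding quotes, keep embedded apostrophes.
    static List<String> wordsKeepingApostrophes(String text) {
        return findAll(WORD_KEEPING_APOSTROPHES, text);
    }

    // Updated question: strip leading, trailing and embedded apostrophes.
    static List<String> wordsWithoutApostrophes(String text) {
        return findAll(LETTERS_ONLY, text.replace("'", ""));
    }

    // Collects every non-overlapping match of the pattern, left to right.
    private static List<String> findAll(Pattern pattern, String text) {
        List<String> words = new ArrayList<>();
        Matcher matcher = pattern.matcher(text);
        while (matcher.find()) {
            words.add(matcher.group());
        }
        return words;
    }

    public static void main(String[] args) {
        String line = "john's apostrophe is a 'challenge'";
        System.out.println(wordsKeepingApostrophes(line)); // [john's, apostrophe, is, a, challenge]
        System.out.println(wordsWithoutApostrophes(line)); // [johns, apostrophe, is, a, challenge]
    }
}
```

Note that the first variant also drops a trailing possessive apostrophe (as in `dogs'`), because a match has to end with a letter. Whether that is acceptable depends on how the vocabulary index is meant to treat such words.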