java regex nlp punctuation languagetool

Sentence formation: Punctuation checks in Java


I want to check the quality of sentence formation. Specifically, I want to see whether the end user types a space after each punctuation mark. I am okay with an NLP library or a simple Java regex solution.

For example:

  1. "Hi, my name is Tom Cruise. I like movies"
  2. "Hi,my name is Tom Cruise. I like movies"
  3. "Hi,my name is Tom Cruise.I like movies"

Sentence 1 is perfect. Sentence 2 is bad, since it has one punctuation mark without a space after it. Sentence 3 is the worst, since none of its punctuation marks are followed by a space.

Can you please suggest a Java approach to this? I tried the LanguageTool API, but it didn't work.


Solution

  • Why don't you try Patterns and Unicode categories?

    For instance:

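    import java.util.regex.Matcher;
    import java.util.regex.Pattern;

    // \p{P} matches any Unicode punctuation character; the trailing space is literal
    Pattern pattern = Pattern.compile("\\p{P} ");
    Matcher matcher = pattern.matcher("Hi, my name is Tom Cruise. I like movies");
    while (matcher.find()) {
        System.out.println(matcher.group());
    }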
    

    The Pattern here searches for any punctuation followed by a space. The output will be:

    , 
    . 
    

    (notice the space after the comma and the dot)

    You could probably refine your Pattern by specifying which exact punctuation characters are eligible to be followed by a space.

    Finally, to check for the opposite (a punctuation character immediately followed by a non-whitespace character; punctuation at the very end of the text is not matched):

    Pattern otherPattern = Pattern.compile("\\p{P}\\S");
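
Putting the two suggestions together (a restricted set of punctuation characters, and a check for the missing space), one possible sketch is shown below. The class name `PunctuationSpacingCheck`, the method `countMissingSpaces`, and the choice of `, . ; : ! ?` as the characters that need a following space are assumptions made for illustration; they do not come from the question or the answer. A lookahead is used instead of `\\S` so the following character is not consumed by the match.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class PunctuationSpacingCheck {

    // Assumption: only these marks are expected to be followed by whitespace.
    // The lookahead (?=\S) requires a non-whitespace character right after the
    // mark without consuming it, so punctuation at the end of the text is not flagged.
    private static final Pattern MISSING_SPACE = Pattern.compile("[,.;:!?](?=\\S)");

    // Counts punctuation marks that are immediately followed by a non-whitespace character.
    public static int countMissingSpaces(String text) {
        Matcher matcher = MISSING_SPACE.matcher(text);
        int count = 0;
        while (matcher.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        String[] samples = {
            "Hi, my name is Tom Cruise. I like movies",
            "Hi,my name is Tom Cruise. I like movies",
            "Hi,my name is Tom Cruise.I like movies"
        };
        for (String sample : samples) {
            System.out.println(countMissingSpaces(sample) + " missing space(s): " + sample);
        }
    }
}
```

For the three sentences from the question, this reports 0, 1 and 2 missing spaces respectively, which matches the ranking described there. Note that a pattern this simple will also flag decimals such as "3.14", abbreviations such as "e.g." and ellipses, so the character set or the lookahead would need further refinement for real text.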