java regex nlp tokenize

How to identify the end of a sentence


String x=" i am going to the party at 6.00 in the evening. are you coming with me?";

If I have the above string, I need to break it into sentences using sentence-boundary punctuation (such as . and ?).

But it should not split the sentence at 6.00, because the period there is a decimal point. Is there a way to identify the correct sentence boundary in Java? I have tried using StringTokenizer from the java.util package, but it always breaks the sentence wherever it finds a period. Can someone suggest a method to do this correctly?

This is the method I tried for tokenizing a text into sentences.

public static ArrayList<String> sentence_segmenter(String text) {
    ArrayList<String> Sentences = new ArrayList<String>();

    StringTokenizer st = new StringTokenizer(text, ".?!");
    while (st.hasMoreTokens()) {

        Sentences.add(st.nextToken());
    }
    return Sentences;
}

I also have a method to segment sentences into phrases, but here too the program splits the text whenever it finds a comma (,). I don't want it to split when the comma sits inside a number such as 60,000. The following is the method I am using to segment the phrases.

public static ArrayList<String> phrasesSegmenter(String text) {
    ArrayList<String> phrases = new ArrayList<String>();
    StringTokenizer st = new StringTokenizer(text, ",");
    while (st.hasMoreTokens()) {
        phrases.add(st.nextToken());
    }
    return phrases;
}

Solution

  • From the documentation of StringTokenizer:

    StringTokenizer is a legacy class that is retained for compatibility reasons although its use is discouraged in new code. It is recommended that anyone seeking this functionality use the split method of String or the java.util.regex package instead.

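    If you use split, you can supply any regular expression as the sentence delimiter. You probably want one of ?, ! or . followed by either whitespace or the end of the text. The period in 6.00 is followed by a digit rather than whitespace, so it is not treated as a boundary: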

    text.split("[?!.]($|\\s)")
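To make the idea concrete, here is a minimal sketch of both segmenters from the question rewritten with split. The class and method names (SegmenterSketch, splitSentences, splitPhrases) are made up for illustration, and the two patterns are variations on the one above, not the only reasonable choice: the sentence pattern uses a lookbehind so the punctuation stays attached to each sentence, and the phrase pattern is an assumption about what "a comma inside a number" means (a comma with a digit on both sides).

```java
import java.util.ArrayList;
import java.util.List;

final class SegmenterSketch {

    // Split on whitespace that directly follows '.', '?' or '!'.
    // The lookbehind (?<=...) matches without consuming the punctuation,
    // so each sentence keeps its final mark. "6.00" is untouched because
    // its period is followed by a digit, not by whitespace.
    static List<String> splitSentences(String text) {
        return nonEmptyTrimmed(text.trim().split("(?<=[.?!])\\s+"));
    }

    // Split on a comma unless it has a digit on both sides (assumed to
    // mean a thousands separator, as in 60,000).
    static List<String> splitPhrases(String text) {
        return nonEmptyTrimmed(text.split("(?<!\\d),|,(?!\\d)"));
    }

    // Helper: trim every piece and drop the empty ones.
    private static List<String> nonEmptyTrimmed(String[] parts) {
        List<String> result = new ArrayList<>();
        for (String part : parts) {
            String trimmed = part.trim();
            if (!trimmed.isEmpty()) {
                result.add(trimmed);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        String x = " i am going to the party at 6.00 in the evening. are you coming with me?";
        for (String sentence : splitSentences(x)) {
            System.out.println(sentence);
        }
        for (String phrase : splitPhrases("it costs 60,000 rupees, which is a lot, i think")) {
            System.out.println(phrase);
        }
    }
}
```

This is still a heuristic. Abbreviations such as "Dr. Smith" or "e.g. this" will be split after the period, and a list of numbers written as "1,2,3" will not be split at all, so real text may need extra rules or a dedicated sentence-detection library.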