java stanford-nlp pos-tagger

Score each sentence in a line based upon a tag and summarize the text. (Java)


I'm trying to create a summarizer in Java. I'm using the Stanford Log-linear Part-Of-Speech Tagger to tag the words. Each sentence is then scored according to certain tags, and the summary prints the sentences with a high score. Here's the code:

    MaxentTagger tagger = new MaxentTagger("taggers/bidirectional-distsim-wsj-0-18.tagger");

    BufferedReader reader = new BufferedReader( new FileReader ("C:\\Summarizer\\src\\summarizer\\testing\\testingtext.txt"));
    String line  = null;
    int score = 0;
    StringBuilder stringBuilder = new StringBuilder();
    File tempFile = new File("C:\\Summarizer\\src\\summarizer\\testing\\tempFile.txt");
    Writer writerForTempFile = new BufferedWriter(new FileWriter(tempFile));


    String ls = System.getProperty("line.separator");
    while( ( line = reader.readLine() ) != null )
    {
        stringBuilder.append( line );
        stringBuilder.append( ls );
        String tagged = tagger.tagString(line);
        Pattern pattern = Pattern.compile("[.?!]"); // find end of sentence
        Matcher matcher = pattern.matcher(tagged);
        while(matcher.find())
        {
            Pattern tagFinder = Pattern.compile("/JJ"); // find adjective tag
            Matcher tagMatcher = tagFinder.matcher(matcher.group());
            while(tagMatcher.find())
            {
                score++; // increase score of sentence for every occurrence of adjective tag
            }
            if(score > 1)
                writerForTempFile.write(stringBuilder.toString());
            score = 0;
            stringBuilder.setLength(0);
        }

    }

    reader.close();
    writerForTempFile.close();

The above code isn't working. However, if I simplify it and generate a score for every line (not every sentence), it works. But summaries aren't generated that way, are they? Here's the code for that (all declarations are the same as above):

    while( ( line = reader.readLine() ) != null )
    {
        stringBuilder.append( line );
        stringBuilder.append( ls );
        String tagged = tagger.tagString(line);
        Pattern tagFinder = Pattern.compile("/JJ"); // find adjective tag
        Matcher tagMatcher = tagFinder.matcher(tagged);
        while(tagMatcher.find())
        {
            score++; // increase score of line for every occurrence of adjective tag
        }
        if(score > 1)
            writerForTempFile.write(stringBuilder.toString());
        score = 0;
        stringBuilder.setLength(0);
    }

EDIT 1:

Here is what the MaxentTagger does. Sample code showing how it works:

import java.io.IOException;

import edu.stanford.nlp.tagger.maxent.MaxentTagger;

public class TagText {
    public static void main(String[] args) throws IOException,
            ClassNotFoundException {

        // Initialize the tagger
        MaxentTagger tagger = new MaxentTagger(
                "taggers/bidirectional-distsim-wsj-0-18.tagger");

        // The sample string
        String sample = "This is a sample text";

        // The tagged string
        String tagged = tagger.tagString(sample);

        // Output the result
        System.out.println(tagged);
    }
}

Output:

This/DT is/VBZ a/DT sample/NN text/NN
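
Given output in this word/TAG shape, counting one tag can be sketched as below. This is only an illustration: the class name `TagCounter`, the method `countTag`, and the hand-written tagged string are assumptions, not real tagger output. Note that a plain `"/JJ"` pattern also matches the comparative and superlative tags `/JJR` and `/JJS`. The lookahead here counts only the exact tag, which may or may not be what a summarizer wants.

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class TagCounter {
    // Counts tokens carrying exactly the given tag in a "word/TAG word/TAG" string.
    // The lookahead requires whitespace or end of input after the tag,
    // so asking for "JJ" does not count "/JJR" or "/JJS".
    static int countTag(String tagged, String tag) {
        Pattern p = Pattern.compile("/" + Pattern.quote(tag) + "(?=\\s|$)");
        Matcher m = p.matcher(tagged);
        int count = 0;
        while (m.find()) {
            count++;
        }
        return count;
    }

    public static void main(String[] args) {
        // Hand-written example in the tagger's output format (not actual tagger output)
        String tagged = "This/DT is/VBZ a/DT short/JJ simple/JJ sample/NN";
        System.out.println(countTag(tagged, "JJ")); // prints 2
    }
}
```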

EDIT 2:

Modified code using BreakIterator to find sentence breaks. The problem persists.

    while( ( line = reader.readLine() ) != null )
    {
        stringBuilder.append( line );
        stringBuilder.append( ls );
        String tagged = tagger.tagString(line);
        BreakIterator bi = BreakIterator.getSentenceInstance();
        bi.setText(tagged);
        int end, start = bi.first();
        while ((end = bi.next()) != BreakIterator.DONE)
        {
            String sentence = tagged.substring(start, end);
            Pattern tagFinder = Pattern.compile("/JJ");
            Matcher tagMatcher = tagFinder.matcher(sentence);
            while(tagMatcher.find())
            {
                score++;
            }
            scoreTracker.add(score);
            if(score > 1)
                writerForTempFile.write(stringBuilder.toString());
            score = 0;
            stringBuilder.setLength(0);
            start = end;
        }
    }

Solution

  • Finding sentence breaks can be more involved than just looking for [.?!]. Consider using BreakIterator.getSentenceInstance().

    Its performance is actually quite similar to LingPipe's (more complex) implementation, and better than the one in OpenNLP (from my own testing, at least).

    Sample Code

    BreakIterator bi = BreakIterator.getSentenceInstance();
    bi.setText(text);
    int end, start = bi.first();
    while ((end = bi.next()) != BreakIterator.DONE) {
        String sentence = text.substring(start, end);
        start = end;
    }
    

    Edit

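    I think this is what you're looking for. Split each line into sentences first, then tag and score each sentence on its own. (In your first snippet, matcher.group() returns only the matched punctuation character, so /JJ can never be found in it.)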

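        // tagger, ls and writerForTempFile are declared as in the question
        Pattern tagFinder = Pattern.compile("/JJ"); // compile once, outside the loops
        BufferedReader reader = getMyReader(); // placeholder for however you open the input
        String line = null;
        while ((line = reader.readLine()) != null) {
            // Split the raw (untagged) line into sentences
            BreakIterator bi = BreakIterator.getSentenceInstance();
            bi.setText(line);
            int end, start = bi.first();
            while ((end = bi.next()) != BreakIterator.DONE) {
                String sentence = line.substring(start, end);
                String tagged = tagger.tagString(sentence);
                int score = 0; // starts at zero for every sentence
                Matcher tag = tagFinder.matcher(tagged);
                while (tag.find())
                    score++;
                if (score > 1)
                    // Writer has no println(); write the sentence plus a line separator
                    writerForTempFile.write(sentence + ls);
                start = end;
            }
        }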
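
The approach above can also be sketched as a self-contained class that runs without the Stanford jar. Everything named here is an assumption for illustration: the class `SentenceScorer`, its two methods, and especially the stand-in `fakeTagger`, which only appends `/JJ` to three hard-coded words. With the real tagger you would presumably pass `tagger::tagString` instead.

```java
import java.text.BreakIterator;
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;
import java.util.function.Function;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

class SentenceScorer {
    private static final Pattern ADJECTIVE = Pattern.compile("/JJ");

    // Splits raw (untagged) text into trimmed, non-empty sentences.
    static List<String> splitSentences(String text) {
        List<String> sentences = new ArrayList<>();
        BreakIterator bi = BreakIterator.getSentenceInstance(Locale.US);
        bi.setText(text);
        int start = bi.first();
        for (int end = bi.next(); end != BreakIterator.DONE; start = end, end = bi.next()) {
            String sentence = text.substring(start, end).trim();
            if (!sentence.isEmpty()) {
                sentences.add(sentence);
            }
        }
        return sentences;
    }

    // Keeps the sentences whose tagged form has more than minScore "/JJ" matches.
    static List<String> summarize(String text, Function<String, String> tagger, int minScore) {
        List<String> kept = new ArrayList<>();
        for (String sentence : splitSentences(text)) {
            Matcher m = ADJECTIVE.matcher(tagger.apply(sentence));
            int score = 0; // scored per sentence, not per line
            while (m.find()) {
                score++;
            }
            if (score > minScore) {
                kept.add(sentence);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        // Hypothetical stand-in for MaxentTagger: marks three known words as adjectives
        Function<String, String> fakeTagger =
                s -> s.replaceAll("\\b(quick|brown|lazy)\\b", "$1/JJ");
        String text = "The quick brown fox jumps. The dog sleeps.";
        System.out.println(summarize(text, fakeTagger, 1));
    }
}
```

Taking the tagger as a `Function<String, String>` keeps the splitting and scoring logic testable on its own. The threshold is a parameter so that the `score > 1` cutoff from the question is easy to adjust.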