regex perl nlp text-segmentation

Split multi-paragraph documents into paragraph-numbered sentences


I have a list of well-parsed, multi-paragraph documents (paragraphs are separated by \n\n and sentences by ".") that I'd like to split into sentences, each prefixed with the number of the paragraph it belongs to within the document. For example, given this two-paragraph input:

First sentence of the 1st paragraph. Second sentence of the 1st paragraph. \n\n 

First sentence of the 2nd paragraph. Second sentence of the 2nd paragraph. \n\n

Ideally the output should be:

1 First sentence of the 1st paragraph. 

1 Second sentence of the 1st paragraph. 

2 First sentence of the 2nd paragraph.

2 Second sentence of the 2nd paragraph.

I'm familiar with the Lingua::Sentence package in Perl, which can split documents into sentences. However, it does not number paragraphs. So I'm wondering whether there's an alternative way to achieve the above (the documents contain no abbreviations). Any help is greatly appreciated. Thanks!


Solution

  • Since you mentioned Lingua::Sentence, one option is to post-process the module's output a little to get what you need: split it into paragraphs on blank lines, then prefix each sentence with its paragraph number.

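    use Lingua::Sentence;

    my $text     = do { local $/; <STDIN> };     # whole document, read from standard input
    my $splitter = Lingua::Sentence->new("en");  # English sentence splitter

    # split() returns the text with one sentence per line; this relies on the
    # blank lines between paragraphs surviving in its output.
    my @paragraphs = split /\n{2,}/, $splitter->split($text);

    foreach my $index (0..$#paragraphs) {
        # Prefix every sentence with its 1-based paragraph number.
        my $paragraph = join "\n\n", map { ($index + 1) . " $_" }
            split /\n/, $paragraphs[$index];
        print "$paragraph\n\n";
    }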
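
    Since the question says the documents contain no abbreviations, a plain-regex version without any module may also be enough. The following is only a minimal sketch under that assumption: it treats every "." followed by whitespace as a sentence end, and number_sentences is a hypothetical helper name, not part of Lingua::Sentence or any other library.

```perl
use strict;
use warnings;

# Hypothetical helper: returns one "N sentence" string per sentence,
# where N is the 1-based number of the paragraph the sentence came from.
sub number_sentences {
    my ($text) = @_;
    my @numbered;
    my $para_no = 0;

    # Paragraphs are separated by one or more blank lines.
    for my $paragraph (split /\n\s*\n/, $text) {
        next unless $paragraph =~ /\S/;    # skip empty chunks
        $para_no++;

        # Assumption: a period followed by whitespace always ends a sentence.
        for my $sentence (split /(?<=\.)\s+/, $paragraph) {
            $sentence =~ s/^\s+|\s+$//g;   # trim surrounding whitespace
            next unless length $sentence;
            push @numbered, "$para_no $sentence";
        }
    }
    return @numbered;
}

my $text = "First sentence of the 1st paragraph. Second sentence of the 1st paragraph. \n\n"
         . "First sentence of the 2nd paragraph. Second sentence of the 2nd paragraph. \n\n";

print "$_\n\n" for number_sentences($text);
```

    Splitting into paragraphs first keeps the numbering independent of how a sentence splitter treats blank lines. The trade-off is that the regex would break on text such as "e.g. this", so it only fits input like the one described in the question.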