Tags: php, html, regex, split, sentence

Splitting sentences into paragraphs based on word count


I want to split a block of text into paragraphs, where each paragraph holds a limited number of words. For example:

Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. 

Paragraph 1: 
Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.

Paragraph 2: 
Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source. 

In the above example, the first 20 words, rounded out to the end of their sentence, form Paragraph 1, and the rest form Paragraph 2.

Is there any way to achieve this using PHP?

I have tried $abc = explode(' ', $str, 20); which puts the leading words into separate array elements and the rest of the string into the last element. How can I take the first 20 words as the first paragraph and the rest as the second paragraph?
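To make the behaviour of the limit argument concrete, here is a minimal sketch. It assumes words are separated by single spaces, and it uses a short made-up string and a hypothetical threshold of 3 so the result is easy to read. With a positive limit, explode() returns at most that many elements and the last one holds the whole remainder, so splitting fully and slicing is one way to get an exact number of words.

```php
<?php
// Assumption: words are separated by single spaces, as in the question.
$str = 'one two three four five six seven';
$limit = 3; // hypothetical small threshold to keep the output readable

// explode() with a positive limit returns at most $limit elements;
// the last element holds the whole remainder of the string.
$parts = explode(' ', $str, $limit);
// $parts is ['one', 'two', 'three four five six seven']

// To get exactly $limit words first, split fully and slice instead.
$words = explode(' ', $str);
$first = implode(' ', array_slice($words, 0, $limit));
$rest  = implode(' ', array_slice($words, $limit));

echo $first, "\n"; // one two three
echo $rest, "\n";  // four five six seven
```

Note that this cuts at a word boundary and can land in the middle of a sentence, so it does not by itself produce paragraphs that end on a full sentence.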


Solution

  • First split the string into sentences. Then loop over the sentences array: append each sentence to the current element of a paragraphs array, count the words in that element, and if the count is greater than 19, increment the paragraph counter.

    $string = 'Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old. Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source.';
    
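    // Split after sentence-ending punctuation that is followed by whitespace and an uppercase letter.
    $sentences = preg_split('/(?<=[.?!;])\s+(?=\p{Lu})/', $string);

    $ii = 0;
    $paragraphs = array();
    foreach ( $sentences as $value ) {
        // Append the sentence to the current paragraph, separated by a space.
        if ( isset($paragraphs[$ii]) ) { $paragraphs[$ii] .= ' ' . $value; }
        else { $paragraphs[$ii] = $value; }
        // Start a new paragraph once the current one has 20 or more words.
        // Note: str_word_count() does not count purely numeric tokens such as "45".
        if ( 19 < str_word_count($paragraphs[$ii]) ) {
            $ii++;
        }
    }
    print_r($paragraphs);


    Output:

    Array
    (
        [0] => Contrary to popular belief, Lorem Ipsum is not simply random text. It has roots in a piece of classical Latin literature from 45 BC, making it over 2000 years old.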
        [1] => Richard McClintock, a Latin professor at Hampden-Sydney College in Virginia, looked up one of the more obscure Latin words, consectetur, from a Lorem Ipsum passage, and going through the cites of the word in classical literature, discovered the undoubtable source.
    )
    

    The sentence-splitting regex comes from: Splitting paragraphs into sentences with regexp and PHP
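If the threshold needs to vary, the same idea can be wrapped in a function. The following is only a sketch: splitIntoParagraphs and its $minWords parameter are hypothetical names, not part of the original answer. It also counts whitespace-separated tokens instead of calling str_word_count(), on the assumption that numbers such as "45" should count as words.

```php
<?php
/**
 * Hypothetical helper (not from the original answer): groups sentences into
 * paragraphs, closing a paragraph once it holds at least $minWords words.
 */
function splitIntoParagraphs(string $text, int $minWords = 20): array
{
    // Same sentence-splitting regex as in the answer above.
    $sentences = preg_split('/(?<=[.?!;])\s+(?=\p{Lu})/', trim($text));

    $paragraphs = [];
    $current = [];   // sentences of the paragraph being built
    $wordCount = 0;  // whitespace-separated tokens in $current

    foreach ($sentences as $sentence) {
        $current[] = $sentence;
        // Count whitespace-separated tokens so that numbers count as words.
        $wordCount += count(preg_split('/\s+/', $sentence, -1, PREG_SPLIT_NO_EMPTY));

        if ($wordCount >= $minWords) {
            $paragraphs[] = implode(' ', $current);
            $current = [];
            $wordCount = 0;
        }
    }

    // Keep any trailing sentences that never reached the threshold.
    if ($current !== []) {
        $paragraphs[] = implode(' ', $current);
    }

    return $paragraphs;
}

// Small made-up input with a threshold of 5 words.
$text = 'One two three. Four five six seven. Eight nine. Ten.';
$result = splitIntoParagraphs($text, 5);
print_r($result);
```

With this input, the first two sentences should end up in one paragraph and the last two in another. Trailing sentences that never reach the threshold are kept as a final, shorter paragraph rather than dropped.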