php, regex, repeat, words

Finding repeated words in PHP without specifying the word itself


I've been thinking about something for a project I want to do. I'm not an advanced user and I'm just learning, so I don't know if this is possible:

Suppose we have 100 html documents containing many tables and text inside them.

Question one: is it possible to analyze all this text, find the repeated words, and count them?

Yes, it's possible with some functions, but here's the problem: what if we don't know in advance which words we are going to find? That is, we would have to tell the code what counts as a word.

Suppose, for example, that a word is a run of seven characters. The idea would be to find other matching patterns and report them. What would be the best way to do this?

Thank you very much in advance.

Example:

Search: five-character patterns in the following phrases:

Text one:

"It takes an ocean not to break"

Text two:

"An ocean is a body of saline water"

Result

takes 1
break 1
water 1
ocean 2

Thanks in advance for your help.


Solution

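  • function get_word_counts($phrases) {
        $counts = array();
        foreach ($phrases as $phrase) {
            $words = explode(' ', $phrase);
            foreach ($words as $word) {
                // Strip punctuation: keep only letters and hyphens.
                $word = preg_replace("#[^a-zA-Z\-]#", "", $word);
                if ($word === "") {
                    continue; // the token was empty or punctuation only
                }
                // Start unseen words at 1; "+=" on a missing key triggers
                // an undefined-index notice (a warning in PHP 8).
                $counts[$word] = isset($counts[$word]) ? $counts[$word] + 1 : 1;
            }
        }
        return $counts;
    }
    
    $phrases = array("It takes an ocean of water not to break!", "An ocean is a body of saline water, or so I am told.");
    
    $counts = get_word_counts($phrases);
    arsort($counts); // highest counts first, keys preserved
    print_r($counts);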
    

    OUTPUT (entries with equal counts may appear in a different order depending on the PHP version)

    Array
    (
        [of] => 2
        [ocean] => 2
        [water] => 2
        [or] => 1
        [saline] => 1
        [body] => 1
        [so] => 1
        [I] => 1
        [told] => 1
        [a] => 1
        [am] => 1
        [An] => 1
        [an] => 1
        [takes] => 1
        [not] => 1
        [to] => 1
        [It] => 1
        [break] => 1
        [is] => 1
    )
    

    EDIT
    Updated to deal with basic punctuation, based on @Jack's comment.
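
The question also mentions HTML documents and words of a fixed length, which the function above does not cover. Here is a minimal sketch of one way to handle both, not a definitive implementation. The helper names `html_to_text` and `count_words_of_length` are made up for illustration, and the sketch assumes PHP 7 or later, that a "word" means a run of ASCII letters, and that upper and lower case should be counted together. The tag removal is a crude regex, so it would keep the contents of `<script>` and `<style>` elements; a real HTML parser would be safer for messy documents.

```php
<?php
// Hypothetical helper: reduces an HTML string to plain text.
function html_to_text(string $html): string {
    // Put a space where each tag was, so adjacent table cells stay separate words.
    $text = preg_replace('/<[^>]*>/', ' ', $html);
    // Turn entities such as &amp; back into characters.
    $text = html_entity_decode($text, ENT_QUOTES, 'UTF-8');
    // Collapse runs of whitespace into single spaces.
    return trim(preg_replace('/\s+/', ' ', $text));
}

// Hypothetical helper: counts words that are exactly $length letters long.
function count_words_of_length(array $texts, int $length): array {
    $counts = array();
    // \b is a word boundary, so a longer word is never matched in part.
    $pattern = '/\b[a-zA-Z]{' . $length . '}\b/';
    foreach ($texts as $text) {
        preg_match_all($pattern, $text, $matches);
        foreach ($matches[0] as $word) {
            $word = strtolower($word); // count "Ocean" and "ocean" together
            $counts[$word] = ($counts[$word] ?? 0) + 1;
        }
    }
    arsort($counts); // highest counts first, keys preserved
    return $counts;
}

$texts = array(
    html_to_text('<p>It takes an ocean not to break</p>'),
    html_to_text('<table><tr><td>An ocean</td><td>is a body of saline water</td></tr></table>'),
);
print_r(count_words_of_length($texts, 5));
```

With the two phrases from the question this counts "ocean" twice and "takes", "break" and "water" once each. For the 100 documents, each file could be read with `file_get_contents()` in a loop over something like `glob('*.html')` (the path pattern is an assumption) and passed through `html_to_text` before counting.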