Search code examples
phparrayssearchsimilaritylongest-substring

How to find the longest substring that occurs in every element of an array?


I have a collection of texts from some authors. Each author has a unique signature or link that occurs in all of their texts.

Example for Author1:

$texts=['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd

@jhsad.sadas.com sdsdADSA sada',
'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
@jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];

Expected output for Author1 is: @jhsad.sadas.com


Example for Author2:

$texts=['This is some random string representative of non-signature text.

This is the
*author\'s* signature.',
'Different message body text.      This is the
*author\'s* signature.

This is an afterthought that expresses that a signature is not always at the end.',
'Finally, this is unwanted stuff. This is the
*author\'s* signature.'];

Expected output for Author2 is:

This is the
 *author's* signature.

Pay particular notice to the fact there there are no reliable identifying characters (or positions) that signify the start or end of the signature. It could be a url, a Twitter mention, any kind of plain text, etc. of any length containing any sequence of characters that occurs at the start, end, or middle of the string.

I am seeking a method that will extract the longest substring that exists in all $text elements for a single author.

It is expected, for the sake of this task, that all authors WILL have a signature substring that exists in every post/text.

IDEA: I'm thinking of converting words to vectors and finding similarity between each texts. We can use cosine similarity to find the signatures. I think the solution must be some thing like this idea.

mickmackusa's commented code captures the essence of what is desired, but I would like to see if there are other ways to achieve the desired result.


Solution

  • Here is my thinking:

    1. Sort an author's collection of posts by string length (ascending) so that you are working from smaller texts to larger texts.
    2. Split each post's text on one or more white-space characters, so that you are only handling wholly non-white-space substrings during processing.
    3. Find matching substrings that occur in each subsequent post versus an ever-narrowing array of substrings (overlaps).
    4. Group the consecutive matching substrings by analyzing their index value.
    5. "Reconstitute" the grouped consecutive substrings into their original string form (trimmed of leading and trailing white-space characters, of course).
    6. Sort the reconstituted strings by string length (descending) so that the longest string is assigned the 0 index.
    7. Print to screen the substring that is assumed to be the author's signature (as a best guess) based on commonality and length.

    Code: (Demo)

    $posts['Author1'] = ['sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd
    
    @jhsad.sadas.com sdsdADSA sada',
    'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
    hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
    'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
    @jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl'];
    
    $posts['Author2'] = ['This is some random string representative of non-signature text.
    
    This is the
     *author\'s* signature.',
            'Different message body text.      This is the
     *author\'s* signature.
    
        This is an afterthought that expresses that a signature is not always at the end.',
            'Finally, this is unwanted stuff. This is the
     *author\'s* signature.'];
    
    foreach ($posts as $author => $texts) {
        echo "Author: $author\n";
        
        usort($texts, function($a, $b) {
            return strlen($a) <=> strlen($b);  // sort ASC by strlen; mb_strlen probably isn't advantageous
        });
        var_export($texts);
        echo "\n";
    
        foreach ($texts as $index => $string) {
            if (!$index) {
                $overlaps = preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY);  // declare with all non-white-space substrings from first text
            } else {
                $overlaps = array_intersect($overlaps, preg_split('/\s+/', $string, 0, PREG_SPLIT_NO_EMPTY));  // filter word bank using narrowing number of words
            }
        }
        var_export($overlaps);
        echo "\n";
        
        // batch consecutive substrings
        $group = null;
        $consecutives = [];  // clear previous iteration's data
        foreach ($overlaps as $i => $word) {
            if ($group === null || $i - $last > 1) {
                $group = $i;
            }
            $last = $i;
            $consecutives[$group][] = $word;
        }
        var_export($consecutives);
        echo "\n";
        
        foreach($consecutives as $words){
            // match potential signatures in first text for measurement:
            if (preg_match_all('/\Q' . implode('\E\s+\Q', $words) . '\E/', $texts[0], $out)) {  // make alternatives characters literal using \Q & \E
                $potential_signatures = $out[0];
            }
        }
        usort($potential_signatures, function($a,$b){
            return strlen($b) <=> strlen($a); // sort DESC by strlen; mb_strlen probably isn't advantageous
        });
        
        echo "Assumed Signature: {$potential_signatures[0]}\n\n";
    }
    

    Output:

    Author: Author1
    array (
      0 => 'sdsadsad daSDA DDASd asd aSD Sd dA  SD ASD sadasdasds sadasd
    
    @jhsad.sadas.com sdsdADSA sada',
      1 => 'jhjkfsdg fdgdf sfds hgfj j kkjjfghgkjf hdkjtkj lfdjfg hkgfl  
    @jhsad.sadas.com dsfjdshflkds kg lsfdkg;fdgl',
      2 => 'KDJKLFFD GFDGFDHGF GFHGFDHGFH GFHFGH Lklfgfd gdfsgfdsg  df gfdhgf g  
    hfghghjh jhg @jhsad.sadas.com sfgff fsdfdsf',
    )
    array (
      11 => '@jhsad.sadas.com',
    )
    array (
      11 => 
      array (
        0 => '@jhsad.sadas.com',
      ),
    )
    Assumed Signature: @jhsad.sadas.com
    
    Author: Author2
    array (
      0 => 'Finally, this is unwanted stuff. This is the
     *author\'s* signature.',
      1 => 'This is some random string representative of non-signature text.
    
    This is the
     *author\'s* signature.',
      2 => 'Different message body text.      This is the
     *author\'s* signature.
    
        This is an afterthought that expresses that a signature is not always at the end.',
    )
    array (
      2 => 'is',
      5 => 'This',
      6 => 'is',
      7 => 'the',
      8 => '*author\'s*',
      9 => 'signature.',
    )
    array (
      2 => 
      array (
        0 => 'is',
      ),
      5 => 
      array (
        0 => 'This',
        1 => 'is',
        2 => 'the',
        3 => '*author\'s*',
        4 => 'signature.',
      ),
    )
    Assumed Signature: This is the
     *author's* signature.