
Twitter Trending Topics: Combine different spellings


Twitter's Trending Topics often consist of more than one word, and such compound terms are often spelled in different ways, e.g.:

"Half Blood Prince" / "Half-Blood Prince"

To find all updates mentioning a Trending Topic, you need every spelling variant. Twitter handles it like this:

[Image: Twitter's Trending Topics Admin]

You have the topic name on the left and its different spellings on the right. Do you think this is done manually or automatically? Is it possible to do this automatically? If yes: How?

I hope you can help me. Thanks in advance!


Solution

  • I'll try to answer my own question based on Broken Link's comment (thank you for this):


    You've extracted phrases consisting of 1 to 3 words from your database of documents. Among these extracted phrases are the following:

    • Half Blood Prince
    • Half-Blood Prince
    • Halfblood Prince
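
The extraction step itself could look roughly like the following. This is only a sketch: the function name `extractPhrases` and the choice to split on whitespace are assumptions for illustration, not something given in the question.

```php
<?php
// Hypothetical helper: returns every run of 1 to $maxWords consecutive words.
function extractPhrases(string $text, int $maxWords = 3): array
{
    // Split on whitespace only, so "Half-Blood" stays a single token.
    $words = preg_split('/\s+/', trim($text), -1, PREG_SPLIT_NO_EMPTY);
    $count = count($words);
    $phrases = [];
    for ($start = 0; $start < $count; $start++) {
        for ($len = 1; $len <= $maxWords && $start + $len <= $count; $len++) {
            $phrases[] = implode(' ', array_slice($words, $start, $len));
        }
    }
    return $phrases;
}

$phrases = extractPhrases('Watching Half-Blood Prince tonight');
```

For this sample update, `$phrases` contains "Half-Blood Prince" along with the other one-, two- and three-word runs.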

    For each phrase, you strip all special characters and blank spaces and make the string lowercase:

    $phrase = 'Half Blood Prince';
    $phrase = preg_replace('/[^a-z]/i', '', $phrase);
    $phrase = strtolower($phrase);
    // result is "halfbloodprince"

    When you've done this, all 3 phrases (see above) have one spelling in common:

    • Half Blood Prince => halfbloodprince
    • Half-Blood Prince => halfbloodprince
    • Halfblood Prince => halfbloodprince

    So "halfbloodprince" is the parent phrase. You insert both the original phrase and its parent phrase into your database.
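
To make the grouping concrete, here is a small in-memory sketch. The function names `parentPhrase` and `groupByParent` are made up for this example; the normalization is the same regex and `strtolower` call as above.

```php
<?php
// Hypothetical helper: same normalization as above (keep ASCII letters, lowercase).
function parentPhrase(string $phrase): string
{
    return strtolower(preg_replace('/[^a-z]/i', '', $phrase));
}

// Hypothetical helper: collects all spellings that share a parent phrase.
function groupByParent(array $phrases): array
{
    $groups = [];
    foreach ($phrases as $phrase) {
        $groups[parentPhrase($phrase)][] = $phrase;
    }
    return $groups;
}

$groups = groupByParent([
    'Half Blood Prince',
    'Half-Blood Prince',
    'Halfblood Prince',
    'Harry Potter',
]);
```

All three "Half Blood Prince" spellings end up under the key "halfbloodprince", while "Harry Potter" gets its own key. Note that the pattern keeps only ASCII letters, so digits and accented characters are dropped as well.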

    To show a "Trending Topics Admin" like Twitter's you do the following:

    // first select the top 10 parent phrases
    // (note: the mysql_* functions were removed in PHP 7.0; use mysqli or PDO in current code)
    $topSql = "SELECT parentPhrase, COUNT(*) AS cnt FROM phrases GROUP BY parentPhrase ORDER BY cnt DESC LIMIT 0, 10";
    $topResult = mysql_query($topSql);
    while ($topRow = mysql_fetch_assoc($topResult)) {
        $parentPhrase = $topRow['parentPhrase'];
        $childPhrases = array(); // set up an array for the child phrases
        $fifthPart = round($topRow['cnt'] * 0.2); // 20% of the parent phrase's total count
        // now select all child phrases which make up 20% of the parent phrase or more
        // ($parentPhrase was normalized to letters only before it was stored, so it contains no quotes)
        $childSql = "SELECT phrase FROM phrases WHERE parentPhrase = '".$parentPhrase."' GROUP BY phrase HAVING COUNT(*) >= ".$fifthPart;
        $childResult = mysql_query($childSql);
        while ($childRow = mysql_fetch_assoc($childResult)) {
            $childPhrases[] = $childRow['phrase']; // read from the inner result row, not the outer one
        }
        // now you have the parent phrase which is on the left side of the arrow in $parentPhrase
        // and all child phrases which are on the right side of the arrow in $childPhrases
    }
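
The same selection logic can be tried without a database. The following is a sketch under assumptions: `trendingVariants` and the sample rows are hypothetical, and each row stands for one stored mention as a `[parentPhrase, phrase]` pair. It mirrors the two queries above, including the `round()` on the 20% threshold.

```php
<?php
// Hypothetical in-memory version of the two queries above.
function trendingVariants(array $rows, int $top = 10, float $minShare = 0.2): array
{
    $totals = [];   // parentPhrase => total mentions
    $variants = []; // parentPhrase => [phrase => mentions]
    foreach ($rows as [$parent, $phrase]) {
        $totals[$parent] = ($totals[$parent] ?? 0) + 1;
        $variants[$parent][$phrase] = ($variants[$parent][$phrase] ?? 0) + 1;
    }
    arsort($totals); // most-mentioned parent phrases first

    $result = [];
    foreach (array_slice($totals, 0, $top, true) as $parent => $total) {
        $threshold = round($total * $minShare);
        // keep only the spellings that reach the threshold
        $result[$parent] = array_keys(array_filter(
            $variants[$parent],
            fn (int $n): bool => $n >= $threshold
        ));
    }
    return $result;
}

// Made-up sample data: 10 mentions of one topic, 2 of another.
$rows = array_merge(
    array_fill(0, 6, ['halfbloodprince', 'Half-Blood Prince']),
    array_fill(0, 3, ['halfbloodprince', 'Half Blood Prince']),
    [['halfbloodprince', 'Halfblood Prince']],
    array_fill(0, 2, ['harrypotter', 'Harry Potter'])
);
$admin = trendingVariants($rows);
```

With this sample data, "Halfblood Prince" has only 1 of 10 mentions and falls below the threshold of 2, so only the other two spellings are listed for "halfbloodprince".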
    

    Is this what you had in mind, Broken Link? Would this work?