Search code examples
phpregexperformancestring-matchingsuffix-tree

Match and replace emoticons in string - what is the most efficient way?


Wikipedia defines a lot of possible emoticons people can use. I want to match this list to words in a string. I now have this:

$string = "Lorem ipsum :-) dolor :-| samet";
$emoticons = array(
  '[HAPPY]' => array(' :-) ', ' :) ', ' :o) '), //etc...
  '[SAD]'   => array(' :-( ', ' :( ', ' :-| ')
);
foreach ($emoticons as $emotion => $icons) {
  $string = str_replace($icons, " $emotion ", $string);
}
echo $string;

Output:

Lorem ipsum [HAPPY] dolor [SAD] samet

so in principle this works. However, I have two questions:

  1. As you can see, I'm putting spaces around each emoticon in the array, such as ' :-) ' instead of ':-)' This makes the array less readable in my opinion. Is there a way to store emoticons without the spaces, but still match against $string with spaces around them? (and as efficiently as the code is now?)

  2. Or is there perhaps a way to put the emoticons in one variable, and explode on space to check against $string? Something like

    $emoticons = array( '[HAPPY]' => ">:] :-) :) :o) :] :3 :c) :> =] 8) =) :} :^)", '[SAD]' => ":'-( :'( :'-) :')" //etc...

  3. Is str_replace the most efficient way of doing this?

I'm asking because I need to check millions of strings, so I'm looking for the most efficient way to save processing time :)


Solution

  • If the $string, in which you want replace emoticons, is provided by a visitor of your site(I mean it's a user's input like comment or something), then you should not relay that there will be a space before or after the emoticon. Also there are at least couple of emoticons, that are very similar but different, like :-) and :-)). So I think that you will achieve better result if you define your emoticon's array like this:

    $emoticons = array(
        ':-)' => '[HAPPY]',
        ':)' => '[HAPPY]',
        ':o)' => '[HAPPY]',
        ':-(' => '[SAD]',
        ':(' => '[SAD]',
        ...
    )
    

    And when you fill all find/replace definitions, you should reorder this array in a way, that there will be no chance to replace :-)) with :-). I believe if you sort array values by length will be enough. This is in case your are going to use str_replace(). strtr() will do this sort by length automatically!

    If you are concerned about performance, you can check strtr vs str_replace, but I will suggest to make your own testing (you may get different result regarding your $string length and find/replace definitions).

    The easiest way will be if your "find definitions" doesn't contain trailing spaces:

    $string = strtr( $string, $emoticons );
    $emoticons = str_replace( '][', '', trim( join( array_unique( $emoticons ) ), '[]' ) );
    $string = preg_replace( '/\s*\[(' . join( '|', $emoticons ) . ')\]\s*/', '[$1]', $string ); // striping white spaces around word-styled emoticons