
How "Expensive" is Perl's Encode::Detect::Detector


I've been having problems with "gremlins" from different encodings getting mixed into form input and database data within a Perl program. At first I wasn't decoding at all, and smart quotes and similar characters would show up as several gibberish characters; but blindly decoding everything as UTF-8 caused older Windows-1252 content to be filled with question marks.

So, I've used Encode::Detect::Detector and the decode() function to detect and decode all POST and GET input, along with data from a SQL database (the decoding process probably occurs on 10-20 strings of text each time a page is generated now). This seems to clean things up so UTF-8, ASCII and Windows-1252 content all display properly as UTF-8 output (as I've designated in the HTML headers):

    use Encode qw(decode);
    use Encode::Detect::Detector;
    my $encoding_name = Encode::Detect::Detector::detect($value);
    eval { $value = decode($encoding_name, $value) };

My question is this: how resource-heavy is this process? I haven't noticed a slowdown, so I think I'm happy with how this works, but if there's a more efficient way of doing this, I'd be happy to hear it.
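
For a rough, isolated number, the per-string cost of detection can be compared against a plain UTF-8 decode using Perl's core Benchmark module. This is only a sketch: the sample strings below are made-up placeholders, and real form or database values may behave differently.

    use strict;
    use warnings;
    use Benchmark qw(cmpthese);
    use Encode qw(decode);
    use Encode::Detect::Detector;

    # Made-up sample values standing in for form/database strings.
    my @samples = (
        "plain ASCII text",
        "UTF-8 caf\x{c3}\x{a9} text",
        "Windows-1252 smart \x{93}quotes\x{94}",
    );

    cmpthese(-2, {
        # Detect the encoding first, then decode (the approach above);
        # if detection fails, the eval leaves the value untouched.
        detect_then_decode => sub {
            for my $value (@samples) {
                my $enc = Encode::Detect::Detector::detect($value);
                my $decoded = eval { decode($enc, $value) };
            }
        },
        # Baseline: assume UTF-8 and decode directly.
        decode_utf8_only => sub {
            for my $value (@samples) {
                my $decoded = decode('UTF-8', $value);
            }
        },
    });

Multiplying the per-call rate by the 10-20 strings processed per page gives a feel for the per-request cost.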


Solution

  • The answer is highly application-dependent, so whether the 'expense' is acceptable is your call.

    The best way to quantify the overhead is to profile your code. You may want to give Devel::NYTProf a spin (a minimal invocation is sketched below).

    Tim Bunce's YAPC::EU presentation provides more details about the module.
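
    A minimal profiling run, assuming the page is generated by a standalone script (the script name here is just a placeholder):

        # Run the script under Devel::NYTProf (writes ./nytprof.out)
        perl -d:NYTProf generate_page.pl

        # Convert the raw output into an HTML report in ./nytprof/
        nytprofhtml

    For CGI or mod_perl setups, Devel::NYTProf can also be enabled without code changes via the PERL5OPT and NYTPROF environment variables; see the module's documentation for the details.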