
Removing special characters from a string using a Perl script


I have a string like the one below:

stringinput = Sweééééôden@

I want to get output like this:

stringoutput = Sweden

The special characters ééééô and @ have to be removed.

I am using:

$stringoutput = `echo $stringinput | sed 's/[^a-z  A-Z 0-9]//g'`;

The result I get is Sweééééôden: the @ is removed, but ééééô is not.

Can you please suggest what I have to add?


Solution

  • You need to set LC_ALL=C before the sed command so that the ranges in the [A-Za-z] bracket expression follow the ASCII table:

    stringoutput=$(echo "$stringinput" | LC_ALL=C sed 's/[^A-Za-z0-9]//g')
    

    See the online demo:

    stringinput='Sweééééôden@';
    stringoutput=$(echo "$stringinput" | LC_ALL=C sed 's/[^A-Za-z0-9]//g');
    echo "$stringoutput";
    # => Sweden
    

    See POSIX regex reference:

    In the default C locale, the sorting sequence is the native character order; for example, ‘[a-d]’ is equivalent to ‘[abcd]’. In other locales, the sorting sequence is not specified, and ‘[a-d]’ might be equivalent to ‘[abcd]’ or to ‘[aBbCcDd]’, or it might fail to match any character, or the set of characters that it matches might even be erratic. To obtain the traditional interpretation of bracket expressions, you can use the ‘C’ locale by setting the LC_ALL environment variable to the value ‘C’.
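
    To make the effect of the locale setting easier to reuse, here is a minimal sketch that wraps the same sed call in a shell function. The function name strip_non_alnum is hypothetical and not part of the original answer; the sketch assumes a POSIX shell and a sed that honors LC_ALL.

```shell
#!/bin/sh
# strip_non_alnum is a hypothetical helper name used only for illustration.
strip_non_alnum() {
    # With LC_ALL=C, sed processes the input byte by byte and the ranges
    # A-Z, a-z and 0-9 cover only ASCII letters and digits, so each byte
    # of a multibyte character such as é matches the negated class.
    printf '%s\n' "$1" | LC_ALL=C sed 's/[^A-Za-z0-9]//g'
}

stringinput='Sweééééôden@'
stringoutput=$(strip_non_alnum "$stringinput")
echo "$stringoutput"
# => Sweden
```

    Setting LC_ALL=C only on the sed command, rather than exporting it, keeps the locale change from affecting the rest of the script.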

    In Perl (5.14 or later, which the /r flag requires), you could simply use:

    my $stringinput = 'Sweééééôden@';
    my $stringoutput = $stringinput =~ s/[^A-Za-z0-9]+//gr;
    print $stringoutput;
    
