
Replace Unicode numeral subscript or superscript with plain numeral


How do I replace a Unicode numeral subscript or superscript (e.g., ²) with the corresponding plain numeral (i.e., 2) using regular expressions? I can of course replace each of them separately, but that is ten lines of code...

I am implementing this in Perl but that should not really matter.


Solution

  • Here is a Perl function from the unisupers script that converts plain characters to Unicode superscripts:

    use utf8;  # the source contains literal non-ASCII characters

    sub convert_to_superscripts (_) {
       my $string = $_[0];
       $string =~ tr[+−=()0123456789AaÆᴂɐɑɒBbcɕDdðEeƎəɛɜɜfGgɡɣhHɦIiɪɨᵻɩjJʝɟKklLʟᶅɭMmɱNnɴɲɳŋOoɔᴖᴗɵȢPpɸrRɹɻʁsʂʃTtƫUuᴜᴝʉɥɯɰʊvVʋʌwWxyzʐʑʒꝯᴥβγδθφχнნʕⵡ]
                    [⁺⁻⁼⁽⁾⁰¹²³⁴⁵⁶⁷⁸⁹ᴬᵃᴭᵆᵄᵅᶛᴮᵇᶜᶝᴰᵈᶞᴱᵉᴲᵊᵋᶟᵌᶠᴳᵍᶢˠʰᴴʱᴵⁱᶦᶤᶧᶥʲᴶᶨᶡᴷᵏˡᴸᶫᶪᶩᴹᵐᶬᴺⁿᶰᶮᶯᵑᴼᵒᵓᵔᵕᶱᴽᴾᵖᶲʳᴿʴʵʶˢᶳᶴᵀᵗᶵᵁᵘᶸᵙᶶᶣᵚᶭᶷᵛⱽᶹᶺʷᵂˣʸᶻᶼᶽᶾꝰᵜᵝᵞᵟᶿᵠᵡᵸჼˤⵯ];
       return $string;
    }
    

    And here is the corresponding function from the unisubs script for subscripts:

    sub convert_to_subscripts (_) {
       my $string = $_[0];
       $string =~ tr[+−=()0123456789aeəhijklmnoprstuvxβγρφχ]
                    [₊₋₌₍₎₀₁₂₃₄₅₆₇₈₉ₐₑₔₕᵢⱼₖₗₘₙₒₚᵣₛₜᵤᵥₓᵦᵧᵨᵩᵪ];
       return $string;
    }
    

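    You just have to go the other way: swap the two tr lists, so that the superscript or subscript characters form the search list and the plain characters form the replacement list.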
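
    For just the ten digits, going the other way might look like the sketch below. The name plain_digits is hypothetical (it does not come from unisupers or unisubs), and the sketch assumes the source file is saved as UTF-8.

    ```perl
    use strict;
    use warnings;
    use utf8;   # the tr lists below are literal UTF-8 characters

    # Hypothetical helper: map superscript and subscript digits to ASCII digits.
    sub plain_digits {
        my ($string) = @_;
        # Search list: ten superscript digits followed by ten subscript digits.
        # Replacement list: 0-9 twice, position for position.
        $string =~ tr[⁰¹²³⁴⁵⁶⁷⁸⁹₀₁₂₃₄₅₆₇₈₉]
                     [01234567890123456789];
        return $string;
    }

    binmode(STDOUT, ':encoding(UTF-8)');
    print plain_digits("x² + H₂O"), "\n";   # prints: x2 + H2O
    ```

    A single tr covers all twenty characters, so there is no need for ten separate substitutions.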

    Another, simpler approach is to use the compatibility normalizations (NFKD and NFKC), which return the base characters instead of their superscript or subscript versions. I haven’t checked whether they are exact inverses of the functions above. You can experiment with them using the nfkd and nfkc scripts.
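
    As a rough sketch of the normalization approach, the core Unicode::Normalize module provides NFKD. Compatibility normalization also rewrites many unrelated characters (ligatures such as ﬁ, fullwidth forms, and so on), so this version applies it only to characters matched by a regex. The name fold_scripts and the choice of code point ranges are assumptions for illustration; they are not taken from the nfkd or nfkc scripts.

    ```perl
    use strict;
    use warnings;
    use utf8;   # the test string below contains literal UTF-8 characters
    use Unicode::Normalize qw(NFKD);

    # Hypothetical helper: apply NFKD only to superscript/subscript characters,
    # leaving the rest of the string untouched.
    sub fold_scripts {
        my ($string) = @_;
        # U+00B2, U+00B3, U+00B9 are the Latin-1 superscripts 2, 3 and 1;
        # U+2070..U+209F is the "Superscripts and Subscripts" block.
        $string =~ s/([\x{B2}\x{B3}\x{B9}\x{2070}-\x{209F}])/NFKD($1)/ge;
        return $string;
    }

    binmode(STDOUT, ':encoding(UTF-8)');
    print fold_scripts("E = mc², CO₂, ﬁ"), "\n";   # prints: E = mc2, CO2, ﬁ
    ```

    Characters outside the matched ranges, such as the ﬁ ligature here, pass through unchanged, whereas calling NFKD on the whole string would also expand it to "fi". Note that some matched characters do not decompose to ASCII: ⁻ (U+207B), for example, becomes U+2212 MINUS SIGN rather than the ASCII hyphen-minus.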