ruby-on-rails ruby unicode-string unicode-normalization

How to convert Unicode styled forms into plain text


I need to convert user input like "𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖" into plain "ASCII" text, i.e. "jovy debbie".

The input comes in different styles, e.g. "𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔" or "𝙶𝚎𝚟𝚒𝚎𝚕𝚒𝚗 𝙽𝚒𝚌𝚘𝚕𝚎 𝙻𝚞𝚖𝚋𝚊𝚍".

Any help would be appreciated. I have already looked at other Stack Overflow questions, but no luck :(


Solution

  • Those letters are from the Mathematical Alphanumeric Symbols block.

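    Within each style, the letters are laid out alphabetically at a fixed offset from their ASCII counterparts (a few letters in some styles, such as double-struck ℂ or ℝ, live in other blocks), so you can use tr with character ranges to map them, e.g.:

    "𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖".tr("𝕒-𝕫", "a-z")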
    #=> "jovy debbie"
    

    The same approach can be used for the other styles and to map lower / upper case, e.g.:

    "𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".tr("𝒂-𝒛𝑨-𝒁", "a-zA-Z")
    #=> "Jenica Dugos"
    

    This gives you full control over the character mapping.
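
    The per-style mappings above can also be collected in one table and applied in sequence. This is only a minimal sketch, assuming the three styles from the question are the ones that matter; `STYLED_RANGES` and `plain_text` are hypothetical names, and the ranges are written as code point escapes so they can be checked against the Unicode charts:

    ```ruby
    # Hypothetical table: styled ranges => plain ASCII ranges.
    # Only the three styles from the question are covered.
    STYLED_RANGES = {
      "\u{1D552}-\u{1D56B}"                    => "a-z",    # double-struck a-z
      "\u{1D482}-\u{1D49B}\u{1D468}-\u{1D481}" => "a-zA-Z", # bold italic a-z, A-Z
      "\u{1D68A}-\u{1D6A3}\u{1D670}-\u{1D689}" => "a-zA-Z"  # monospace a-z, A-Z
    }.freeze

    # Apply every mapping in turn; characters outside the table pass through.
    def plain_text(input)
      STYLED_RANGES.reduce(input) { |text, (from, to)| text.tr(from, to) }
    end

    styled = "\u{1D55B}\u{1D560}\u{1D567}\u{1D56A}" # double-struck "jovy"
    puts plain_text(styled)
    ```

    Supporting another style would then only mean adding one more entry to the table.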

    Alternatively, you could try Unicode normalization. The NFKC / NFKD forms should remove most formatting and seem to work for your examples:

    "𝕛𝕠𝕧𝕪 𝕕𝕖𝕓𝕓𝕚𝕖".unicode_normalize(:nfkc)
    #=> "jovy debbie"
    
    "𝑱𝒆𝒏𝒊𝒄𝒂 𝑫𝒖𝒈𝒐𝒔".unicode_normalize(:nfkc)
    #=> "Jenica Dugos"
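
    NFKC also rewrites other compatibility characters (the ligature "ﬁ" becomes "fi", for example) and leaves accented letters such as "é" in place, so it may be worth checking the result. A minimal sketch of such a check; `normalize_styled` is a hypothetical helper name, not part of Ruby or Rails:

    ```ruby
    # Hypothetical helper: normalize with NFKC and report whether the
    # result consists of ASCII characters only.
    def normalize_styled(input)
      normalized = input.unicode_normalize(:nfkc)
      [normalized, normalized.ascii_only?]
    end

    text, ascii = normalize_styled("\u{1D55B}\u{1D560}\u{1D567}\u{1D56A}") # double-struck "jovy"
    puts text
    puts ascii
    ```

    If the flag comes back false, the input contained characters that NFKC does not reduce to ASCII, and you can decide whether to keep, transliterate, or reject them.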