r html-encode

How to program a "decimal HTML decoder"?


I want to write (in R) a "decimal HTML decoder" like the one implemented on this website:

http://www.hashemian.com/tools/html-url-encode-decode.php

But I'm not sure where to start. Could someone suggest what to read, or which translation table (or formula) to use?

My original motivation is decoding Hebrew characters. For example, translating something like this:

&#1513;&#1500;&#1493;&#1501;

into this:

שלום

(hat tip goes to Matt Shotwell for the pointers)


Solution

  • inp <- "&#x5E9;&#x5DC;&#x5D5;&#x5DD;"
    parts <- strsplit(inp, "&")[[1]]
    parts <- parts[nzchar(parts)]     # drop the empty piece before the first "&"
    nohash <- sub("#", "0", parts)    # convert "#" to "0", giving a "0x" hex prefix
    nohash
    # [1] "0x5E9;" "0x5DC;" "0x5D5;" "0x5DD;"
    strtoi( sub(";", "", nohash), 16L )  # remove trailing ";" and parse as hex
    # [1] 1513 1500 1493 1501
    

    Edit: the time has expired on adding to my comment, so I'll add this link, which seems to have a conversion table:
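
The snippet in the answer stops at the numeric code points. Going from there to the decoded text could look roughly like the sketch below. It is not part of the original answer: the function name `decode_numeric_entities` is hypothetical, and the sketch assumes a single input string in which only numeric references of the form `&#1513;` (decimal) or `&#x5E9;` (hexadecimal) need decoding. Named entities such as `&amp;` are left untouched. It relies on base R's `intToUtf8()` to turn code points into characters.

```r
# Hypothetical helper (an illustration, not the answerer's code):
# decode decimal (&#1513;) and hexadecimal (&#x5E9;) character references.
decode_numeric_entities <- function(x) {
  stopifnot(is.character(x), length(x) == 1L)

  # Find every numeric reference: "&#" + decimal digits, or "&#x" + hex digits, + ";"
  m <- gregexpr("&#([0-9]+|[xX][0-9A-Fa-f]+);", x)
  refs <- regmatches(x, m)[[1]]
  if (length(refs) == 0L) return(x)   # nothing to decode

  # Strip the "&#" / "&#x" prefix and the trailing ";" to leave only the digits
  is_hex <- grepl("^&#[xX]", refs)
  digits <- gsub("^&#[xX]?|;$", "", refs)

  # Convert the digit strings to integer code points, using the right base for each
  codes <- integer(length(refs))
  codes[is_hex]  <- strtoi(digits[is_hex], 16L)
  codes[!is_hex] <- as.integer(digits[!is_hex])

  # One UTF-8 character per code point, written back in place of each reference
  chars <- intToUtf8(codes, multiple = TRUE)
  regmatches(x, m) <- list(chars)
  x
}

decoded <- decode_numeric_entities("&#1513;&#1500;&#1493;&#1501;")
utf8ToInt(decoded)
# [1] 1513 1500 1493 1501
```

In a UTF-8 locale, printing `decoded` should display the Hebrew word from the question. The hexadecimal form used in the answer, `"&#x5E9;&#x5DC;&#x5D5;&#x5DD;"`, goes through the same function. No translation table is needed for numeric references, because the number in each reference is the Unicode code point itself.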