character-encoding, rebol, file-conversion, rebol3

Perform file encoding conversion with Rebol 3


I want to use Rebol 3 to read a file in Latin1 and convert it to UTF-8. Is there a built-in function I can use, or some external library? Where can I find it?


Solution

  • Rebol has an invalid-utf? function that scans a binary value and returns the position of the first byte that is not part of a valid UTF-8 sequence, or none if every byte is valid. We can loop until we've found each such byte and replaced it with its UTF-8 encoding, then convert the binary value to a string:

    latin1-to-utf8: function [binary [binary!]][
        mark: :binary
        while [mark: invalid-utf? mark][
            change/part mark to char! mark/1 1
        ]
        to string! binary
    ]
    

    This function modifies the original binary. To leave the binary value intact, we can build a new string instead:

    latin1-to-utf8: function [binary [binary!]][
        mark: :binary
        to string! rejoin collect [
            while [mark: invalid-utf? binary][
                keep copy/part binary mark  ; keeps the portion up to the bad byte
                keep to char! mark/1        ; converts the bad byte to good bytes
                binary: next mark           ; set the series beyond the bad byte
            ]
            keep binary                     ; keep whatever is remaining
        ]
    ]
    

    Bonus: here's a wee Rebmu version of the above. Call it as rebmu/args snippet #{DECAFBAD}, where snippet is:

    ; modifying
    IUgetLOAD"invalid-utf?"MaWT[MiuM][MisMtcTKm]tsA
    
    ; copying
    IUgetLOAD"invalid-utf?"MaTSrjCT[wt[MiuA][kp copy/partAmKPtcFm AnxM]kpA]
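
To tie this back to the file conversion in the question, here is a minimal sketch of the full round trip. It repeats the copying version of latin1-to-utf8 so that it runs on its own. The file names %latin1.txt and %utf8.txt are placeholders, and the sample bytes are made up for illustration. It assumes Rebol 3, where read returns a binary! by default.

```rebol
; Copying version of latin1-to-utf8 from the answer above,
; repeated here so the snippet runs standalone.
latin1-to-utf8: function [binary [binary!]][
    to string! rejoin collect [
        while [mark: invalid-utf? binary][
            keep copy/part binary mark  ; valid bytes before the bad byte
            keep to char! mark/1        ; bad byte re-encoded as a character
            binary: next mark           ; continue after the bad byte
        ]
        keep binary                     ; whatever remains is already valid
    ]
]

; Placeholder file names. Create a tiny Latin1 sample:
; the ASCII bytes for "caf" followed by the single byte E9.
write %latin1.txt #{636166E9}

data: read %latin1.txt              ; raw bytes as a binary! value
text: latin1-to-utf8 data           ; string! built from the repaired bytes
write %utf8.txt to binary! text     ; to binary! encodes the string as UTF-8
```

One caveat with this approach: only bytes that fail UTF-8 validation are converted, so a run of Latin1 bytes that happens to form a valid UTF-8 sequence would pass through unchanged.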