
Powerbuilder: ImportFile of UTF-8 (Converting UTF-8 to ANSI)


My PowerBuilder version is 6.5. I cannot use a higher version because this is the one I am supporting.

My problem is that when I call dw_1.ImportFile(file), the first column of the first row starts with a funny string like this:

ï»¿

I did not understand this until I opened the file, saved it as a new text file, and imported that new file, which worked flawlessly without the funny string.

My conclusion is that this happens because the original file is UTF-8 (as shown in Notepad++) and the new file is ANSI. The file I am trying to import is delivered automatically by a third party, and my users don't want the extra job of re-saving it.

How do I force-convert these files to ANSI in PowerBuilder? If there is no way, I might have to do a command prompt conversion. Any ideas?


Solution

  • The weird characters (ï»¿ when the file is read as Windows-1252) are the optional UTF-8 BOM, which tells editors that the file is UTF-8 encoded; without it, the encoding can be hard to detect until a character above code 127 shows up. You cannot simply strip the BOM: if the file contains any character above 127 (accents or other special characters), you will still have garbage in the displayed data (for example: é -> Ã©, € -> â‚¬, ...), because each special character turns into 2 to 4 garbage characters.

    I recently needed to convert some UTF-8 encoded strings to "ANSI" Windows-1252 encoding. With PB10 and later, re-encoding between UTF-8 and ANSI is as simple as

    b = blob(s, encodingutf8!)
    s2 = string(b, encodingansi!)
    

    But string() and blob() do not support encoding specification before the release 10 of PB.

    What you can do is to read the file yourself, skip the BOM, ask Windows to convert the string encoding via MultiByteToWideChar() + WideCharToMultiByte() and load the converted string in the DW with ImportString().

    Proof of concept to get the file contents (with this reading method, the file cannot be bigger than 2GB):

    string ls_path, ls_file, ls_chunk, ls_ansi
    ls_path = sle_path.text
    int li_file
    if not fileexists(ls_path) then return
    
    li_file = FileOpen(ls_path, streammode!)
    if li_file > 0 then
        FileSeek(li_file, 3, FromBeginning!) //skip the 3-byte utf-8 BOM (assumes the file always starts with one)
    
        //read the file by blocks, FileRead is limited to 32kB
        do while FileRead(li_file, ls_chunk) > 0
            ls_file += ls_chunk //concatenate in loop works but is not so performant
        loop
    
        FileClose(li_file)
    
        ls_ansi = utf8_to_ansi(ls_file)
        dw_tab.importstring( text!, ls_ansi)
    end if
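
    The FileSeek() call above skips three bytes unconditionally, so a file delivered without a BOM would lose its first three characters. A minimal sketch of a conditional check follows. The function name f_strip_utf8_bom is hypothetical (it is not part of the original answer), and the sketch assumes an ANSI build of PowerBuilder (before PB10), where each character is one byte and Asc() returns values from 0 to 255:

    ```powerscript
    // Hypothetical global function: string f_strip_utf8_bom (string as_data)
    // Removes a leading UTF-8 BOM (bytes EF BB BF = 239 187 191) when present.
    // Assumption: ANSI PowerBuilder, one byte per character.

    if Len(as_data) >= 3 then
        if Asc(Mid(as_data, 1, 1)) = 239 and &
           Asc(Mid(as_data, 2, 1)) = 187 and &
           Asc(Mid(as_data, 3, 1)) = 191 then
            return Mid(as_data, 4) // drop the three BOM bytes
        end if
    end if

    return as_data // no BOM found, return the data unchanged
    ```

    With such a helper, the FileSeek() line would be dropped and ls_file = f_strip_utf8_bom(ls_file) called after the read loop, before the conversion.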
    

    utf8_to_ansi() is a global function. It was written for PB9, but it should work the same with PB6.5:

    global type utf8_to_ansi from function_object
    end type
    
    type prototypes
    function ulong MultiByteToWideChar(ulong CodePage, ulong dwflags, ref string lpmultibytestr, ulong cchmultibyte, ref blob lpwidecharstr, ulong cchwidechar) library "kernel32.dll"
    function ulong WideCharToMultiByte(ulong CodePage, ulong dwFlags, ref blob lpWideCharStr, ulong cchWideChar, ref string lpMultiByteStr, ulong cbMultiByte, ref string lpDefaultChar, ref boolean lpUsedDefaultChar) library "kernel32.dll"
    end prototypes
    
    forward prototypes
    global function string utf8_to_ansi (string as_utf8)
    end prototypes
    
    global function string utf8_to_ansi (string as_utf8);
    
    //convert utf-8 -> ansi
    //use a wide-char native string as pivot
    
    constant ulong CP_ACP = 0
    constant ulong CP_UTF8 = 65001
    
    string ls_wide, ls_ansi, ls_null
    blob lbl_wide
    ulong ul_len
    boolean lb_flag
    
    setnull(ls_null)
    lb_flag = false
    
    //get utf-8 string length converted as wide-char
    setnull(lbl_wide)
    ul_len = multibytetowidechar(CP_UTF8, 0, as_utf8, -1, lbl_wide, 0)
    //allocate buffer to let windows write into
    ls_wide = space(ul_len * 2)
    lbl_wide = blob(ls_wide)
    //convert utf-8 -> wide char
    ul_len = multibytetowidechar(CP_UTF8, 0, as_utf8, -1, lbl_wide, ul_len)
    //get the final ansi string length
    setnull(ls_ansi)
    ul_len = widechartomultibyte(CP_ACP, 0, lbl_wide, -1, ls_ansi, 0, ls_null, lb_flag)
    //allocate buffer to let windows write into
    ls_ansi = space(ul_len)
    //convert wide-char -> ansi
    ul_len = widechartomultibyte(CP_ACP, 0, lbl_wide, -1, ls_ansi, ul_len, ls_null, lb_flag)
    
    return ls_ansi
    end function
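
    Neither the proof of concept nor the function checks any return value. The following sketch shows how the calling script might guard the conversion and the import. It assumes ls_file already holds the raw UTF-8 text with the BOM removed, and it assumes that ImportString() on PB 6.5 takes the string as its first argument (the Text! import type used above may only exist in later versions, so check this against the 6.5 documentation):

    ```powerscript
    // Sketch of a caller with basic error checks.
    // ls_file: raw UTF-8 text read from the file, BOM already skipped.
    string ls_ansi
    long ll_rows

    ls_ansi = utf8_to_ansi(ls_file)
    if Len(ls_file) > 0 and Len(ls_ansi) = 0 then
        MessageBox("Import", "UTF-8 to ANSI conversion returned an empty string.")
        return
    end if

    // Assumption: PB 6.5 signature without an import type argument
    ll_rows = dw_tab.ImportString(ls_ansi)
    if ll_rows < 0 then
        // a negative value signals an import error
        MessageBox("Import", "ImportString failed with return code " + String(ll_rows))
    end if
    ```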