node.js utf-8 character-encoding npm-scripts

Parameter from package.json script (encoding problem)


https://nodejs.org/docs/latest/api/process.html#processargv https://www.golinuxcloud.com/pass-arguments-to-npm-script/

I am passing a parameter when invoking a script in package.json, as follows:

--pathToFile=./ESMM/Parametrização_Dezembro_PS1_2022.xlsx

In the code, I retrieve that parameter from the arguments:

const value = process.argv.find( element => element.startsWith( `--pathToFile=` ) );
const pathToFile=value.replace( `--pathToFile=` , '' );
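
The lookup above throws if the flag is missing, because `find` returns `undefined`. A minimal sketch of a more defensive variant follows; `getArgValue` is a hypothetical helper name introduced here for illustration, not part of the original script.

```javascript
// Hypothetical helper (not in the original script): reads "--name=value"
// from an argv-style array and returns undefined when the flag is absent.
function getArgValue(argv, name) {
  const prefix = `--${name}=`;
  const match = argv.find((element) => element.startsWith(prefix));
  // slice() drops only the leading prefix and leaves the value untouched.
  return match === undefined ? undefined : match.slice(prefix.length);
}

const pathToFile = getArgValue(process.argv, "pathToFile");
if (pathToFile === undefined) {
  console.error("Missing --pathToFile=<path> argument");
}
```

This only guards against a missing argument; it does not change how the value is encoded.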

The string I obtain seems to be in the wrong format/encoding:

./ESMM/Parametriza├º├úo_Dezembro_PS1_2022.xlsx

I tried converting to latin1 (other past issues were fixed with this encoding)

const buffer = require("buffer"); // needed for buffer.transcode
const latin1Buffer = buffer.transcode(Buffer.from(pathToFile), "utf8", "latin1");
const latin1String = latin1Buffer.toString("latin1");

but I still don't get the string in the correct encoding:

./ESMM/Parametriza?º?úo_Dezembro_PS1_2022.xlsx

My package.json is in UTF-8.

My current locale is (chcp): Active code page: 850

OS: Windows

This seems to be related to:

I will try those configurations.

    // Bounds of the UTF-16 surrogate range (only used by the commented-out check below)
    const min = 0xD800, max = 0xDFFF;
    console.log(min); // 55296
    console.log(max); // 57343

    let textFiltered = "", specialChars = 0;
    for (const charAux of pathToFile) {
        // Hex of the character's UTF-8 bytes: 2 digits for ASCII, more otherwise
        const hexChar = Buffer.from(charAux, 'utf8').toString('hex');
        console.log(hexChar);
        const intChar = parseInt(hexChar, 16);
        if (hexChar.length > 2) {
        //if (intChar > min && intChar < max) {
            specialChars++;
            console.log(`specialChars(${specialChars}): ${hexChar}`);
        } else {
            textFiltered += String.fromCharCode(intChar);
        }
    }

console.log(textFiltered); //normal characters

./ESMM/Parametrizao_Dezembro_PS1_2022.xlsx

console.log(`specialChars(${specialChars}): ${hexChar}`); //specialCharacters

specialChars(1): e2949c
specialChars(2): c2ba
specialChars(3): e2949c
specialChars(4): c3ba

It seems that the hex value e2949c marks a special character, since it repeats, and that 0xc2ba should ideally convert to "ç" and 0xc3ba to "ã". I am still trying to figure that out.
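
One possible reading of these hex values, offered as a sketch rather than a confirmed diagnosis: the UTF-8 bytes of the argument may have been decoded as code page 850 before reaching `process.argv`. The example below uses a hand-written partial CP850 table (`CP850_PARTIAL` and `decodeAsCp850` are hypothetical names covering only the three bytes involved) to show that this assumption reproduces the same hex values.

```javascript
// Partial CP850 table: only the three byte values involved in this example.
// Assumption: any other non-ASCII byte would need a full CP850 table.
const CP850_PARTIAL = new Map([
  [0xc3, "\u251c"], // ├
  [0xa7, "\u00ba"], // º
  [0xa3, "\u00fa"], // ú
]);

// Decode bytes as if they were CP850; ASCII bytes map to themselves.
function decodeAsCp850(bytes) {
  let out = "";
  for (const byte of bytes) {
    if (byte < 0x80) {
      out += String.fromCharCode(byte);
    } else if (CP850_PARTIAL.has(byte)) {
      out += CP850_PARTIAL.get(byte);
    } else {
      throw new Error(`Byte 0x${byte.toString(16)} is not in the partial table`);
    }
  }
  return out;
}

// "ç" is C3 A7 and "ã" is C3 A3 in UTF-8.
const utf8Bytes = Buffer.from("Parametriza\u00e7\u00e3o", "utf8");
const garbled = decodeAsCp850(utf8Bytes);
console.log(garbled);

// Hex of each non-ASCII garbled character once re-encoded as UTF-8.
for (const ch of garbled) {
  if (ch.codePointAt(0) > 0x7f) {
    console.log(ch, Buffer.from(ch, "utf8").toString("hex"));
  }
}
```

Under this reading, "├" is the CP850 rendering of the lead byte 0xC3 shared by "ç" and "ã", which would explain why e2949c repeats.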

Each Unicode code point can be written in a string as \u{xxxxxx}, where xxxxxx represents 1–6 hex digits.
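
As a small illustration of that escape syntax, the two accented letters in the file name can be written by code point (U+00E7 for "ç", U+00E3 for "ã"). The variable names are chosen only for this sketch.

```javascript
// Code point escapes for the two accented letters in the file name.
const cCedilla = "\u{e7}"; // U+00E7
const aTilde = "\u{e3}";   // U+00E3

const word = `Parametriza${cCedilla}${aTilde}o`;
console.log(word);

// codePointAt() goes the other way: from a character to its code point.
for (const ch of [cCedilla, aTilde]) {
  console.log(ch, "U+" + ch.codePointAt(0).toString(16).toUpperCase().padStart(4, "0"));
}
```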


Solution

  • As @JosefZ indicated (though for Python), in my case I am going to use a direct conversion, since the parameter will always contain the keyword "Parametrização".

    The problem I encountered in this case is that my package.json and my script are in the correct format (UTF-8), as stated by @tripleee (thanks for the help provided), but process.argv returns a <string[]> that is basically UTF-16. So my solution is to deal with the ├, which in hex is "e2949c", and retrieve the correct characters:

    const UTF8_Character = "e2949c"; // ├ (UTF-8 bytes in hex); dropped from the output
    // For these cases, use this lookup object that holds the correct characters
    const personalized_encoding = {
        "c2ba": "ç",
        "c3ba": "ã"
    };

    let textFiltered = "", specialChars = 0;
    for (const charAux of pathToFile) {
        const hexChar = Buffer.from(charAux, 'utf8').toString('hex');
        //console.log(hexChar)
        const intChar = parseInt(hexChar, 16);
        if (hexChar.length > 2) {
            if (hexChar === UTF8_Character) continue;
            specialChars++;
            //console.log(`specialChars(${specialChars}): ${hexChar}`);
            // Keep characters missing from the table instead of appending "undefined"
            textFiltered += personalized_encoding[hexChar] ?? charAux;
        } else {
            textFiltered += String.fromCharCode(intChar);
        }
    }

    console.log(textFiltered);
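
The same idea can be sketched slightly more generally, assuming the garbling comes from UTF-8 bytes being read as code page 850: map each garbled character back to its CP850 byte, then decode the resulting bytes as UTF-8. `CP850_REVERSE_PARTIAL` and `repairCp850Mojibake` are hypothetical names, and the table covers only the three characters seen in this case.

```javascript
// Partial reverse table: garbled character -> original CP850 byte.
// Assumption: extend it with more CP850 entries for other accented letters.
const CP850_REVERSE_PARTIAL = new Map([
  ["\u251c", 0xc3], // ├
  ["\u00ba", 0xa7], // º
  ["\u00fa", 0xa3], // ú
]);

function repairCp850Mojibake(text) {
  const bytes = [];
  for (const ch of text) {
    const code = ch.codePointAt(0);
    if (code < 0x80) {
      bytes.push(code); // ASCII is identical in CP850 and UTF-8
    } else if (CP850_REVERSE_PARTIAL.has(ch)) {
      bytes.push(CP850_REVERSE_PARTIAL.get(ch));
    } else {
      // Unknown character: return the input unchanged rather than guess.
      return text;
    }
  }
  // Reinterpret the recovered bytes as UTF-8.
  return Buffer.from(bytes).toString("utf8");
}

const garbledPath = "./ESMM/Parametriza\u251c\u00ba\u251c\u00fao_Dezembro_PS1_2022.xlsx";
console.log(repairCp850Mojibake(garbledPath));
```

Compared with a per-word lookup, this does not need one entry per accented letter, only one per CP850 byte, but it still depends on the assumption about code page 850 holding on the machine that runs the script.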