I am parsing a binary file to extract its text content. This is for a library that can run in either a Node environment or a web browser. I need to convert every encoded character to its human-readable form. For example, I receive a string like
'Señor and salvación and Number%3A 1234%3B %06%88'
and I expect the output to be
'Señor and salvación and Number: 1234; ♠'
Currently I am using a mixture of decoding and escaping, based on a function I found in another SO question. I am absolutely OK with throwing it away in favor of something that works better. I know what I am doing is not ideal, but I am unsure what I need to do to make this work correctly. The example below shows that function and the steps leading to the final output, which is close but not perfect.
The other problem is that decodeURIComponent will sometimes throw a "URIError: URI malformed" error, depending on the input I give it.
function escapeString(str) {
    //A replacement for the deprecated escape method
    //https://stackoverflow.com/a/37303214/79677
    const allowed = 'ABCDEFGHIJKLMNOPQRSTUVWXYZabcdefghijklmnopqrstuvwxyz0123456789@*_+-./,';
    str = str.toString();
    const len = str.length;
    let R = '';
    let k = 0;
    let S = '';
    let chr = '';
    let ord = 0;
    while (k < len) {
        chr = str[k];
        if (allowed.indexOf(chr) !== -1) {
            S = chr;
        } else {
            ord = str.charCodeAt(k);
            if (ord < 256) {
                S = '%' + ('00' + ord.toString(16)).toUpperCase().slice(-2);
            } else {
                S = '%u' + ('0000' + ord.toString(16)).toUpperCase().slice(-4);
            }
        }
        R += S;
        k++;
    }
    return R;
}

const str = 'Señor and salvación and Number%3A 1234%3B %06%88';
//Expecting: 'Señor and salvación and Number: 1234; ♠'
console.log(1, str);
console.log(2, escapeString(str));
console.log(3, decodeURIComponent(escapeString(str)));
console.log(4, unescape(decodeURIComponent(escapeString(str))));
How can I properly, correctly, and consistently decode/convert my strings to the human-readable versions?
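You are dealing with mojibake mixed with percent-encoding: the text is UTF-8 whose bytes were decoded as Windows-1252 (cp1252), and some characters are additionally percent-encoded. Here is the fix in Python, chosen because it reads clearly: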
from urllib.parse import unquote
string = 'Señor and salvación and Number%3A 1234%3B'
unquote(string).encode('cp1252').decode('utf-8')
# 'Señor and salvación and Number: 1234;'
The same approach rewritten in JavaScript (apologies for the clumsy code):
// Copy a plain array of byte values into a Uint8Array.
// https://stackoverflow.com/a/34821126/3439404
function byteToUint8Array(byteArray) {
    var uint8Array = new Uint8Array(byteArray.length);
    for (var i = 0; i < uint8Array.length; i++) {
        uint8Array[i] = byteArray[i];
    }
    return uint8Array;
}

// Treat each character's code as one byte. This equals the Windows-1252
// byte only for characters outside the 0x80-0x9F range (so not for '€').
function getBytes(txt) {
    var bytes = [];
    for (var i = 0; i < txt.length; ++i) {
        bytes.push(txt.charCodeAt(i));
    }
    return byteToUint8Array(bytes);
}

var decoder = new TextDecoder("utf-8");
const str = 'Señor and salvación and Number%3A 1234%3B'; // %06%88';
//Expecting: 'Señor and salvación and Number: 1234;' // ♠'
console.log(1, str);
// Percent-decode first, then reinterpret the characters as UTF-8 bytes.
console.log(5, decoder.decode(getBytes(decodeURIComponent(str))));
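The snippet above still relies on decodeURIComponent, which throws a URIError for sequences such as %88 that are not valid UTF-8 on their own, and on charCodeAt, which does not match Windows-1252 for bytes 0x80-0x9F. A minimal sketch that avoids both is shown below. It assumes the whole input is mojibake (UTF-8 read as Windows-1252) and that every %XX stands for one raw byte; the names `repairText`, `CP1252_HIGH` and `CP1252_BYTE` are hypothetical and not part of any library.

```javascript
// Code points that Windows-1252 assigns to bytes 0x80-0x9F (0 = unassigned).
const CP1252_HIGH = [
    0x20AC, 0, 0x201A, 0x0192, 0x201E, 0x2026, 0x2020, 0x2021,
    0x02C6, 0x2030, 0x0160, 0x2039, 0x0152, 0, 0x017D, 0,
    0, 0x2018, 0x2019, 0x201C, 0x201D, 0x2022, 0x2013, 0x2014,
    0x02DC, 0x2122, 0x0161, 0x203A, 0x0153, 0, 0x017E, 0x0178,
];

// Reverse lookup: code point -> Windows-1252 byte.
const CP1252_BYTE = new Map();
CP1252_HIGH.forEach((codePoint, i) => {
    if (codePoint !== 0) CP1252_BYTE.set(codePoint, 0x80 + i);
});

// Hypothetical helper: undo percent-encoding and mojibake in one pass.
// Assumption: the entire input was UTF-8 misread as Windows-1252.
function repairText(input) {
    const encoder = new TextEncoder();
    const bytes = [];
    for (let i = 0; i < input.length; i++) {
        const hex = input.slice(i + 1, i + 3);
        if (input[i] === '%' && /^[0-9A-Fa-f]{2}$/.test(hex)) {
            bytes.push(parseInt(hex, 16)); // %XX becomes the raw byte XX
            i += 2;
            continue;
        }
        const codePoint = input.codePointAt(i);
        const char = String.fromCodePoint(codePoint);
        if (CP1252_BYTE.has(codePoint)) {
            bytes.push(CP1252_BYTE.get(codePoint));
        } else if (codePoint < 256) {
            bytes.push(codePoint);
        } else {
            // Not a single Windows-1252 byte: keep the character as UTF-8.
            bytes.push(...encoder.encode(char));
        }
        i += char.length - 1; // skip the second half of a surrogate pair
    }
    // The default (non-fatal) decoder substitutes U+FFFD instead of throwing.
    return new TextDecoder('utf-8').decode(Uint8Array.from(bytes));
}

console.log(repairText('Señor and salvación and Number%3A 1234%3B'));
// → Señor and salvación and Number: 1234;
```

Because invalid byte sequences become U+FFFD rather than raising an exception, malformed input such as %88 degrades to a replacement character. Text that was never mojibake to begin with (for example a genuine `ñ`) would be damaged by this function, so it should only be applied to strings known to be affected.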
Note that the two trailing characters in your string (percent-encoded as %06%88) are non-printable control codes:
- `␆` (U+0006, *Acknowledge*)
- U+0088 (*Character Tabulation Set*), which has no visible glyph
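To check whether decoded text still contains such non-printable codes, something like the following sketch may help. `findControlChars` is a hypothetical helper name; it reports every character in the Unicode "Control" category (Cc), which also covers ordinary tabs and line breaks.

```javascript
// List the control characters (Unicode general category Cc) in a string.
function findControlChars(text) {
    const found = [];
    for (const match of text.matchAll(/\p{Cc}/gu)) {
        const hex = match[0].codePointAt(0).toString(16).toUpperCase().padStart(4, '0');
        found.push({ index: match.index, codePoint: 'U+' + hex });
    }
    return found;
}

console.log(findControlChars('ok \u0006\u0088'));
// → [ { index: 3, codePoint: 'U+0006' }, { index: 4, codePoint: 'U+0088' } ]
```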