
Decode XML/HTML entities from iso-8859-1 charset in NodeJS


I'm receiving Polish text from a SOAP action that has the Polish diacritics encoded as XML entities. As far as I can tell, they are encoded not in UTF-8 but in ISO-8859-1, and I'm struggling to decode them properly in NodeJS.

Example text: Borek Fa&#197;&#130;&#196;&#153;cki

Expected decoding result: Borek Fałęcki

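Current result: mojibake, because each entity is decoded as its own character (for example, &#197; becomes Å) instead of as one byte of a multi-byte sequence.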

I achieved the correct result in PHP using the following code:

echo html_entity_decode('Borek Fa&#197;&#130;&#196;&#153;cki', ENT_QUOTES | ENT_SUBSTITUTE | ENT_XML1, 'ISO-8859-1');

I'm having no luck doing the same in NodeJS. There aren't many complete packages for decoding HTML/XML entities. I have tried both entities and html-entities, but they give the same results, and neither seems to have any charset setting.

const { decode, encode } = require('html-entities');
const entities = require('entities');

const txt = 'Borek Fa&#197;&#130;&#196;&#153;cki';
console.log('html-entities decode', decode(txt));
console.log('utf8-encoding', encode('Borek Fałęcki', {
    mode: 'nonAsciiPrintable',
    numeric: 'decimal',
    level: 'xml',
}));
console.log('entities decode', entities.decodeXML(txt));

Output:

html-entities decode Borek Fałęcki
utf8-encoding Borek Fa&#322;&#281;cki
entities decode Borek Fałęcki

As we can see, when encoded with UTF-8 there is a single entity for each character:

&#322; = ł
&#281; = ę

With ISO-8859-1, there are two entities per character. I have run out of ideas for reproducing PHP's decoding result. If there were no entities, I could simply convert the encoding to UTF-8, but with entities I don't know how to do it properly. I cannot get the other side to send me UTF-8, since this is a closed, old protocol that I have no control over.


Solution

  • The correct XML encoding of Borek Fałęcki is Borek Fa&#322;&#281;cki. The SOAP action XML that you receive is wrongly encoded: it carries one entity per UTF-8 byte rather than one entity per character.

    However, the following expression converts it as needed:

    Buffer.concat(
      "Borek Fa&#197;&#130;&#196;&#153;cki"
      .match(/[^&]+|&#\d+;/g)
      .map(c => c[0] === "&"
        ? Buffer.of(Number(c.substring(2, c.length - 1)))
        : Buffer.from(c))
    ).toString()
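
If the conversion is needed in more than one place, the expression above could be wrapped in a small function. The following is only a sketch under a few assumptions: the name decodeByteEntities is made up for illustration, only decimal entities are treated as bytes, and anything else (named entities such as &amp;, stray ampersands, entity values above 255) is left untouched so that a regular entity decoder can handle it afterwards.

```javascript
// Hypothetical helper, not part of any library: treats each decimal entity
// as one raw byte and decodes the resulting byte sequence as UTF-8.
function decodeByteEntities(text) {
  // Split into plain runs, decimal entities, and stray ampersands.
  const parts = text.match(/[^&]+|&#\d+;|&/g) ?? [];
  return Buffer.concat(
    parts.map((part) => {
      const m = /^&#(\d+);$/.exec(part);
      // Assumption: values above 255 cannot be a single byte,
      // so such entities are kept as literal text.
      return m && Number(m[1]) <= 255
        ? Buffer.of(Number(m[1]))
        : Buffer.from(part, 'utf8');
    })
  ).toString('utf8');
}

console.log(decodeByteEntities('Borek Fa&#197;&#130;&#196;&#153;cki')); // Borek Fałęcki
```

The sample works because 197, 130 and 196, 153 are the UTF-8 byte pairs (0xC5 0x82 and 0xC4 0x99) for ł and ę. Byte sequences that are not valid UTF-8 would come out as replacement characters, so this only suits input that is known to be mis-encoded in this particular way.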