Search code examples
javascriptxlsxnode-xlsx

Weird character conversion when _X2069_ is contained in input sheet for node-xlsx


I'm currently building a small command-line tool with Node and node-xlsx to parse and process some Excel files, and I noticed that whenever the input contains the sequence

_X2069_

node-xlsx seems to convert it to Unicode character U+2069 (Pop Directional Isolate).

Here's a minimal example:

const xlsx = require('node-xlsx');

function getData(excelSheet) {
    data = [];
    excelData = excelSheet[0]["data"]
    length = 0
    // Find the number of rows with actual data and filter out blank rows at the end
    for (let i = 0; i < excelData.length; i++) {
        if (excelData[i][0] != undefined) {
          data.push([]);
          data[i].push(excelData[i][0]); // only first column for now
        }
    }
    return data;
}

const workSheet = xlsx.parse('./input.xlsx');
let rows = getData(workSheet);
console.log(rows);

And here's my input (contents of input.xlsx):

_X2069_
_Y2069_
_Z2069_

The output is:

[ [ '\u2069⁩' ], [ '_Y2069_' ], [ '_Z2069_' ]

which appears like:

screenshot of outtput containing U+2069

My current workaround is to replace all occurrences of U+2069 with the original string, but that seems hacky, and I'm afraid that 2069 isn't the only value that is automagically converted.

How can I prevent this erroneous conversion?


Solution

  • The root cause is a bug in sheetjs (the library that's being used by node-xlsx). From the issue discussion:

    In XLSX, certain unicode characters are encoded using a special representation _x####_ based on the hexadecimal code. For example, the string _x2069_ is actually stored as _x005F_x2069_ (where the first underscore is encoded as _x005F_, the character code for _)

    The current code treats the x as case insensitive, so _X2069_ is treated as the encoded version of the string.

    It's fixed in the current development version (_X is no longer treated as "special", only _x).