I'm currently building a small command-line tool with Node and node-xlsx to parse and process some Excel files, and I noticed that whenever the input contains the sequence
_X2069_
node-xlsx seems to convert it to Unicode character U+2069 (Pop Directional Isolate).
Here's a minimal example:
const xlsx = require('node-xlsx');
function getData(excelSheet) {
data = [];
excelData = excelSheet[0]["data"]
length = 0
// Find the number of rows with actual data and filter out blank rows at the end
for (let i = 0; i < excelData.length; i++) {
if (excelData[i][0] != undefined) {
data.push([]);
data[i].push(excelData[i][0]); // only first column for now
}
}
return data;
}
const workSheet = xlsx.parse('./input.xlsx');
let rows = getData(workSheet);
console.log(rows);
And here's my input (contents of input.xlsx):
_X2069_
_Y2069_
_Z2069_
The output is:
[ [ '\u2069' ], [ '_Y2069_' ], [ '_Z2069_' ]
which appears like:
My current workaround is to replace all occurrences of U+2069 with the original string, but that seems hacky, and I'm afraid that 2069 isn't the only value that is automagically converted.
How can I prevent this erroneous conversion?
The root cause is a bug in sheetjs (the library that's being used by node-xlsx). From the issue discussion:
In XLSX, certain unicode characters are encoded using a special representation
_x####_
based on the hexadecimal code. For example, the string_x2069_
is actually stored as_x005F_x2069_
(where the first underscore is encoded as_x005F_
, the character code for_
)The current code treats the x as case insensitive, so
_X2069_
is treated as the encoded version of the string.
It's fixed in the current development version (_X
is no longer treated as "special", only _x
).