Tags: javascript, python, apache-arrow, feather

How to correctly read an Apache Arrow Feather file produced by pyarrow?


I have been unable to read an Apache Arrow Feather file, produced by a Python script, with the JavaScript library of Arrow. I am using pyarrow and arrow/js from the Apache Arrow project.

I created a simple Python script to create the Feather file:

import pyarrow as pa
import pyarrow.feather as feather

# create a simple feather table to assess reading in JS with arrow/js
int_array = pa.array(list(range(10)))
int_schema = pa.schema([pa.field('Numbers_schema', pa.uint32())])
int_table = pa.Table.from_arrays([int_array], schema=int_schema)

feather.write_feather(int_table, 'simple.arrow', version=2)

If I read the 'simple.arrow' file in Python and display it, for example in a Jupyter notebook, I get the expected result:

| |Numbers_schema|
|--|-------------|
|0|0|
|1|1|
|2|2|
|3|3|
|4|4|
|5|5| 

etc.

However, if I read the file with a simple JavaScript implementation, or with the arrow2csv.js tool provided by the JS library, the resulting data looks something like the table below (ignore the indexes; arrow2csv.js numbers rows starting from 1):

| |"Numbers_schema: UInt32"|
|--|------------------------------|
|1|40|
|2|0|
|3|407708164|
|4|679624800|
|5|8388608|

etc.

So basically, all the values that should be UInt32 are incorrect. To me it seems that the JS implementation doesn't read the Feather file correctly. Is this a bug or am I misunderstanding something with respect to the Feather file format and its use?

Best regards,

-Toni


Solution

  • By default, feather.write_feather() uses LZ4 compression, but the JavaScript library does not support either of the compression codecs (LZ4 or ZSTD) used by the R, Python, and C++ implementations.

    If you want to be able to read using the Arrow-JS library, you must pass compression="uncompressed" as an argument when you write from Python.

    Usually when I try to read a compressed Feather file in JS, it errors because the file is shorter than the expected number of bytes. I would guess that your data here is short enough that the JS implementation is instead grabbing some arbitrary bytes from the footer and trying to interpret them as part of the "Numbers_schema" column.