I found a similar question here: Read CSV with PyArrow
In this answer it references sys.stdin.buffer and sys.stdout.buffer, but I am not exactly sure how that would be used to write the .arrow file, or name it. I can't seem to find the exact information I am looking for in the docs for pyarrow. My file will not have any nans, but it will have a timestamped index. The file is ~100 gb, so loading it into memory simply isn't an option. I tried changing the code, but as I assumed, the code ended up overwriting the previous file every loop.
***This is my first post. I would like to thank all the contributors who answered 99.9% of my other questions before I had even the asked them.
import sys
import pandas as pd
import pyarrow as pa
SPLIT_ROWS = 1 ### used one line chunks for a small test
def main():
writer = None
for split in pd.read_csv(sys.stdin.buffer, chunksize=SPLIT_ROWS):
table = pa.Table.from_pandas(split)
# Write out to file
with pa.OSFile('test.arrow', 'wb') as sink: ### no append mode yet
with pa.RecordBatchFileWriter(sink, table.schema) as writer:
writer.write_table(table)
writer.close()
if __name__ == "__main__":
main()
Below is the code I used in the command line
>cat data.csv | python test.py
As suggested by @Pace, you should consider moving the output file creation outside of the reading loop. Something like this:
import sys
import pandas as pd
import pyarrow as pa
SPLIT_ROWS = 1 ### used one line chunks for a small test
def main():
# Write out to file
with pa.OSFile('test.arrow', 'wb') as sink: ### no append mode yet
with pa.RecordBatchFileWriter(sink, table.schema) as writer:
for split in pd.read_csv('data.csv', chunksize=SPLIT_ROWS):
table = pa.Table.from_pandas(split)
writer.write_table(table)
if __name__ == "__main__":
main()
You also don't have to use sys.stdin.buffer
if you would prefer to specify specific input and output files. You could then just run the script as:
python test.py
By using with
statements, both writer
and sink
will be automatically closed afterwards (in this case when main()
returns). This means it should not be necessary to include an explicit close()
call.