python pandas parquet pyarrow

Using pyarrow, how do you append to a Parquet file?


How do you append to or update a Parquet file with pyarrow?

import numpy as np
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


table2 = pd.DataFrame({'one': [-1, np.nan, 2.5], 'two': ['foo', 'bar', 'baz'], 'three': [True, False, True]})
table3 = pd.DataFrame({'six': [-1, np.nan, 2.5], 'nine': ['foo', 'bar', 'baz'], 'ten': [True, False, True]})


# write_table expects a pyarrow Table, so convert the DataFrame first
pq.write_table(pa.Table.from_pandas(table2), './dataNew/pqTest2.parquet')
# append to pqTest2 here?

I found nothing in the docs about appending to Parquet files. Also, can you use pyarrow with multiprocessing to insert/update the data?


Solution

  • I ran into the same issue and I think I was able to solve it using the following:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    
    
    chunksize = 10000  # number of rows to read per chunk
    
    pqwriter = None
    for i, df in enumerate(pd.read_csv('sample.csv', chunksize=chunksize)):
        table = pa.Table.from_pandas(df)
        # for the first chunk of records
        if i == 0:
            # create a parquet write object giving it an output file
            pqwriter = pq.ParquetWriter('sample.parquet', table.schema)            
        pqwriter.write_table(table)
    
    # close the parquet writer
    if pqwriter:
        pqwriter.close()