Tags: python, pandas, sas

Python pandas read_sas with chunk size option fails with value error on index mismatch


I have a very large SAS file that won't fit in the memory of my server. I simply need to convert it to a Parquet file. To do so, I am reading it in chunks using the chunksize option of the pandas read_sas function. It mostly works and does its job, but after a while it fails with the following error.

This particular SAS file has 79,422,642 rows of data. It is not clear why it fails in the middle.

import pandas as pd

filename = 'mysasfile.sas7bdat'
SAS_CHUNK_SIZE = 2000000

sas_chunks = pd.read_sas(filename, chunksize = SAS_CHUNK_SIZE, iterator = True)
for sasDf in sas_chunks:
    print(sasDf.shape)


(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
(2000000, 184)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 340, in __next__
    da = self.read(nrows=self.chunksize or 1)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 742, in read
    rslt = self._chunk_to_dataframe()
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/io/sas/sas7bdat.py", line 795, in _chunk_to_dataframe
    rslt[name] = pd.Series(self._string_chunk[js, :], index=ix)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/series.py", line 461, in __init__
    com.require_length_match(data, index)
  File "/opt/anaconda3/lib/python3.10/site-packages/pandas/core/common.py", line 571, in require_length_match
    raise ValueError(
ValueError: Length of values (2000000) does not match length of index (1179974)

I just tested the same logic on a smaller SAS file with fewer rows, using a smaller chunk size, as follows. It works fine without any errors and correctly handles the last remaining chunk, which is smaller than the chunk size parameter:

filename = 'mysmallersasfile.sas7bdat'
SAS_CHUNK_SIZE = 1000

sas_chunks = pd.read_sas(filename, chunksize = SAS_CHUNK_SIZE, iterator = True)
for sasDf in sas_chunks:
    print(sasDf.shape)

(1000, 5)
(1000, 5)
(1000, 5)
(1000, 5)
(983, 5)


Solution

  • Perhaps try this code:

    import pandas as pd
    import pyarrow as pa
    import pyarrow.parquet as pq
    
    filename = 'mysasfile.sas7bdat'
    output_filename = 'output.parquet'
    SAS_CHUNK_SIZE = 2000000
    
    writer = None  # ParquetWriter will be created from the first chunk's schema
    
    sas_chunks = pd.read_sas(filename, chunksize=SAS_CHUNK_SIZE, iterator=True)
    
    for i, sasDf in enumerate(sas_chunks):
        print(f"Processing chunk {i+1} with shape {sasDf.shape}")
        
        table = pa.Table.from_pandas(sasDf) # convert pandas DF to Arrow table
            
        if writer is None:
            # Create new Parquet file with 1st chunk
            writer = pq.ParquetWriter(output_filename, table.schema)
        
        writer.write_table(table)  # write Arrow Table to Parquet file
    
    if writer:
        writer.close()
    
    1. It reads the SAS file in chunks using the pd.read_sas function.
    2. pyarrow.parquet.ParquetWriter writes the data to a Parquet file while allowing data to be appended chunk by chunk, which is suitable for such large datasets.
    3. Each chunk is converted to a pyarrow.Table and written to the Parquet file; a quick way to verify the result is sketched below.
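
    To sanity-check the output without loading the whole Parquet file back into memory, you can read just its footer metadata with pyarrow. A minimal sketch, assuming the conversion above finished and output_filename points at the file it wrote:

    import pyarrow.parquet as pq

    output_filename = 'output.parquet'

    pf = pq.ParquetFile(output_filename)   # opens the file and reads only the footer metadata
    print(pf.metadata.num_rows)            # total row count across all row groups
    print(pf.metadata.num_row_groups)      # each written chunk produces one or more row groups
    print(pf.schema_arrow)                 # column names and Arrow types

    The total row count should match the original SAS file (79,422,642 rows in this case) once every chunk has been written.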