Search code examples
sqlamazon-web-servicesamazon-s3amazon-redshift

Error copying parquet file to Redshift having comma separated data in one column


I'm trying to copy the parquet files located in S3 to Redshift and it fails due to one column having comma separated data. Does anyone know how to handle such scenario in parquet files?

Sample Parquet data in file

"column_1" : "data1"
"column_2" : "data2"
....
"column_16" : "test1, test2"

The data in the specific column has comma separated values as "test1,test2"

Redshift Copy Command

COPY schema.table_name
FROM 's3://path/to/parquetfiles/'
IAM_ROLE 'iam_role'
FORMAT AS PARQUET

Error

ERROR: Spectrum Scan Error Detail: ----------------------------------------------- error: Spectrum Scan Error code: 15007 context: Unmatched number of columns between table and file. Table columns: 20, Data columns: 21, File name: https://s3-path.snappy.parquet query: 1212221 location: dory_util.cpp:1445 process: worker_thread [pid=21520] ----------------------------------------------- [ErrorId: 1-63864993-580523e75d18ashsd88894

I tried using ACCEPTINVCHARS but it does not seem to handle this situation.

Expecting the data in column_16 to be copied as it is

column_16
"test1,test2"

Solution

  • Posting an answer here for anyone else if such issue arises.

    The issue was parquet file in total had 21 columns.

    • When I query the parquet file via S3 select and see the parquet file data in JSON, it was showing me only 20 columns (as 1 column was having NULL value) but the JSON output for parquet file did not show me that column

    • When I query the parquet file via PySpark I was able to see those 21 columns/data in parquet file. If some column has NULL value the data was marked as "None" in parquet.

    When I created the redshift table with 21 columns and tried to perform COPY to REDSHIFT, it still gave the same spectrum error. As a temporary solution, I ran a Glue crawler against it and Glue crawler was able to get all the 21 columns and add NULL wherever the data was NULL, but direct COPY to REDSHIFT could not handle it. I'm not 100% sure but it seems the REDSHIFT COPY issue that it cannot read/recognize the NULL value column from parquet files and throws an error.

    As a final solution, In my PySpark code I dumped the results with 21 columns (just as before) but instead of "NULL" or "None" values inserted them as BLANK (double quotes - " ") wherever the given column had no data in it. Further, tried to do COPY to REDSHIFT for this new results having NULL values as "" (double quotes) and the COPY was successful