
Snowflake COPY INTO from stage losing rows


I've been pushing CSV data into Snowflake using a PUT command to stage the file, followed by COPY INTO to load the contents into the target table:

snowsql -c myConnection -q 'PUT file://C:\\MyData.csv @~/MyStagingArea;'

snowsql -c myConnection -q 'USE WAREHOUSE MY_WH; COPY INTO MY_TABLE FROM @~/MyStagingArea/MyData.csv.gz FILE_FORMAT = (type=csv validate_utf8=false null_if=('''') field_optionally_enclosed_by=''""'');'

For the most part, this works fine. However, I split very large files into chunks of 9,000,000 rows and load them using a PowerShell script. Eventually I noticed that, with these larger loads, rows were being missed.

Going through each file upload, I discovered the problem: somewhere along the line, in some instances, rows are being lost, yet the commands aren't registering any errors.

+----------------------------------+
| status                           |
|----------------------------------|
| Statement executed successfully. |
+----------------------------------+
1 Row(s) produced. Time Elapsed: 0.175s
+------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+--------------------------+
| file                         | status | rows_parsed | rows_loaded | error_limit | errors_seen | first_error | first_error_line | first_error_character | first_error_column_name  |
|------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+--------------------------|
| DMyStagingArea/MyData.csv.gz | LOADED |     8999997 |     8999997 |           1 |           0 | NULL        |             NULL |                  NULL | NULL                     |
+------------------------------+--------+-------------+-------------+-------------+-------------+-------------+------------------+-----------------------+--------------------------+
1 Row(s) produced. Time Elapsed: 31.520s

That's only three rows short, but this particular load spans more than 30 files, and by the end it drops something like 40 million rows without errors. There are no errors in my snowsql log files either.
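The per-file numbers can also be cross-checked against Snowflake's own load metadata using the COPY_HISTORY table function. This is a sketch rather than output from my logs; the 24-hour lookback is purely illustrative:

SELECT file_name, row_parsed, row_count, error_count, first_error_message
FROM TABLE(INFORMATION_SCHEMA.COPY_HISTORY(
       TABLE_NAME => 'MY_TABLE',
       START_TIME => DATEADD(hour, -24, CURRENT_TIMESTAMP())))
ORDER BY last_load_time;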

Given that there are no error messages, and given the size of the data, I'm at something of a loss as to where to begin looking for the problem. Does anyone have any insight into what might cause this process to drop rows without raising an error?


Solution

  • One plausible explanation is that the CSV contains multi-line entries, i.e. quoted fields (enclosed in ") that contain a line break or carriage return.

    Example:

    col1,col2
    1,"This
    entry has two lines"
    

    For that reason, directly comparing the file's physical line count against the number of loaded rows can give different figures: each multi-line entry loads as a single row but occupies several lines in the file. The sketch below shows one way to confirm the discrepancy.
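    One way to check for this from within Snowflake is to query the staged file directly and compare logical rows with physical lines. The following is illustrative rather than from the original post: the file format names are made up, and it assumes the file is still in the user stage.

    -- Parses the file the way COPY INTO did: a quoted multi-line entry is one row.
    CREATE OR REPLACE FILE FORMAT csv_quoted
      TYPE = CSV
      FIELD_OPTIONALLY_ENCLOSED_BY = '"';

    -- No quote handling at all: every physical line counts as one record.
    CREATE OR REPLACE FILE FORMAT csv_raw
      TYPE = CSV
      FIELD_DELIMITER = NONE;

    SELECT COUNT(*) FROM @~/MyStagingArea/MyData.csv.gz (FILE_FORMAT => 'csv_quoted'); -- logical rows
    SELECT COUNT(*) FROM @~/MyStagingArea/MyData.csv.gz (FILE_FORMAT => 'csv_raw');    -- physical lines

    If the second count exceeds the first by exactly the number of "missing" rows, no data was lost; the file simply contains fewer logical rows than physical lines.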