Search code examples
snowflake-cloud-data-platformsnowflake-schema

snowflake - header not detected by infer schema


The below code ignores the header in the csv file and the schema is being inferred as c1,c2,c3. The header columns are produced as just another data row. Below is sample data of relevant file

Series_reference    Period  Data_value  Suppressed  STATUS  UNITS   Magnitude   Subject Group   Series_title_1
BDCQ.SEA1AA 2011.06 80078       F   Number  0   Business Data Collection - BDC  Industry by employment variable Filled jobs
create or replace file format mybusinesscsv
type = csv
RECORD_DELIMITER = '\n'
skip_header = 0;
SELECT *
  FROM TABLE(
    INFER_SCHEMA(
      LOCATION=>'@quarterdatastg/employment_1.csv.gz',
      file_format=>'mybusinesscsv'

      )
    );

Solution

  • The point Nick is pointing to is the 'parse header' option also is required.

    we had two file formats, one for the LOAD and the other for the INFER stage:

    create or replace file format DB_NAME.SCHEMA_NAME.CSV_FORMAT 
        type = CSV RECORD_DELIMITER ='\n', FIELD_DELIMITER=',',
        SKIP_HEADER=1, PARSE_HEADER=false, ESCAPE_UNENCLOSED_FIELD='\\', 
       FIELD_OPTIONALLY_ENCLOSED_BY='\"';
    
    create or replace file format DB_NAME.SCHEMA_NAME.CSV_FORMAT_HEADER 
       type = CSV RECORD_DELIMITER ='\n', FIELD_DELIMITER=',',
       SKIP_HEADER=0, PARSE_HEADER=true, ESCAPE_UNENCLOSED_FIELD='\\', 
       FIELD_OPTIONALLY_ENCLOSED_BY='\"';
    

    thus the INFER was like:

                FROM TABLE(INFER_SCHEMA(
                          LOCATION => ''$$ || S3_FOLDER || $$'',
                          FILE_FORMAT => ''DB_NAME.SCHEMA_NAME.CSV_FORMAT_HEADER'',
                          IGNORE_CASE => TRUE
                    ))
    

    and where used in COPY's like:

     COPY INTO DB_NAME.SCHEMA_NAME.TABLE_NAME
          FROM @DB_NAME.SCHEMA_NAME.MEANINGFUL_NAME_STAGE/table_path_tokens
          FILE_FORMAT = DB_NAME.SCHEMA_NAME.CSV_FORMAT;