Tags: python, hive, parquet, duckdb

DuckDB inserts the hive partition values into the parquet file


I have jsonl files partitioned by user_id and report_date. I am converting these jsonl files into parquet files and saving them in the same folder, using the following command in DuckDB:

import duckdb

db = duckdb.connect()  # any DuckDB connection with a .sql() method works here

jsonl_file_path = '/users/user_id=123/report_date=2024-04-30/data.jsonl'
out_path = '/users/user_id=123/report_date=2024-04-30/data.parquet'

db.sql(
    f"""
    COPY (
        SELECT * FROM read_json_auto(
            '{jsonl_file_path}',
            maximum_depth=-1,
            sample_size=-1,
            ignore_errors=true
        )
    )
    TO '{out_path}' (
        FORMAT PARQUET,
        ROW_GROUP_SIZE 100000,
        OVERWRITE_OR_IGNORE 1
    );
    """
)

It works fine, but the problem is that DuckDB is inserting the hive partition values (user_id and report_date) into the parquet file, even though these values are not in the jsonl file itself. I tried adding hive_partitioning = false, but the problem persists. Does anyone know how to solve this?
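
For illustration, the extra columns are already visible in the schema DuckDB infers for the read, before the COPY even runs (a minimal diagnostic sketch using the same connection as above):

# The inferred schema lists user_id and report_date alongside the fields
# that actually exist in data.jsonl, because DuckDB derives them from the
# user_id=.../report_date=... path segments.
db.sql(f"DESCRIBE SELECT * FROM read_json_auto('{jsonl_file_path}')").show()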


Solution

  • If I understand correctly, you want to do a partitioned write and need to use PARTITION_BY.

    When doing a partitioned write, you should not include the hive partition as part of your output path. The partitioned write builds those paths and file names itself, to handle the case where there are many files per partition. You can template the file names using FILENAME_PATTERN.

    jsonl_file_path = '/users/user_id=123/report_date=2024-04-30/data.jsonl'

    out_path = '/users'
    file_pattern = 'data_{i}'  # {i} is expanded per file by DuckDB, not by the f-string

    db.sql(
        f"""
        COPY (
            SELECT * FROM read_json_auto(
                '{jsonl_file_path}',
                maximum_depth=-1,
                sample_size=-1,
                ignore_errors=true
            )
        )
        TO '{out_path}' (
            FORMAT PARQUET,
            PARTITION_BY (user_id, report_date),
            FILENAME_PATTERN '{file_pattern}',
            ROW_GROUP_SIZE 100000,
            OVERWRITE_OR_IGNORE 1
        );
        """
    )
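
    After the partitioned write, the files land under paths like /users/user_id=123/report_date=2024-04-30/data_0.parquet. The whole tree can then be read back as one dataset, with the partition columns reconstructed from the directory names; a sketch, assuming the glob below matches your two-level layout:

    # Sketch: user_id and report_date come from the user_id=.../report_date=...
    # directory names, not from inside the parquet files.
    db.sql(
        """
        SELECT user_id, report_date, count(*) AS n
        FROM read_parquet('/users/*/*/*.parquet', hive_partitioning=true)
        GROUP BY ALL
        """
    ).show()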
    

    If you want to overwrite the single file instead, modify the projection in the SELECT so that it does not use a bare *. The * includes the "virtual" columns that belong to the hive table's partition: they are columns of the hive table encoded in the directory path, not stored in the physical parquet files.
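
    To see exactly which columns the bare * picks up, you can compare the two projections; a quick sketch, reusing the connection and path from above:

    # Sketch: the bare * picks up the path-derived columns; EXCLUDE drops them.
    star = db.sql(f"SELECT * FROM read_json_auto('{jsonl_file_path}')").columns
    trimmed = db.sql(
        f"SELECT * EXCLUDE(user_id, report_date) FROM read_json_auto('{jsonl_file_path}')"
    ).columns
    print(set(star) - set(trimmed))  # {'report_date', 'user_id'}

    With the projection trimmed, the single-file overwrite becomes: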

    jsonl_file_path = '/users/user_id=123/report_date=2024-04-30/data.jsonl'

    out_path = '/users/user_id=123/report_date=2024-04-30/data.parquet'

    db.sql(
        f"""
        COPY (
            -- EXCLUDE drops the path-derived partition columns from the projection
            SELECT * EXCLUDE(user_id, report_date) FROM read_json_auto(
                '{jsonl_file_path}',
                maximum_depth=-1,
                sample_size=-1,
                ignore_errors=true
            )
        )
        TO '{out_path}' (
            FORMAT PARQUET,
            ROW_GROUP_SIZE 100000,
            OVERWRITE_OR_IGNORE 1
        );
        """
    )
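
    Reading the written file back with hive inference disabled (so the path does not re-add the columns) confirms the partition values are no longer stored in the file itself; a verification sketch:

    # Sketch: hive_partitioning=false keeps DuckDB from re-deriving user_id and
    # report_date from the path when reading the single file back.
    db.sql(
        f"DESCRIBE SELECT * FROM read_parquet('{out_path}', hive_partitioning=false)"
    ).show()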