amazon-web-services amazon-s3 amazon-kinesis-firehose aws-glue-data-catalog glue-crawler

Why is Kinesis or Crawler creating partitions in my data?


Context: I'm using Kinesis Firehose to stream data from my Lambda into an S3 bucket according to a Glue schema. Then I run a crawler on my S3 bucket to catalog my data. When written to the Firehose, my data has the following attributes: 'dataset_datetime, attr1, attr2, attr3, attr4...'. I do not define any partitions in the data written from Lambda, in my Kinesis Firehose, or in my Glue catalog. However, when the data lands in my S3 bucket, it is stored in the following directory structure:

year/month/day/hour/dataFile.parquet

Then, when I run my crawler over it, the crawler creates 4 additional partition keys which map to year, month, day and hour. I don't want these attributes to be created.

Question: Why does the Glue crawler create these additional attributes, and how can I prevent it from creating them? Or, how can I prevent Kinesis from creating the above directory structure inside S3 and instead just dump the file with some timestamp?


Solution

  • Why is Kinesis or Crawler creating partitions in my data?

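    To clarify, it is Kinesis Firehose, not the crawler, that creates the folder structure as it writes the data to S3. By default, Firehose prepends a UTC time prefix in the format YYYY/MM/dd/HH to each object key, which produces the year, month, day, and hour folders.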
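
To make the default layout concrete, here is a minimal sketch that reproduces the YYYY/MM/dd/HH prefix for a given arrival time. The function name `default_firehose_prefix` is hypothetical and only imitates the key layout; it is not part of any AWS SDK.

```python
from datetime import datetime, timezone


def default_firehose_prefix(arrival_time: datetime) -> str:
    """Approximate the default S3 key prefix: YYYY/MM/dd/HH/ in UTC."""
    # Convert to UTC first, since the default prefix is based on UTC time.
    utc_time = arrival_time.astimezone(timezone.utc)
    return utc_time.strftime("%Y/%m/%d/%H/")


arrival = datetime(2022, 7, 27, 8, 15, tzinfo=timezone.utc)
print(default_firehose_prefix(arrival) + "file1.parquet")
```

Every distinct hour therefore gets its own folder, which is what the crawler later interprets as four partition levels.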

    Why does glue crawler create these additional attributes and how can I prevent it from creating them?

    Glue Crawler creates partitions (or tables) based on the schema of the data being crawled. If the schemas of the files under the include path are similar, the crawler creates a single table with one partition column for each subfolder level between the include path and the file.

    Example: If the include path is s3://<bucket>/prefix/ and file1.parquet and file2.parquet have a similar schema, then the crawler will create 1 table with 4 partition columns (one for the 2022 level, one for the 07 level, and so on). Because these folders are not named in key=value form, the columns get generic names such as partition_0, partition_1, etc.

    s3://<bucket>/prefix/2022/07/27/08/file1.parquet
    s3://<bucket>/prefix/2022/07/27/09/file2.parquet
    

    You can't directly prevent the crawler from creating partitions. You can move the include path deeper into the folder hierarchy (e.g. set the include path to s3://<bucket>/prefix/2022/07/27/08); each level you descend removes one partition column, and at the deepest folder no partitions are created at all. However, this is probably not what you want, since it results in multiple tables being created (one per include path).
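
The relationship between include path depth and partition columns can be sketched as follows. This is a simplified model of the behavior described above, not the crawler's actual algorithm; `partition_columns` is a hypothetical helper, and the `partition_N` naming mirrors the generic names Glue assigns to folders that are not in key=value form.

```python
def partition_columns(include_path: str, object_uri: str) -> dict:
    """Map each folder between the include path and the file to a partition column."""
    base = include_path.rstrip("/") + "/"
    if not object_uri.startswith(base):
        raise ValueError("object is outside the include path")
    # Everything after the include path, minus the file name itself.
    folders = object_uri[len(base):].split("/")[:-1]
    columns = {}
    for index, folder in enumerate(folders):
        if "=" in folder:
            # Hive-style folder (key=value): the key becomes the column name.
            name, value = folder.split("=", 1)
        else:
            # Unnamed folder: fall back to a generic positional name.
            name, value = f"partition_{index}", folder
        columns[name] = value
    return columns


uri = "s3://<bucket>/prefix/2022/07/27/08/file1.parquet"
print(partition_columns("s3://<bucket>/prefix/", uri))
print(partition_columns("s3://<bucket>/prefix/2022/07/27/08", uri))
```

With the shallow include path the sketch yields four columns; with the include path set to the hour folder it yields none, at the cost of covering only that one folder.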

    Reference: How does a crawler determine when to create partitions? (AWS)

    Or, how can I prevent kinesis from creating the above dir structure inside S3 and instead just dump the file with some timestamp?

    You may be able to achieve what you want with Dynamic Partitioning. Dynamic partitioning lets you replace the default year/month/day/hour prefix with one derived from fields in the records. If your schema has a field whose value never changes, you could theoretically configure Firehose to partition the data on that field and then point the Glue Crawler include path at the resulting partition subfolder.

    Example: Firehose is configured to dynamically partition data on the static_field field of the schema (static_field always has the same value). If the Glue Crawler include path is set to s3://<bucket>/static_field=value/, then a single table will be created with only the columns from the schema (no partitions).

    s3://<bucket>/static_field=value/file1.parquet
    s3://<bucket>/static_field=value/file2.parquet
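
As a rough sketch of what such a Firehose destination configuration might look like, the helper below builds the dictionary that boto3's `create_delivery_stream` accepts as `ExtendedS3DestinationConfiguration`. The helper name, the `errors/` prefix, and the buffering values are assumptions for illustration; the ARNs in the usage lines are placeholders, and the settings needed for Parquet conversion are omitted.

```python
def dynamic_partitioning_config(field: str, bucket_arn: str, role_arn: str) -> dict:
    """Build a destination config that partitions objects on a single record field."""
    return {
        "RoleARN": role_arn,
        "BucketARN": bucket_arn,
        # Key prefix built from the value extracted by the JQ query below.
        "Prefix": f"{field}=!{{partitionKeyFromQuery:{field}}}/",
        # Assumed location for records that fail processing.
        "ErrorOutputPrefix": "errors/",
        # Assumed buffering values; check the limits that apply to your stream.
        "BufferingHints": {"SizeInMBs": 64, "IntervalInSeconds": 60},
        "DynamicPartitioningConfiguration": {"Enabled": True},
        "ProcessingConfiguration": {
            "Enabled": True,
            "Processors": [{
                "Type": "MetadataExtraction",
                "Parameters": [
                    # JQ expression that pulls the partition key out of each JSON record.
                    {"ParameterName": "MetadataExtractionQuery",
                     "ParameterValue": f"{{{field}: .{field}}}"},
                    {"ParameterName": "JsonParsingEngine",
                     "ParameterValue": "JQ-1.6"},
                ],
            }],
        },
    }


config = dynamic_partitioning_config(
    "static_field",
    "arn:aws:s3:::example-bucket",                   # placeholder bucket ARN
    "arn:aws:iam::123456789012:role/example-role",   # placeholder role ARN
)
print(config["Prefix"])
```

Because static_field always carries the same value, every object lands under the same static_field=value/ folder, which is the layout shown in the example above.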
    

    Reference: Dynamic Partitioning in Kinesis Data Firehose (AWS)

    Suggestion: There are a few different ways to manipulate the data/partitioning. My suggestion is to not go against the default behavior for Firehose and Glue Crawler. Instead, consider how the partitioning implementation can be abstracted from the clients/consumers of this data. For example, create a materialized view that excludes partition columns.
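
One way to picture that abstraction is a view that selects only the data columns. The sketch below generates a plain (non-materialized) `CREATE OR REPLACE VIEW` statement, swapped in here as a simpler stand-in for a materialized view. The helper `build_view_sql` and the table and view names are hypothetical; the column names are taken from the question.

```python
def build_view_sql(view_name: str, table_name: str, data_columns: list) -> str:
    """Build a view definition that exposes only the listed (non-partition) columns."""
    column_list = ", ".join(f'"{column}"' for column in data_columns)
    return (
        f'CREATE OR REPLACE VIEW "{view_name}" AS '
        f'SELECT {column_list} FROM "{table_name}"'
    )


sql = build_view_sql(
    "my_data_view",   # hypothetical view name
    "my_data",        # hypothetical table created by the crawler
    ["dataset_datetime", "attr1", "attr2", "attr3", "attr4"],
)
print(sql)
```

Consumers that query the view never see the year, month, day, and hour partition columns, while the underlying table keeps them for partition pruning.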