dataframe, amazon-s3, pyspark, aws-glue, parquet

Glue DynamicFrame is not populating from S3 bucket


I have a Glue job that is not working because the DynamicFrame is not being populated from a Parquet file in S3.

I have pointed it directly at an object that contains data, but the DynamicFrame is still empty.

Example below:

input_dyf = glueContext.create_dynamic_frame.from_options("s3", {
        "paths": ['s3://dev/.test/load_year=2023/load_month=2/load_day=22/.test.parquet'],
        "recurse": False,
        "groupFiles": "inPartition",
    },
    format = "parquet",
    transformation_ctx = "DataSource0"
)

I have similar Glue jobs with the same configuration (and bookmarks off), and this is the only one failing.


Solution

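  • I've tested this on my end with a similar filename and path. What I found was that the filename can't begin with a period (.). A period elsewhere in the S3 path is fine (the .test/ prefix works, and so does the .parquet extension), but the Parquet object's own name cannot start with one. This is most likely because Hadoop-style readers such as Spark treat files whose names start with . or _ as hidden and skip them. Working example: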

    input_dyf = glueContext.create_dynamic_frame.from_options("s3", {
            "paths": ['s3://dev/.test/load_year=2023/load_month=2/load_day=22/test.parquet'],
            "recurse": False,
            "groupFiles": "inPartition",
        },
        format = "parquet",
        transformation_ctx = "DataSource0"
    )
    

    Removing the leading . from .test.parquet (so the object is named test.parquet) seemed to solve this issue. Please test on your end and let me know.
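
    If you want to check paths up front, here is a minimal sketch of the naming rule described above. It assumes Glue skips objects by the usual Hadoop hidden-file convention (final path component starting with . or _), which fits the behavior I saw but which I have not confirmed in the Glue source. The helper name is_hidden_name is made up for illustration and is not part of any Glue or Spark API.

```python
from posixpath import basename
from urllib.parse import urlparse


def is_hidden_name(s3_path: str) -> bool:
    """Return True if the last path component starts with '.' or '_'.

    Hypothetical helper: it only inspects the object name, not the prefix,
    mirroring the observation that '.test/' in the prefix is fine while
    '.test.parquet' as the file name is skipped.
    """
    name = basename(urlparse(s3_path).path.rstrip("/"))
    return name.startswith((".", "_"))


paths = [
    "s3://dev/.test/load_year=2023/load_month=2/load_day=22/.test.parquet",
    "s3://dev/.test/load_year=2023/load_month=2/load_day=22/test.parquet",
]

for p in paths:
    status = "hidden (likely skipped)" if is_hidden_name(p) else "visible"
    print(f"{p} -> {status}")
```

    Running a check like this before calling create_dynamic_frame.from_options would flag the original path and pass the renamed one.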