pandas, amazon-s3, amazon-iam, geopandas

Access denied when reading a parquet file from an S3 bucket with geopandas


After uploading a parquet file containing a locally created GeoDataFrame to S3, I tried to read it from within AWS Glue as follows.

import geopandas as gpd
test_gdf = gpd.read_parquet("s3://bucket_name/key/file.parquet")

However, the following OSError was raised:

OSError: When getting information for key 'key/file.parquet' in bucket 'bucket_name': AWS Error ACCESS_DENIED during HeadObject operation: No response body.

What I found strange is that pandas.read_parquet reads the same file successfully.

import pandas as pd
test_gdf = pd.read_parquet("s3://bucket_name/key/file.parquet")

However, I found that reading the file with pandas and then converting the result back to a GeoDataFrame takes a long time.

Therefore, I want to read the parquet file directly through geopandas.

Other questions pointed to problems with the IAM role or the S3 bucket policy, so I checked both.

Policy attached to the AWS Glue role

{
...
            "Action": [
                "s3:*"
            ],
            "Effect": "Allow",
            "Resource": [
                "*"
            ]
...
}

S3 Bucket Policy

{
    "Version": "2012-10-17",
    "Id": "PolicyForDatalakeBucket",
    "Statement": [
        {
            "Sid": "denyInsecureTransport",
            "Effect": "Deny",
            "Principal": "*",
            "Action": "s3:*",
            "Resource": [
                "arn:aws:s3:::bucket_name/*",
                "arn:aws:s3:::bucket_name"
            ],
            "Condition": {
                "Bool": {
                    "aws:SecureTransport": "false"
                },
                "ArnNotEquals": {
                    "aws:SourceArn": "arn:aws:iam::IAM_USER:role/GLUE_ROLE"
                }
            }
        },
        {
            "Sid": "",
            "Effect": "Allow",
            "Principal": {
                "AWS": [
                    "arn:aws:iam::IAM_USER:role/GLUE_ROLE",
                    "arn:aws:iam::IAM_USER:root"
                ]
            },
            "Action": [
                "s3:GetBucketAcl",
                "s3:ListBucket",
                "s3:GetObject",
                "s3:PutObject"
            ],
            "Resource": [
                "arn:aws:s3:::BUCKET_NAME/*",
                "arn:aws:s3:::BUCKET_NAME"
            ]
        }
    ]
}

What needs to change so that geopandas can read parquet files from S3 successfully?


Solution

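  • One approach that works is to open the file with fsspec and pass the open file object to geopandas:

    import fsspec
    import geopandas as gpd

    # Placeholder URL; replace with the location of your own feather file.
    feather_file = "s3://bucket_name/key/file.feather"

    # fsspec opens the remote object and hands geopandas a file-like object.
    with fsspec.open(feather_file) as f:
        gdf = gpd.read_feather(f)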
    

    To read a feather file stored in an S3 bucket, open it with fsspec and pass the resulting file object to geopandas.read_feather.

    For more details, see the geopandas I/O user guide: https://geopandas.org/en/stable/docs/user_guide/io.html