Tags: amazon-s3, parquet, amazon-athena

Reading Parquet files in s3 with Athena


The goal is to merge multiple parquet files into a single Athena table so that I can query them.

I converted two CSV files to Parquet:

import pandas

pandas.read_csv('a.csv').to_parquet('a.parquet', index=False)
pandas.read_csv('b.csv').to_parquet('b.parquet', index=False)

Each CSV has the format id,name,age, for example:

1,john,20
2,mark,25
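
As a sanity check (this step is my addition, not part of the original post), reading one of the converted files back should show the same rows; this requires a Parquet engine such as pyarrow to be installed:

import pandas

# Read the converted file back; it should match the rows in a.csv
print(pandas.read_parquet('a.parquet'))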

I uploaded these to the S3 bucket s3://my-test-bucket, which is arranged as:

my-test-bucket
  -> folder1
    -> a.parquet
  -> folder2
    -> b.parquet
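
For completeness, a minimal upload sketch using boto3 (assuming AWS credentials are already configured; the bucket and key names come from the layout above):

import boto3

s3 = boto3.client('s3')
# Upload each Parquet file under its own prefix
s3.upload_file('a.parquet', 'my-test-bucket', 'folder1/a.parquet')
s3.upload_file('b.parquet', 'my-test-bucket', 'folder2/b.parquet')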

Then I created the table in Athena via:

CREATE EXTERNAL TABLE `my_table`(
  `id` int, 
  `name` string, 
  `age` int
)
STORED AS PARQUET
LOCATION 's3://my-test-bucket'
tblproperties ("parquet.compress"="SNAPPY");

However, when I try to query the table via:

SELECT * FROM my_table;

I get no rows back. Is there something I am missing?


Solution

  • A possible workaround is to use AWS Data Wrangler (the awswrangler package):

    import awswrangler as wr
    import pandas as pd
    
    path = 's3://my-test-bucket'
    
    # Assume the data has been read from the CSVs; the extra "folder"
    # column is used as the partition key below
    df = pd.DataFrame({
        "id": [1, 2],
        "name": ["John", "Jane"],
        "age": [10, 11],
        "folder": [1, 2]
    })
    
    wr.s3.to_parquet(
        df=df,
        path=path,
        dataset=True,
        mode="overwrite",
        partition_cols=["folder"]
    )
    

    This will generate the following layout in S3:

    my-test-bucket
      -> folder=1
        -> <some hash>.parquet
      -> folder=2
        -> <some hash>.parquet
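
    Building on this, awswrangler can also register the dataset in the Glue catalog and query it through Athena directly, so no manual CREATE EXTERNAL TABLE is needed. A sketch, assuming a Glue database named my_db already exists (the database name is an assumption):

    # Passing database/table registers the partitioned table in Glue
    wr.s3.to_parquet(
        df=df,
        path=path,
        dataset=True,
        mode="overwrite",
        partition_cols=["folder"],
        database="my_db",  # assumed pre-existing Glue database
        table="my_table"
    )

    # Query via Athena; the result comes back as a pandas DataFrame
    print(wr.athena.read_sql_query("SELECT * FROM my_table", database="my_db"))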