pyspark amazon-athena aws-glue

create_dynamic_frame_from_catalog returning zero results


I'm trying to create a Glue DynamicFrame from an Athena table, but I keep getting an empty frame.

  • The Athena table is part of my Glue Data Catalog

  • The create_dynamic_frame call doesn't raise any error. As a sanity check, I tried loading a table that doesn't exist, and that did raise an error.

  • I know the Athena table has data, since querying the exact same table using Athena returns results

  • The table is an external, partitioned JSON table on S3

I'm using pyspark as shown below:

from pyspark.context import SparkContext
from awsglue.context import GlueContext

# Create a Glue context
glueContext = GlueContext(SparkContext.getOrCreate())

# Create a DynamicFrame using the 'raw_data' table
raw_data_df = glueContext.create_dynamic_frame.from_catalog(
    database="***",
    table_name="raw_***")

# Print the record count; I'm getting zero here
print("Count: {}".format(raw_data_df.count()))

# Also getting nothing here
raw_data_df.printSchema()

Is anyone else facing the same issue? Could this be a permissions issue or a Glue bug, given that no errors are raised?


Solution

  • Glue has several poorly documented features and gotchas, which can be frustrating.

    I would suggest checking the following in your Glue job's configuration:

    1. Does the S3 bucket name have the aws-glue-* prefix?
    2. Put the files in an S3 folder and make sure the crawler's table definition points at the folder rather than at an individual file.

    I have also written a blog on LinkedIn about other Glue gotchas if that helps.
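
As a rough way to run the two checks above against a table, here is a minimal sketch. The helper names (check_table_location, fetch_table_location) are made up for illustration and are not part of Glue. It assumes the job role relies on the AWS managed AWSGlueServiceRole policy, whose S3 object permissions are scoped to aws-glue-* names; if your role has its own S3 policy, the prefix check may not apply. The "looks like a file" test is only a heuristic based on a dot in the last path segment.

```python
import posixpath
from urllib.parse import urlparse


def check_table_location(location):
    """Return a list of warnings for a catalog table's S3 location string."""
    parsed = urlparse(location)
    if parsed.scheme not in ("s3", "s3a", "s3n"):
        return ["location is not an S3 URI"]

    warnings = []
    bucket = parsed.netloc
    key = parsed.path.lstrip("/")

    # Check 1: bucket naming (only relevant under the assumed managed policy).
    if not bucket.startswith("aws-glue-"):
        warnings.append("bucket name does not start with aws-glue-")

    # Check 2: heuristic, a dot in the last path segment suggests a file.
    last_segment = posixpath.basename(key)  # empty when the key ends with "/"
    if "." in last_segment:
        warnings.append("location looks like a single file, not a folder")

    return warnings


def fetch_table_location(database, table_name):
    """Read the table's S3 location from the Glue Data Catalog via boto3."""
    import boto3  # imported lazily so the checks above work without boto3

    glue = boto3.client("glue")
    table = glue.get_table(DatabaseName=database, Name=table_name)["Table"]
    return table["StorageDescriptor"]["Location"]
```

With credentials configured, something like check_table_location(fetch_table_location("my_db", "my_table")) (placeholder names) returns an empty list when neither check flags anything. A partition folder whose name contains a dot would trigger a false warning, so treat the output as a hint rather than a verdict.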