apache-spark, apache-spark-sql, parquet, aws-glue, amazon-athena

Why is my Glue table being created with the wrong path?


I'm creating a table in AWS Glue using a Spark job orchestrated by Airflow. The job reads from a JSON file and writes a table; the command I use within the job is the following:

spark.sql(s"CREATE TABLE IF NOT EXISTS $database.$table using PARQUET LOCATION '$path'")

The odd thing here is that I have other tables created by the same job (with different names) that come out without problems, e.g. they get the location

s3://bucket_name/databases/my_db/my_perfectly_created_table

but there is exactly one table that is always created with this location:

s3://bucket_name/databases/my_db/my_problematic_table-__PLACEHOLDER__

I don't know where that -__PLACEHOLDER__ suffix is coming from. I have already tried deleting the table and recreating it, but the same thing always happens with this exact table. The data is in Parquet format at the path:

s3://bucket_name/databases/my_db/my_problematic_table

so I know the problem is just in creating the table correctly: all I get is a single col (array<string>) column when I try to query it in Athena (since there is no data under /my_problematic_table-__PLACEHOLDER__).
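
One quick way to confirm what the metastore actually has for the table (location, input/output formats, serde) is a plain Spark SQL describe:

    // Prints the table's Location, InputFormat, OutputFormat and serde as Glue has them
    spark.sql("DESCRIBE FORMATTED my_db.my_problematic_table").show(100, truncate = false)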

Have any of you guys dealt with this before?


Solution

  • Upon closer inspection in AWS Glue, this specific problematic table had the following config, which is specific to CSV files with custom delimiters:

    Input Format                   org.apache.hadoop.mapred.SequenceFileInputFormat
    Output Format                  org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
    Serde serialization library    org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    

    while my other tables had the config specific to Parquet:

    Input Format                   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
    Output Format                  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
    Serde serialization library    org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
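
    The same storage descriptor can also be read programmatically instead of through the console; a sketch using the AWS SDK for Java v2 Glue client (standard client setup assumed, table names from this example):

    import software.amazon.awssdk.services.glue.GlueClient
    import software.amazon.awssdk.services.glue.model.GetTableRequest

    val glue = GlueClient.create()
    val sd = glue
      .getTable(GetTableRequest.builder().databaseName("my_db").name("my_problematic_table").build())
      .table()
      .storageDescriptor()

    // For the broken table these printed the SequenceFile/LazySimpleSerDe values shown above
    println(sd.inputFormat())
    println(sd.outputFormat())
    println(sd.serdeInfo().serializationLibrary())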
    

    I tried to create the table, forcing the Parquet config, with the following command:

    val path = "s3://bucket_name/databases/my_db/my_problematic_table/"

    // Infer the schema from the Parquet files already at the path
    val my_table = spark.read.format("parquet").load(path)
    val ddlSchema = my_table.schema.toDDL

    // Recreate the table, forcing the Parquet serde and input/output formats
    spark.sql(s"""
      |CREATE TABLE IF NOT EXISTS my_db.manual_myproblematic_table ($ddlSchema)
      |ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
      |STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
      |OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
      |LOCATION '$path'
      |""".stripMargin
    )
    

    but it threw the following error:

    org.apache.spark.SparkException: Cannot recognize hive type string: struct<1:string,2:string,3:string>, column: problematic_column

    so the problem was the naming of those columns: "1", "2" & "3" are not valid identifiers in a Hive type string, which is why the struct type for problematic_column could not be parsed.
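
    If the struct had held data worth keeping, an alternative might have been renaming the nested fields to valid identifiers before recreating the table (a sketch; the f1/f2/f3 aliases are made up):

    import org.apache.spark.sql.functions.{col, struct}

    // Rebuild the struct with Hive-friendly field names instead of "1", "2", "3"
    val renamed = my_table.withColumn(
      "problematic_column",
      struct(
        col("problematic_column").getField("1").as("f1"),
        col("problematic_column").getField("2").as("f2"),
        col("problematic_column").getField("3").as("f3")
      )
    )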

    Given that this struct did not contain valuable info, I ended up dropping it and creating the table again. Now it works like a charm and it has the correct (Parquet) config in Glue.
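
    The workaround, roughly (the _clean suffix on the target path is illustrative; writing the cleaned data back to the exact path being read would clobber the source files mid-read):

    // Drop the struct, rewrite the data to a fresh prefix, then point the table at it
    val cleaned = spark.read.parquet(path).drop("problematic_column")
    val cleanPath = "s3://bucket_name/databases/my_db/my_problematic_table_clean/"
    cleaned.write.mode("overwrite").parquet(cleanPath)

    spark.sql(s"CREATE TABLE IF NOT EXISTS my_db.my_problematic_table USING PARQUET LOCATION '$cleanPath'")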

    Hope this helps anyone