Why is my Glue table being created with the wrong path?

I'm creating a table in AWS Glue using a Spark job orchestrated by Airflow. The job reads JSON and writes a table; the command I use within the job is the following:

spark.sql(s"CREATE TABLE IF NOT EXISTS $database.$table using PARQUET LOCATION '$path'")

The odd thing is that other tables created by the same job (with different names) are created without problems, e.g. they have the location:


There is exactly one table that gets created with this location:


I don't know where that -__PLACEHOLDER__ is coming from. I already tried deleting the table and recreating it, but the same thing always happens for this exact table. The data is in Parquet format in the path:


So I know the problem is only in how the table gets created, because all I get is a single column, col (array<string>), when I query it in Athena (there is no data in /my_problematic_table-__PLACEHOLDER__).

Has anyone dealt with this before?


  • Upon closer inspection in AWS Glue, this specific problematic_table had the following config, which is specific to CSV files and custom delimiters:

    Input Format    org.apache.hadoop.mapred.SequenceFileInputFormat
    Output Format
    Serde serialization library org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

    while my other tables had the Parquet-specific config:

    Input Format
    Output Format
    Serde serialization library

    I tried to create the table while forcing the Parquet config with the following command:

    val path = "s3://bucket_name/databases/my_db/my_problematic_table/"
    val my_table = spark.read.format("parquet").load(path)
    val ddlSchema = my_table.toDF.schema.toDDL
    spark.sql(
      s"""
          |CREATE TABLE IF NOT EXISTS my_db.manual_myproblematic_table($ddlSchema)
          |ROW FORMAT SERDE ''
          |OUTPUTFORMAT ''
          |LOCATION '$path'
          |""".stripMargin)

    but it threw the following error:

    org.apache.spark.SparkException: Cannot recognize hive type string: struct<1:string,2:string,3:string>, column: problematic_column

    So the problem was the naming of the fields "1", "2" and "3" within that struct.

    Given that this struct did not contain valuable info, I ended up dropping it and creating the table again. Now it works like a charm and has the correct (Parquet) config in Glue.

    Hope this helps someone.
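
For anyone hitting the same error, the check and the DDL from the answer above can be sketched in plain Scala as follows. This is only an illustration under assumptions, not the answerer's actual code: the object name ParquetTableDdl and both helper functions are hypothetical, the three class names are the standard Hive Parquet SerDe, input format and output format (the values used in the original post were not preserved), and the check assumes the failure is triggered by struct field names that start with a digit, as in struct<1:string,2:string,3:string>.

```scala
object ParquetTableDdl {
  // Standard Hive class names for Parquet tables. Assumption: these are the
  // usual values, not necessarily the ones used in the original post.
  val Serde = "org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe"
  val InputFormat = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat"
  val OutputFormat = "org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat"

  // Matches a struct field name that starts with a digit, directly after
  // '<' or ',' and before ':' (optionally wrapped in backticks),
  // e.g. the "1" in "struct<1:string>".
  private val DigitLedField = """(?<=[<,])\s*`?(\d\w*)`?\s*:""".r

  // Returns the struct field names in a type string that start with a digit.
  def digitLedFields(typeString: String): List[String] =
    DigitLedField.findAllMatchIn(typeString).map(_.group(1)).toList

  // Builds a CREATE TABLE statement that spells out the Parquet SerDe and
  // input/output formats instead of relying on defaults.
  def createTableSql(db: String, table: String, ddlSchema: String, path: String): String =
    s"""CREATE TABLE IF NOT EXISTS $db.$table ($ddlSchema)
       |ROW FORMAT SERDE '$Serde'
       |STORED AS INPUTFORMAT '$InputFormat'
       |OUTPUTFORMAT '$OutputFormat'
       |LOCATION '$path'""".stripMargin
}
```

In a Spark job, one might call ParquetTableDdl.digitLedFields(field.dataType.catalogString) for each field of the DataFrame schema, drop or rename any column that gets flagged, and then pass the result of createTableSql to spark.sql.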