apache-spark, apache-spark-sql, parquet, aws-glue, amazon-athena

Why is my Glue table being created with the wrong path?


I'm creating a table in AWS Glue using a Spark job orchestrated by Airflow. The job reads from a JSON file and writes a table; the command I use within the job is the following:

spark.sql(s"CREATE TABLE IF NOT EXISTS $database.$table using PARQUET LOCATION '$path'")

The odd thing here is that I have other tables created by the same job (with different names) that come out without problems, e.g. they get the location

s3://bucket_name/databases/my_db/my_perfectly_created_table

but there is exactly one table that is always created with this location:

s3://bucket_name/databases/my_db/my_problematic_table-__PLACEHOLDER__

I don't know where that -__PLACEHOLDER__ suffix is coming from. I have already tried deleting the table and recreating it, but the same thing always happens with this exact table. The data is in Parquet format at the path:

s3://bucket_name/databases/my_db/my_problematic_table

so I know the problem is just in creating the table correctly: all I get is a single col (array<string>) column when I try to query it in Athena (since there is no data under /my_problematic_table-__PLACEHOLDER__).
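
One quick way to confirm what the metastore actually has for the table (location, input/output formats, serde) is a plain Spark SQL describe:

    // Prints the table's Location, InputFormat, OutputFormat and serde as Glue has them
    spark.sql("DESCRIBE FORMATTED my_db.my_problematic_table").show(100, truncate = false)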

Have any of you guys dealt with this before?


Solution

  • Upon closer inspection in AWS Glue, this specific problematic table had the following config, which is specific to CSV files with custom delimiters:

    Input Format                   org.apache.hadoop.mapred.SequenceFileInputFormat
    Output Format                  org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
    Serde serialization library    org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
    

    while my other tables had the config specific to Parquet:

    Input Format                   org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat
    Output Format                  org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat
    Serde serialization library    org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe
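
    The same storage descriptor can also be read programmatically instead of through the console; a sketch using the AWS SDK for Java v2 Glue client (standard client setup assumed, table names from this example):

    import software.amazon.awssdk.services.glue.GlueClient
    import software.amazon.awssdk.services.glue.model.GetTableRequest

    val glue = GlueClient.create()
    val sd = glue
      .getTable(GetTableRequest.builder().databaseName("my_db").name("my_problematic_table").build())
      .table()
      .storageDescriptor()

    // For the broken table these printed the SequenceFile/LazySimpleSerDe values shown above
    println(sd.inputFormat())
    println(sd.outputFormat())
    println(sd.serdeInfo().serializationLibrary())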
    

    I tried to create the table, forcing the Parquet config, with the following command:

    val path = "s3://bucket_name/databases/my_db/my_problematic_table/"

    // Infer the schema from the Parquet files already at the path
    val my_table = spark.read.format("parquet").load(path)
    val ddlSchema = my_table.schema.toDDL

    // Recreate the table, forcing the Parquet serde and input/output formats
    spark.sql(s"""
      |CREATE TABLE IF NOT EXISTS my_db.manual_myproblematic_table ($ddlSchema)
      |ROW FORMAT SERDE 'org.apache.hadoop.hive.ql.io.parquet.serde.ParquetHiveSerDe'
      |STORED AS INPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetInputFormat'
      |OUTPUTFORMAT 'org.apache.hadoop.hive.ql.io.parquet.MapredParquetOutputFormat'
      |LOCATION '$path'
      |""".stripMargin
    )
    

    but it threw the following error:

    org.apache.spark.SparkException: Cannot recognize hive type string: struct<1:string,2:string,3:string>, column: problematic_column

    so the problem was the naming of those columns: "1", "2" & "3" are not valid identifiers in a Hive type string, which is why the struct type for problematic_column could not be parsed.
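
    If the struct had held data worth keeping, an alternative might have been renaming the nested fields to valid identifiers before recreating the table (a sketch; the f1/f2/f3 aliases are made up):

    import org.apache.spark.sql.functions.{col, struct}

    // Rebuild the struct with Hive-friendly field names instead of "1", "2", "3"
    val renamed = my_table.withColumn(
      "problematic_column",
      struct(
        col("problematic_column").getField("1").as("f1"),
        col("problematic_column").getField("2").as("f2"),
        col("problematic_column").getField("3").as("f3")
      )
    )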

    Given that this struct did not contain valuable info, I ended up dropping it and creating the table again. Now it works like a charm and it has the correct (Parquet) config in Glue.
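
    The workaround, roughly (the _clean suffix on the target path is illustrative; writing the cleaned data back to the exact path being read would clobber the source files mid-read):

    // Drop the struct, rewrite the data to a fresh prefix, then point the table at it
    val cleaned = spark.read.parquet(path).drop("problematic_column")
    val cleanPath = "s3://bucket_name/databases/my_db/my_problematic_table_clean/"
    cleaned.write.mode("overwrite").parquet(cleanPath)

    spark.sql(s"CREATE TABLE IF NOT EXISTS my_db.my_problematic_table USING PARQUET LOCATION '$cleanPath'")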

    Hope this helps anyone