apache-spark parquet spark-structured-streaming

Is it possible to change the location of the _spark_metadata folder in Spark Structured Streaming?


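import org.apache.spark.sql.functions.{col, from_json, unix_timestamp}
import org.apache.spark.sql.streaming.{OutputMode, Trigger}
import org.apache.spark.sql.types.StringType

// df, processor and config are defined elsewhere in the job (not shown).
val query = df.withColumn("value", col("value").cast(StringType))
      // Parse the JSON payload using the processor's schema.
      .withColumn("value", from_json(col("value"), processor.Schema))
      .select(unix_timestamp(col("timestamp")).alias("kafka_time"), col("value.*"))
      .filter(processor.filter)
      .transform(processor.transform)
      .writeStream
      .format("parquet")
      .partitionBy("grass_date")
      // The file sink creates _spark_metadata under this path.
      .option("path", config.savePath)
      .option("checkpointLocation", config.checkpointLocation)
      .trigger(Trigger.ProcessingTime("15 minutes"))
      .outputMode(OutputMode.Append)
      .start()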

While running a Structured Streaming job with a Parquet file sink, Spark creates a _spark_metadata folder under the job's write path. Because of this folder, partition discovery does not seem to work. So, is it possible to get rid of the _spark_metadata folder, or to change its location?

Edit 1: I am using Spark 2.4.4.

Edit 2: I can create a Hive table on config.savePath, but I can't see any data in that table. Here is what I have under savePath:

[xxx]$ hadoop fs -ls /tmp/ravi.mondal/product_click/remind_me_button
Found 2 items
drwxrwxr-x   - ravi.mondal supergroup          0 2020-05-20 12:36 /tmp/ravi.mondal/product_click/remind_me_button/_spark_metadata
drwxrwxr-x   - ravi.mondal supergroup          0 2020-05-20 12:49 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20
[xxx]$
[xxx]$ hadoop fs -ls /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20
Found 27 items
-rw-rw-r--   3 ravi.mondal supergroup       1575 2020-05-20 12:46 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00009-34ec06fb-4506-4e73-963b-4441bd00410d.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1798 2020-05-20 12:31 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00017-e0d550b4-225c-44d5-a539-1e4e38a1069e.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1681 2020-05-20 11:46 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00023-9caf4a09-6c99-482b-9212-f03513c80070.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1561 2020-05-20 12:32 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00028-493b6d84-9638-4428-a0c7-99252d2efcd5.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1737 2020-05-20 12:32 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00032-4a72a3f3-a221-4071-b4f5-a49d16aadbba.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1773 2020-05-20 12:47 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00036-dca34760-861f-45f8-8ce0-51feb5ac2768.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1539 2020-05-20 11:47 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00042-cc062316-2afd-49c2-9ad8-8709693b2986.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1584 2020-05-20 12:47 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00048-9432d414-2aaa-424a-84b8-cd4364fa4e87.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1665 2020-05-20 12:17 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00049-a8c3f0f0-80f5-4690-a928-1f2108aa39df.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1656 2020-05-20 11:30 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00051-0c016684-cf71-4681-b1cd-fcb325452e89.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1825 2020-05-20 12:49 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00063-2dc3d00d-46ed-41cc-b189-2ed475ed5c5c.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1584 2020-05-20 12:20 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00065-70b9e314-8292-4e48-81c4-e3b983977563.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1629 2020-05-20 12:50 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00065-bfed91f6-1398-4038-aee7-56cb0cf87414.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1584 2020-05-20 12:18 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00074-4beb1880-2bc0-4001-9684-546e240b6888.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1665 2020-05-20 12:49 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00075-adbc8782-7b6f-4dbd-a8f8-e878648b1ff2.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1584 2020-05-20 11:31 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00081-9e56a444-161f-43d8-9e50-bf24c6484d83.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1688 2020-05-20 11:49 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00081-f246df73-8db5-49f4-9682-9a12bdeb0b5a.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1656 2020-05-20 11:30 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00083-8e0fdecb-8d0b-49d5-8e93-6edeee1539fc.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1656 2020-05-20 12:49 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00092-b292d9ed-ce41-4426-833d-38f994af87d4.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1665 2020-05-20 12:05 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00105-59bf04c1-b79f-42f1-995d-f3673486886d.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1823 2020-05-20 12:05 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00108-00f0fc98-4e10-43c5-b5b3-9e0a10a7db03.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1737 2020-05-20 12:51 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00109-1389f070-e430-4246-95da-d2d4606b46ec.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1672 2020-05-20 12:20 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00109-b8e42728-ef8c-49d9-8451-aab55e3045cc.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1584 2020-05-20 11:49 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00110-de37d04f-f26e-4a9c-872b-3b04ac8a188c.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1672 2020-05-20 11:49 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00112-b06ee506-04c1-4969-bf12-069a9a88f222.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1825 2020-05-20 12:51 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00119-f36fc943-3f97-47ff-9502-a4dbcd69b591.c000.snappy.parquet
-rw-rw-r--   3 ravi.mondal supergroup       1584 2020-05-20 12:05 /tmp/ravi.mondal/product_click/remind_me_button/grass_date=2020-05-20/part-00124-bcfcbd1c-15aa-4410-b016-719715c8e775.c000.snappy.parquet

Solution

  • Looking at the Spark source code, there is no way to change the path of the _spark_metadata directory. For your reference, I have linked the code in the Git repo where this directory is created; it is always created inside the specified output path.

    FileStreamSink Source Code
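
To illustrate the point above, here is a minimal, dependency-free sketch of how the metadata location follows from the output path. It is a simplification, not Spark's actual code: the object name MetadataPathSketch and the helper metadataPathFor are hypothetical, and the only detail taken from Spark is the fixed directory name "_spark_metadata". Because the name is a constant joined onto the sink's path, no writer option feeds into it.

```scala
object MetadataPathSketch {
  // Fixed directory name used by the file sink; it is not read from any option.
  val metadataDir: String = "_spark_metadata"

  // Hypothetical helper: joins the sink's output path and the fixed directory name.
  def metadataPathFor(outputPath: String): String =
    outputPath.stripSuffix("/") + "/" + metadataDir

  def main(args: Array[String]): Unit = {
    // Path taken from the listing in the question.
    println(metadataPathFor("/tmp/ravi.mondal/product_click/remind_me_button"))
  }
}
```

Under this sketch, the only way to move the metadata folder is to change the output path itself, which matches the listing in the question: _spark_metadata sits next to the grass_date=... partition directories.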