apache-spark, apache-spark-sql, apache-drill

Use directories for partition pruning in Spark SQL


I have data files (JSON in this example, but it could also be Avro) written in a directory structure like:

dataroot
+-- year=2015
    +-- month=06
        +-- day=01
            +-- data1.json
            +-- data2.json
            +-- data3.json
        +-- day=02
            +-- data1.json
            +-- data2.json
            +-- data3.json
    +-- month=07
        +-- day=20
            +-- data1.json
            +-- data2.json
            +-- data3.json
        +-- day=21
            +-- data1.json
            +-- data2.json
            +-- data3.json
        +-- day=22
            +-- data1.json
            +-- data2.json

Using spark-sql I create a temporary table:

CREATE TEMPORARY TABLE dataTable
USING org.apache.spark.sql.json
OPTIONS (
  path "dataroot/*"
)

Querying the table works well, but so far I have not been able to use the directories for pruning.

Is there a way to register the directory structure as partitions (without using Hive) to avoid scanning the whole tree when I query? Say I want to compare data for the first day of every month and only read directories for these days.
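
For concreteness, the query I would like to see pruned looks something like this, assuming hypothetical year, month and day columns derived from the directory names:

SELECT year, month, COUNT(*)
FROM dataTable
WHERE day = 1
GROUP BY year, month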

With Apache Drill I can use directories as predicates at query time with dir0 etc. Is it possible to do something similar with Spark SQL?
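
For reference, here is roughly what that looks like in Drill, where dir0, dir1, dir2 map to the directory levels (so here dir0 would be 'year=2015' and dir2 'day=01'; the dfs.`dataroot` workspace path is illustrative):

SELECT COUNT(*)
FROM dfs.`dataroot`
WHERE dir2 = 'day=01'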


Solution

  • As far as I know, partition auto-discovery currently only works for Parquet files in Spark SQL. See http://spark.apache.org/docs/latest/sql-programming-guide.html#partition-discovery
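
A minimal sketch of what that would look like, assuming the same tree were rewritten as Parquet (table and path names taken from the question):

CREATE TEMPORARY TABLE dataTable
USING org.apache.spark.sql.parquet
OPTIONS (
  path "dataroot"
);

-- Partition discovery exposes year, month and day as table columns,
-- so this filter prunes directories instead of scanning the whole tree:
SELECT COUNT(*) FROM dataTable WHERE day = 1;

For JSON as-is, one manual workaround is to narrow the path option with a glob, e.g. path "dataroot/year=*/month=*/day=01" (Hadoop-style globs should be resolved by the data source); that limits which directories are scanned, but it does not give you queryable partition columns.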