Tags: apache-spark, hive, apache-spark-sql, parquet

Query data in subdirectories in Hive Partitions using Spark SQL


How can I force Spark SQL to recursively read data stored in Parquet format from subdirectories? In Hive, I can achieve this by setting a few Hive configs:

set hive.input.dir.recursive=true;
set hive.mapred.supports.subdirectories=true;
set hive.supports.subdirectories=true;
set mapred.input.dir.recursive=true;

I tried to set these configs through Spark SQL queries, but I always get 0 records, whereas Hive returns the expected results. I also put these configs in the hive-site.xml file, but nothing changed. How can I handle this issue?
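For reference, the failing attempt looks roughly like this (a sketch; the app name and `partitioned_table` are placeholders, and it assumes a Hive-enabled Spark 2.x session):

```python
from pyspark.sql import SparkSession

# Hive-enabled session (Spark 2.x); names below are placeholders
spark = SparkSession.builder \
    .appName("recursive-read") \
    .enableHiveSupport() \
    .getOrCreate()

# Setting the Hive recursion configs through Spark SQL...
spark.sql("set hive.input.dir.recursive=true")
spark.sql("set hive.mapred.supports.subdirectories=true")
spark.sql("set hive.supports.subdirectories=true")
spark.sql("set mapred.input.dir.recursive=true")

# ...still yields 0 rows for a Parquet table whose partition
# directories contain further subdirectories, because Spark's
# native Parquet reader does not honour these Hive settings
spark.sql("SELECT COUNT(*) FROM partitioned_table").show()
```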

Spark version: 2.1.0, with Hive 2.1.1 on emr-5.3.1.

By the way, this issue only appears with Parquet files; with JSON it works fine.


Solution

  • One solution to this problem is to force Spark to use the Hive Parquet reader by using a Hive-enabled context, which makes Spark able to read the files recursively.
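A minimal sketch of that approach (the app name and `my_table` are placeholders): setting `spark.sql.hive.convertMetastoreParquet` to `false` tells Spark to hand Parquet tables to Hive's SerDe instead of its native Parquet reader, so the Hive recursive-input settings take effect.

```python
from pyspark.sql import SparkSession

# Hive-enabled session; app and table names are placeholders
spark = SparkSession.builder \
    .appName("recursive-parquet") \
    .enableHiveSupport() \
    .getOrCreate()

# Fall back to Hive's Parquet SerDe instead of Spark's built-in reader
spark.conf.set("spark.sql.hive.convertMetastoreParquet", "false")

# Hive-side settings that enable recursive directory listing
spark.sql("set mapred.input.dir.recursive=true")
spark.sql("set hive.mapred.supports.subdirectories=true")

# Rows stored in subdirectories of the partition paths are now visible
df = spark.sql("SELECT * FROM my_table")
df.show()
```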