Search code examples
sqlapache-sparkpysparkhive

Performance of partitioned view of an unpartitioned table


I have an unpartitioned hive table which is used to create a partitioned view. The table has just some metadata columns and the actual data is stored in an array which makes querying difficult. Hence the data is exploded into a view which then is used for all querying purposes. This view is partitioned on the date the data arrives. In this scenario, will the performance be affected as the original table is unpartitioned? Should the original table be partitioned too?


Solution

  • If underlying tables are not partitioned, view partitioning is not useful at all.

    Of course, table should be partitioned if you want partition pruning to work, otherwise the full-scan will be performed.

    On the other hand, if the table is partitioned and a view is not partitioned and query has predicates on table partitions, optimizer is clever enough to derive partition info from the view definition, push predicates down and partition pruning works. And this makes view partitioning rather useless feature and manually managed view partitions add unnecessary complication. Better use partitioned tables and normal not partitioned views.

    Why you may need partitioned view if partition pruning works with non-partitioned view. One possible use-case is when restricted user can see only the view, not underlying tables and different tools can derive partition information from view metadata only and suggest filtering on partitions, knowing nothing about underlying tables. From the restricted user perspective, view is the same as table and they should see partitioning schema.

    See HIVE-1079:

    For the manual approach, a very simple thing we can start with is just letting users add partitions to views explicitly as a way of indicating that the underlying data is ready.

    See also HIVE-1941 and this design document

    BTW Materialized views are already implemented in Hive 3.0.0 and prtitioning of them makes much more sense because the data in them are stored accordingly to partition schema specified in DDL.