sql, hadoop, hive, cloudera, impala

Create a date-limited view on a Hive table containing complex types in a way that is queryable with Impala?


I have a very large parquet table containing nested complex types such as structs and arrays. I have partitioned it by date and would like to restrict certain users to, say, the latest week of data.

The usual way of doing this would be to create a time-limited view on top of the table, e.g.:

    CREATE VIEW time_limited_view AS
    SELECT * FROM my_table
    WHERE partition_date >= '2020-01-01';

This will work fine when querying the view in Hive. However, if I try to query this view from Impala, I get an error:

**AnalysisException: Expr 'my_table.struct_column' in select list returns a complex type**

The reason for this is that Impala does not allow complex types in the select list, so any view I build that selects the complex columns will fail with errors like this. Flattening/unnesting the complex types would of course get around the issue, but due to the layers of nesting involved I would like to keep the table structure as is.
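For context, this is roughly what the unnesting looks like in Impala: complex columns cannot appear in the select list, but array elements can be exposed by joining the table with the collection in the FROM clause. A sketch, assuming my_table has a hypothetical column nested_col of type ARRAY<STRUCT<x:INT, y:STRING>>:

    -- Join the table with its own nested collection; struct fields of the
    -- array elements are then referenced with dot notation on the alias.
    SELECT t.partition_date, n.x, n.y
    FROM my_table t, t.nested_col n
    WHERE t.partition_date >= '2020-01-01';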

I have seen another suggested workaround of using Ranger row-level filtering, but I do not have Ranger and will not be able to install it on the cluster. Any suggestions on Hive/Impala SQL workarounds would be appreciated.


Solution

  • While working on a different problem I came across a kind of solution that fits my needs (but is by no means a general solution). I figured I'd post it in case anyone has similar needs.

    Rather than using a view, I can simply use an external table. First, I create a table in database_1 using Hive, with a corresponding location, location_1, in HDFS. This is my "production" database/table, which I use for ETL and which contains a very large amount of data. Only certain users have access to this database.

    CREATE TABLE database_1.tablename (
      `col_1` BIGINT,
      `col_2` ARRAY<STRUCT<X:INT, Y:STRING>>
    )
    PARTITIONED BY (`date_col` STRING)
    STORED AS PARQUET
    LOCATION 'location_1';
    

    Next, I create a second, external table at the same HDFS location. However, this table lives in a database (database_2) with a much broader user group.

    CREATE EXTERNAL TABLE database_2.tablename (
      `col_1` BIGINT,
      `col_2` ARRAY<STRUCT<X:INT, Y:STRING>>
    )
    PARTITIONED BY (`date_col` STRING)
    STORED AS PARQUET
    LOCATION 'location_1';
    

    Since this is an external table, I can add/drop date partitions at will without affecting the underlying data. I can add one week's worth of date partitions to the metastore and, as far as end users can tell, that is all that is available in the table. I can even make this part of my ETL job: each time new data is added, I add that partition to the external table and drop the partition from a week ago (as sketched below), giving this user group a rolling window of one week's data without having to duplicate a load of data to a separate location.
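    That partition maintenance step might look roughly like this (the date values are placeholders, and the REFRESH only applies if the broader group queries through Impala):

    -- Expose the newly landed partition to the broader user group.
    ALTER TABLE database_2.tablename ADD IF NOT EXISTS PARTITION (date_col = '2020-01-08');

    -- Retire the oldest partition. Because the table is external, this only
    -- removes metadata; the Parquet files in location_1 are untouched.
    ALTER TABLE database_2.tablename DROP IF EXISTS PARTITION (date_col = '2020-01-01');

    -- Make the metadata change visible to Impala.
    REFRESH database_2.tablename;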

    This is by no means a row-filtering solution, but is a handy way to use partitions to expose a subset of data to a broader user group without having to duplicate that data in a separate location.