Search code examples
sqlclouderaimpalahue

Impala last_value Returning Inconsistent Result in View


I have some data formatted in the following manner and I am having issues creating an Impala view with the correct output. Lets call this table h_test_daily:

+--------+-----------+-------+
| Tag_ID | Date      | Value |
+--------+-----------+-------+
| 1      | 7/10/2017 | 10    |
| 1      | 7/10/2017 |       |
| 1      | 7/9/2017  | 8     |
| 2      | 7/10/2017 | 20    |
| 2      | 7/10/2017 | 16    |
| 2      | 7/9/2017  | 9     |
| 3      | 7/10/2017 |       |
| 3      | 7/10/2017 | 10    |
| 3      | 7/9/2017  | 5     |
| 4      | 7/10/2017 | 13    |
| 4      | 7/10/2017 |       |
| 4      | 7/9/2017  | 18    |
+--------+-----------+-------+

I am trying to perform a last_value() aggregate on it using Impala (On a HUE server). Now performing the following SQL works fine:

SELECT DISTINCT Tag_ID, `Date`, Value FROM
(SELECT Tag_ID, 
`Date`,
last_value(Value IGNORE NULLS) over (PARTITION BY Tag_ID, `Date` 
ORDER BY `Date` ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as Value
FROM test.h_test_daily)
TAB
;

The output from this is correct. I get the last value for each Tag, for each date.

+--------+-----------+----------+
| Tag_ID |     Date  |    Value |
+--------+-----------+----------+
|      3 | 7/10/2017 |       10 |
|      4 | 7/10/2017 |       13 |
|      1 | 7/10/2017 |       10 |
|      1 | 7/9/2017  |        8 |
|      4 | 7/9/2017  |       18 |
|      2 | 7/9/2017  |        9 |
|      3 | 7/9/2017  |        5 |
|      2 | 7/10/2017 |       16 |
+--------+-----------+----------+

So when I try to create a View using that SQL it does something weird. I start to see null values. Where it is getting the last value and now looking to see if it is null or not. This problem DOES NOT happen with a table. I am hoping I have done something wrong and that creating the view only shows it.

Here is the output of the view:

+--------+-----------+----------+
| Tag_ID |     Date  |    Value |
+--------+-----------+----------+
|      3 | 7/10/2017 | 10       |
|      1 | 7/9/2017  | 8        |
|      4 | 7/9/2017  | 18       |
|      4 | 7/10/2017 | NULL     |
|      1 | 7/10/2017 | NULL     |
|      2 | 7/9/2017  | 9        |
|      3 | 7/9/2017  | 5        |
|      2 | 7/10/2017 | 16       |
+--------+-----------+----------+

EDIT: I have made a whole bunch of changes and I am still encountering the same issue. Here is what I have tried:

Added column named mins

As Gordon pointed out, my ordering was wrong. Each partition had the same order-by key. The table is data from a few sensors, it is updated every hour with an entire hour of data. So I added a mins column for each row. (This is unique when combined with Tag_ID)

Here is the updated table and SQL

+--------+-----------+-------+------+
| Tag_ID |   Date    | Value | mins |
+--------+-----------+-------+------+
|      1 | 7/9/2017  |     8 |   10 |
|      1 | 7/10/2017 |    10 |   10 |
|      1 | 7/10/2017 |    13 |   11 |
|      1 | 7/10/2017 |       |   12 |
|      2 | 7/9/2017  |     9 |   10 |
|      2 | 7/10/2017 |    15 |   10 |
|      2 | 7/10/2017 |    16 |   11 |
|      3 | 7/9/2017  |     5 |   10 |
|      3 | 7/10/2017 |       |   10 |
|      3 | 7/10/2017 |    10 |   11 |
|      4 | 7/9/2017  |    18 |   10 |
|      4 | 7/10/2017 |    13 |   10 |
|      4 | 7/10/2017 |       |   11 |
+--------+-----------+-------+------+

Now for the select statement and create view I ran into the same issues. Here is the updated SQL statement:

SELECT DISTINCT Tag_ID, `Date`, Value FROM
(SELECT Tag_ID, 
`Date`,
last_value(Value ignore nulls) over (PARTITION BY Tag_ID, `Date` ORDER BY 
mins ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING) as Value
FROM test.h_test_daily)
TAB
;

I then tried to filter out the null values in the sub-select, so I added a WHERE Value IS NOT NULL. The problem with this, in the actual data set, there are about 40 different Value columns giving different information. So that column would get removed from the query even though one of the other values is not null.

FINAL EDIT: I contacted Cloudera's support team to clarify on this issue. They have gotten back and said this is an Impala issue. As they were not considering the ignore nulls argument on the create view. This issue is planned to be fixed in 2.10.0. Here is the Bug Thread: Link


Solution

  • This is a problem with neither the view nor the table. The problem is with your query. You have order by date, but you are partitioning by the date as well. So, the date is constant for the partition.

    Sorts are unstable in SQL. That means that when keys are the same, they can appear in any order -- and they do. Why? The reason is simple: SQL tables represent unordered sets. There is no default sort order.

    The fact is, the query doesn't work in either context. It just happens to look like it works in one case, but not the other. You need another key to make the sort stable, so the same previous value will always be chosen.