I'm working on a table in Hive with several billion rows and over a hundred columns.
I need to coalesce the first non zero value out of 100 columns. I'm able to do this but it involves many lines of code (one line for each column). I also have to create another column that does the same in reverse to find the last non zero value which would mean at least another 100. Each column has the same naming convention so balance0, balance1, balance2 etc.
I was wondering if there was a nicer way of doing this with fewer lines of code? I've searched the web and can find lots about coalescing values but I couldn't seem to find anything that would help to cut down the lines of coded needed for this.
A simplified version of the code I'm using is below:
SELECT urn
,COALESCE( IF( balance0 <> '0', balance0, NULL )
,IF( balance1 <> '0', balance1, NULL )
,IF( balance2 <> '0', balance2, NULL )
,IF( balance3 <> '0', balance3, NULL )
,IF( balance4 <> '0', balance4, NULL )
,IF( balance5 <> '0', balance5, NULL )
,IF( balance6 <> '0', balance6, NULL )
,IF( balance7 <> '0', balance7, NULL )
,IF( balance8 <> '0', balance8, NULL )
,IF( balance9 <> '0', balance9, NULL )
,IF( balance10 <> '0', balance10, NULL )
,IF( balance11 <> '0', balance11, NULL )
,IF( balance12 <> '0', balance12, NULL )
,IF( balance13 <> '0', balance13, NULL )
,IF( balance14 <> '0', balance14, NULL )
,IF( balance15 <> '0', balance15, NULL )
,IF( balance16 <> '0', balance16, NULL )
,IF( balance17 <> '0', balance17, NULL )
,IF( balance18 <> '0', balance18, NULL )
,IF( balance19 <> '0', balance19, NULL )
,IF( balance20 <> '0', balance20, NULL )
,IF( balanceX.... etc to balance100
)
AS first_positive_balance
FROM table_x;
Thank you very much in advance for any help!
For the case you describe in the question, I don't see many shortcuts.. You could write a custom UDF (genericUDF) that could work with any number of arguments, but you'd still have to specify all the columns when you invoke the UDF.
For the case in the comment (coalesce many elements of a struct) you could write a custom UDF that receives only the struct as a parameter. The hive struct is actually represented as Object[] so it would be easy to implement any function on the struct elements no matter how many there are.
Here's an example of a genericUDF that receives a list of structs as a parameter.