Search code examples
sqlhadoophiveuser-defined-functions

Select specific value from Hive array


I have a table in Hive structured with 3 columns as follows;

timestamp   UserID  OtherId    
2016-09-01  123     "101","222","321","987","393.1","090","467","863"
2016-09-01  124     "188","389","673","972","193","100","143","210"
2016-09-01  125     "888","120","482","594","393.2"
2016-09-01  126     "441","501","322","671","008","899"
2016-09-01  127     "004","700","393.4","761","467","356","643","578"
2016-09-01  128     "322","582","348"
2016-09-01  129     "029","393.8","126","187"

Where OtherID is an array.

I need to parse OtherID so that the resultant dataset is the following, since I am only interested in values which contain '393%'

timestamp   UserID  OtherId    
2016-09-01  123     393.1
2016-09-01  125     393.2
2016-09-01  127     393.4
2016-09-01  129     393.8

I have researched a ton of parse functions but it seems they're all intended to return the position of the value, or you need to specify the position of the value to return it. Both of these options do not work here because '3309%' can occur at any point in the array for any given row. There's also the fact that I need to incorporate the wildcard to allow for variations of my desired value.

Another option is explode but my table is simply too large for that option.

I'm thinking a UDF might be the only way to go but would welcome some guidance there.

Grateful for any assistance.


Solution

  • It's easy do what you need using the lateral view option available in hive.

    0: jdbc:hive2://quickstart:10000/default> select * from test_5; 
    +-----------+------------+----------------------------------------------+
    | test_5.t  | test_5.id  |                  test_5.oid                  |
    +-----------+------------+----------------------------------------------+
    | 123       | 123        | "222","321","987","393.1","090","467","863"  |
    +-----------+------------+----------------------------------------------+
    

    And this is the trick:

    SELECT id, ooid
    FROM test_5 
    LATERAL VIEW EXPLODE(SPLIT(oid,",")) temp AS ooid;
    
    +------+----------+
    |  id  |   ooid   |
    +------+----------+
    | 123  | "222"    |
    | 123  | "321"    |
    | 123  | "987"    |
    | 123  | "393.1"  |
    | 123  | "090"    |
    | 123  | "467"    |
    | 123  | "863"    |
    +------+----------+
    

    Ergo:

    SELECT id, regexp_replace(ooid,'"','')
    FROM test_5 
    LATERAL VIEW EXPLODE(SPLIT(oid,",")) temp AS ooid;
    WHERE ooid LIKE '"393%';
    
    +------+----------+
    |  id  |   ooid   |
    +------+----------+
    | 123  |  393.1   |
    +------+----------+