Search code examples
pythonhadoophiveudf

How to query multiple columns when using a python udf in hive?


I am trying to execute this query:

add FILE /home/user1/test/test_udf.py;

SELECT a.hash_code, col2
FROM (SELECT transform (col2, col3) using 'python test_udf.py' as hash_code, col2
      FROM sample_table) a ;

I am able to successfully generate the hash_code using the udf but the other column (col2) is getting populated as NULL.

Sample Output:

sjhfshhalksjlkfj128798172jasjhas   NULL
ajsdlkja982988290819189089089889   NULL
jhsad817982mnsandkjsahj982398290   NULL

Solution

  • I know what's wrong with your HiveSql.

    In transform (col2, col3) using 'python test_udf.py' as hash_code, col2 FROM sample_table, the hash_code, col2's value is parsed from transform (col2, col3)'s return value.

    The clo2 is parsed from transform (col2, col3), which is NULL.

    I read the Transform doc, picked up the related info as below.

    Transform/Map-Reduce Syntax

    SELECT TRANSFORM '(' expression (',' expression)* ')'
      (inRowFormat)?
      USING 'my_reduce_script'
      ( AS colName (',' colName)* )?
      (outRowFormat)? (outRecordReader)?
    

    You'd better not mix transform with other select, as the syntax won't support.

    Update:

    There is a hack to do what you want: let the test_udf.py return hash_code\t col2. So you can parse hash_code, col2 from it. This would solve your problem.