Consistent Hive and Impala Hash?

I am looking for a consistent way to hash something in both the Hive Query Language and the Impala Query Language where the hashing function produce the same value regardless of if it is done in Hive or in Impala. To clarify, I want something like some_hive_hash_thing(A) = some_other_impala_hash_thing(A).

For Hive, I know there is hash() which uses MD5 (or any of the commands here). For Impala, I know there is fnv_hash() which uses the FNV algorithm. I know that Hive and Impala have their own hashing functions, but they are completely different from one another.

Ideally, I am looking for a way to do fnv_hash in Hive, or a way to do MD5 in Impala. Does anyone have any suggestions?

Solution

It's so late as an answer, but let's keep it here for someone else who may find it helpful.

"A way to do MD5 in Impala" yes there is and you can use UDFs built-in function of Hive in Impala in the recent releases (I'm using CDH 5.12 and it's working well with impala 2.9, and hive 1.1)

you can find here the list of the built-in functions https://cwiki.apache.org/confluence/display/Hive/LanguageManual+UDF

or you can simply run SHOW FUNCTIONS; in your hive console beeline -u jdbc:hive2://localhost:10000

so let's do a simulation of adding the MD5 function from hive to Impala.

DESCRIBE FUNCTION md5;

To make sure the function exists and to know the input and output variables type, so here we know that md5(string) takes a string as variable and a string as a return type .

Next we need to find hive-exec jar that contains our MD5 class using the Jar command :

/opt/jdk**/bin/jar tf hive-exec-*.*.*-cdh**.jar | grep Md5

Jar command is usually in the /bin under your java repository if it's not already configured in your environment variables .

you can find hive-exec-X-X.jar file in ../lib/hive/lib/ , if you can't find it just use locate command

so the output is something like :

 /opt/jdk**/bin/jar tf hive-exec-*.*.*-cdh**.jar | grep Md5
   org/apache/hadoop/hive/ql/udf/UDFMd5.class

save that path for later but we'll replace the '/' by '.' and remove the '.class' like this : org.apache.hadoop.hive.ql.udf.UDFMd5

copy the jar file in a directory accessible by HDFS and you may rename it for a simple use 'Im gonna name it hive-exec.jar'.

cp /lib/hive/lib/hive-exec.jar  /opt/examples/

chown -R hdfs /opt/examples/

then create a place to put your jars in hdfs

sudo -u hdfs hadoop fs -mkdir /user/hive/warehouse/hive_jars

Copy your jar file to HDFS using :

sudo -u hdfs hadoop fs -copyFromLocal /opt/examples/hive-exec.jar /user/hive/warehouse/hive_jars/

so now you just have to go to Impala-shell and connect to a database then create your function using your HDFS path to the jar and the .class path we agreed earlier to convert in symbol.

   Impala-shell>use udfs;
 create function to_md5(string) returns string location '/user/hive/warehouse/hive_jars/hive-exec.jar' symbol='org.apache.hadoop.hive.ql.udf.UDFMd5';

here you go you can use it now like any Impala function :

select to_md5('test');
    | udfs.to_md5('test')             |
    +----------------------------------+
    | 098f6bcd4621d373cade4e832627b4f6 |

    show functions ;
    Query: show functions
    +-------------+----------------------+-------------+---------------+
    | return type | signature            | binary type | is persistent |
    +-------------+----------------------+-------------+---------------+
    | STRING      | to_md5(STRING)       | JAVA        | false         |