Search code examples
hadoophashhiveidentifier

generating unique ids in hive


I have been trying to generate unique ids for each row of a table (30 million+ rows).

  • using sequential numbers obviously not does not work due to the parallel nature of Hadoop.
  • the built in UDFs rand() and hash(rand(),unixtime()) seem to generate collisions.

There has to be a simple way to generate row ids, and I was wondering of anyone has a solution.

  • my next step is just creating a Java map reduce job to generate a real hash string with a secure random + host IP + current time as a seed. but I figure I'd ask here before doing it ;)

Solution

  • Use the reflect UDF to generate UUIDs.

    reflect("java.util.UUID", "randomUUID")
    

    Update (2019)

    For a long time, UUIDs were your best bet for getting unique values in Hive. As of Hive 4.0, Hive offers a surrogate key UDF which you can use to generate unique values which will be far more performant than UUID strings. Documentation is a bit sparse still but here is one example:

    create table customer (
      id bigint default surrogate_key(),
      name string, 
      city string, 
      primary key (id) disable novalidate
    );
    

    To have Hive generate IDs for you, use a column list in the insert statement and don't mention the surrogate key column:

    -- staging_table would have two string columns.
    insert into customer (name, city) select * from staging_table;