apache-spark apache-spark-dataset spark-csv

Add UUID to spark dataset


I am trying to add a UUID column to my dataset.

getDataset(Transaction.class).withColumn("uniqueId", functions.lit(UUID.randomUUID().toString())).show(false);

But in the result, all the rows have the same UUID. How can I make it unique?

+-----------------------------------+
|uniqueId                           |
+-----------------------------------+
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
|1abdecf-8303-4a4e-8ad3-89c190957c3b|
+-----------------------------------+

Solution

  • Updated (Apr 2021):

    Per @ferdyh, there's a better way using the uuid() function from Spark SQL. Something like expr("uuid()") will use Spark's native UUID generator, which should be much faster and cleaner to implement.
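    As a sketch in the question's Java setup (assuming a `Dataset<Row>` named `ds` and Spark 3.0+, where `uuid()` is available in Spark SQL):

    ```java
    import static org.apache.spark.sql.functions.expr;

    // uuid() is non-deterministic, so Spark evaluates it once per row,
    // giving every row its own UUID (unlike a lit() literal, which is
    // evaluated once and repeated).
    Dataset<Row> withId = ds.withColumn("uniqueId", expr("uuid()"));
    withId.show(false);
    ```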

    Originally (June 2018):

    When you include UUID as a lit column, you're doing the same as including a string literal.

    A UUID needs to be generated for each row. You could do this with a UDF, but that can cause problems: UDFs are expected to be deterministic, and relying on randomness inside one can produce inconsistent results when caching or recomputation happens.

    Your best bet may be generating a column with the Spark function rand and using UUID.nameUUIDFromBytes to convert that to a UUID.
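    A minimal, plain-JDK sketch of that conversion (the `uuidFromRand` helper is hypothetical; in Spark it would be the body of a deterministic UDF applied to the non-deterministic `rand()` column, so the randomness lives in the built-in function rather than the UDF):

    ```java
    import java.nio.ByteBuffer;
    import java.util.UUID;
    import java.util.concurrent.ThreadLocalRandom;

    public class UuidFromRand {
        // Deterministic step: hash the per-row random double (what Spark's
        // rand() would supply) into a name-based (version 3) UUID.
        static String uuidFromRand(double rand) {
            byte[] bytes = ByteBuffer.allocate(Double.BYTES).putDouble(rand).array();
            return UUID.nameUUIDFromBytes(bytes).toString();
        }

        public static void main(String[] args) {
            double r = ThreadLocalRandom.current().nextDouble();
            // Same input always yields the same UUID; different inputs
            // yield different UUIDs, so uniqueness comes from rand().
            System.out.println(uuidFromRand(r));
            System.out.println(uuidFromRand(r).equals(uuidFromRand(r)));
        }
    }
    ```

    Because `nameUUIDFromBytes` is deterministic for a given input, the UDF wrapping it stays well-behaved under caching and recomputation.
    
    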

    Originally, I had:

    val uuid = udf(() => java.util.UUID.randomUUID().toString)
    getDataset(classOf[Transaction]).withColumn("uniqueId", uuid()).show(false)
    

    which @irbull pointed out could be an issue.