Tags: dataframe, pyspark, apache-spark-sql, continuous, ident

How can I add a continuous 'Ident' column to a dataframe in PySpark, rather than using monotonically_increasing_id()?


I have a dataframe 'df', and I want to add a numeric 'Ident' column whose values are continuous. I tried monotonically_increasing_id(), but the values are not consecutive. As its documentation says: "The generated ID is guaranteed to be monotonically increasing and unique, but not consecutive."
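
Here is a minimal sketch of that behaviour (the SparkSession setup, sample data, and column names are made up just for illustration):

    from pyspark.sql import SparkSession
    from pyspark.sql.functions import monotonically_increasing_id

    spark = SparkSession.builder.getOrCreate()
    df = spark.createDataFrame([("a",), ("b",), ("c",)], ["value"])

    # The generated ID encodes the partition index in its upper bits, so on a
    # multi-partition dataframe the values jump (e.g. 0, 8589934592, ...)
    # rather than running 0, 1, 2.
    df.withColumn("Ident", monotonically_increasing_id()).show()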

So, my question is, how could I do that?


Solution

  • You could try something like this:

    # Pair each row with a consecutive 0-based index and prepend it as 'Ident'.
    df = df.rdd.zipWithIndex().map(lambda x: [x[1]] + list(x[0])).toDF(['Ident'] + df.columns)
    

    This gives you the first column as your identifier, with consecutive values from 0 to N-1, where N is the total number of records in df.
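
    For a fully self-contained illustration, here is a minimal sketch of the same zipWithIndex() approach; the SparkSession setup, sample data, and column names are assumptions made for the example.

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("ident-example").getOrCreate()

    # Hypothetical sample dataframe
    df = spark.createDataFrame(
        [("Alice", 34), ("Bob", 45), ("Carol", 29)],
        ["name", "age"],
    )

    # zipWithIndex() pairs each row with a consecutive 0-based index; prepend
    # that index to the row's fields and rebuild the dataframe with 'Ident'
    # as the first column.
    df_with_ident = (
        df.rdd.zipWithIndex()
          .map(lambda x: [x[1]] + list(x[0]))
          .toDF(["Ident"] + df.columns)
    )

    df_with_ident.show()  # 'Ident' runs 0 .. N-1 alongside the original columns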