The code below produces random numbers that differ per row, as expected. So far, so good. But I am apparently missing some basic aspect in my thinking.
from pyspark.sql import functions as F
df = spark.range(10).withColumn("randomNum", F.rand())
df.show(truncate=False)
returning:
+---+-------------------+
|id |randomNum |
+---+-------------------+
|0 |0.8128581612050234 |
|1 |0.40656852491856355|
|2 |0.9444869347865689 |
|3 |0.10391423680687417|
|4 |0.05285485891027453|
|5 |0.5140906081158558 |
|6 |0.900727341820192 |
|7 |0.11046600268909801|
|8 |0.6509183512961298 |
|9 |0.5060097759646045 |
+---+-------------------+
Then, invoking show() (the Action) again in a second cell:

df.show(truncate=False)

why do we get the same random number sequence as the first time round? Is show() overriding the usual Action semantics because it sees that it is the same DataFrame? If so, that is not true for all methods in PySpark. I am running this in two cells in a Databricks notebook.
Looking at the Spark UI, the same seed is used twice. Why? This determinism seems to be at odds with the concept of an Action as we are taught.
This is the result of the random generator being initialized only once per partition (source). Hence, as long as the partition layout does not change, the same initial seed is used on every consecutive execution, and rand generates the same sequence.
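A minimal sketch of the replay, assuming a live SparkSession named spark (as in a Databricks notebook):

from pyspark.sql import functions as F

df = spark.range(10).withColumn("randomNum", F.rand())

# The unseeded rand() fixes its seed once, when the expression is created.
# Every action re-runs the plan, and each partition re-initializes its RNG
# from (seed + partition index), so the same numbers come back each time.
df.show(truncate=False)
df.show(truncate=False)  # identical output to the first call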
When the partitions are NOT fixed, rand's behavior becomes non-deterministic, which is documented (in the source, at least) via SPARK-13380.
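Conversely, a sketch under the same assumptions: changing the partition layout upstream of rand changes the values, and passing your own per-run seed (a hypothetical workaround, not part of the question's code) forces fresh numbers on each execution:

import random
from pyspark.sql import functions as F

# Rows now sit in different partitions before rand() is evaluated,
# so each row draws from a different (seed + partition index) stream.
spark.range(10).repartition(3).withColumn("randomNum", F.rand()).show(truncate=False)

# Generate a new seed in the driver on every cell execution, so that
# consecutive runs produce different sequences.
spark.range(10).withColumn(
    "randomNum", F.rand(seed=random.randint(0, 2**31 - 1))
).show(truncate=False)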