I have a dataframe:
data = [['p1', 't1'], ['p4', 't2'], ['p2', 't1'],['p4', 't3'],
['p4', 't3'], ['p3', 't1'],]
sdf = spark.createDataFrame(data, schema = ['id', 'text'])
sdf.show()
+---+----+
| id|text|
+---+----+
| p1| t1|
| p4| t2|
| p2| t1|
| p4| t3|
| p4| t3|
| p3| t1|
+---+----+
I want to assign a rank based on the text column: rows with the same text should get the same rank. For example:
+---+----+----+
| id|text|rank|
+---+----+----+
| p1| t1| 1|
| p2| t1| 1|
| p3| t1| 1|
| p4| t2| 2|
| p4| t3| 3|
| p4| t3| 3|
+---+----+----+
Unfortunately, the rank function does not give what I need:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("text").orderBy("id")
sdf2 = sdf.withColumn("rank", F.rank().over(w))
sdf2.show()
+---+----+----+
| id|text|rank|
+---+----+----+
| p1| t1| 1|
| p2| t1| 2|
| p3| t1| 3|
| p4| t2| 1|
| p4| t3| 1|
| p4| t3| 1|
+---+----+----+
It seems you are not looking to rank your observations within a group, but to convert a categorical variable into a numeric one. You can do this with StringIndexer:
from pyspark.ml.feature import StringIndexer
from pyspark.sql import functions as F

indexer = StringIndexer(inputCol='text', outputCol='rank', stringOrderType='alphabetAsc')
indexer_fitted = indexer.fit(sdf)
sdf = indexer_fitted.transform(sdf)
sdf = sdf.withColumn('rank', F.col('rank').cast('int'))
sdf.show()
+---+----+----+
| id|text|rank|
+---+----+----+
| p1| t1| 0|
| p2| t1| 0|
| p3| t1| 0|
| p4| t2| 1|
| p4| t3| 2|
| p4| t3| 2|
+---+----+----+