I have a dataframe:
data = [['p1', 't1'], ['p4', 't2'], ['p2', 't1'],['p4', 't3'],
['p4', 't3'], ['p3', 't1'],]
sdf = spark.createDataFrame(data, schema = ['id', 'text'])
sdf.show()
+---+----+
| id|text|
+---+----+
| p1| t1|
| p4| t2|
| p2| t1|
| p4| t3|
| p4| t3|
| p3| t1|
+---+----+
I want to assign a rank based on the text column: rows with the same text should get the same rank. For example:
+---+----+----+
| id|text|rank|
+---+----+----+
| p1| t1| 1|
| p2| t1| 1|
| p3| t1| 1|
| p4| t2| 2|
| p4| t3| 3|
| p4| t3| 3|
+---+----+----+
Unfortunately, the rank function does not give what I need:
from pyspark.sql import functions as F, Window

w = Window.partitionBy("text").orderBy("id")
sdf2 = sdf.withColumn("rank", F.rank().over(w))
sdf2.show()
+---+----+----+
| id|text|rank|
+---+----+----+
| p1| t1| 1|
| p2| t1| 2|
| p3| t1| 3|
| p4| t2| 1|
| p4| t3| 1|
| p4| t3| 1|
+---+----+----+
It seems you are not looking to rank your observations within a group, but to convert a categorical variable into a numeric one. You can do this with StringIndexer:
from pyspark.ml.feature import StringIndexer
from pyspark.sql import functions as F

indexer = StringIndexer(inputCol='text', outputCol='rank', stringOrderType='alphabetAsc')
indexer_fitted = indexer.fit(sdf)
sdf = indexer_fitted.transform(sdf)
sdf = sdf.withColumn('rank', F.col('rank').cast('int'))
sdf.show()
+---+----+----+
| id|text|rank|
+---+----+----+
| p1| t1| 0|
| p2| t1| 0|
| p3| t1| 0|
| p4| t2| 1|
| p4| t3| 2|
| p4| t3| 2|
+---+----+----+