I have a Dataframe like below
+----+-----+--------------------+
|test|count| support|
+----+-----+--------------------+
| A| 5| 0.23809523809523808|
| B| 5| 0.23809523809523808|
| C| 4| 0.19047619047619047|
| K| 2| 0.09523809523809523|
| G| 2| 0.09523809523809523|
| L| 1|0.047619047619047616|
| D| 1|0.047619047619047616|
| F| 1|0.047619047619047616|
+----+-----+--------------------+
I want to select only the top 75% of records from the given DataFrame in PySpark, i.e.
+----+-----+--------------------+
|test|count| support|
+----+-----+--------------------+
| A| 5| 0.23809523809523808|
| B| 5| 0.23809523809523808|
| C| 4| 0.19047619047619047|
| K| 2| 0.09523809523809523|
| G| 2| 0.09523809523809523|
| L| 1|0.047619047619047616|
+----+-----+--------------------+
You could calculate the size of the DataFrame, multiply it by 0.75, and pass the result to the limit function. It would look like this:
df75 = df.limit(int(df.count() * 0.75))
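Note that limit on its own does not guarantee which rows you get unless the DataFrame has an explicit ordering, so if "top" means the rows with the highest support, it is safer to sort first. Below is a minimal sketch, assuming a DataFrame with the test, count, and support columns shown above:

from pyspark.sql import SparkSession
from pyspark.sql import functions as F

spark = SparkSession.builder.getOrCreate()

# Sample data matching the DataFrame in the question
data = [
    ("A", 5, 0.23809523809523808), ("B", 5, 0.23809523809523808),
    ("C", 4, 0.19047619047619047), ("K", 2, 0.09523809523809523),
    ("G", 2, 0.09523809523809523), ("L", 1, 0.047619047619047616),
    ("D", 1, 0.047619047619047616), ("F", 1, 0.047619047619047616),
]
df = spark.createDataFrame(data, ["test", "count", "support"])

# Take 75% of the row count, ordering by support so the highest values are kept
n_rows = int(df.count() * 0.75)   # 8 * 0.75 = 6
df75 = df.orderBy(F.col("support").desc()).limit(n_rows)
df75.show()

With the sample data this keeps the six rows A, B, C, K, G, L, matching the expected output in the question.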