python-2.7 · apache-spark · pyspark · apache-spark-mllib · apache-spark-sql

How to select only the top 75% of records from a DataFrame in PySpark?


I have a Dataframe like below

+----+-----+--------------------+
|test|count|             support|
+----+-----+--------------------+
|   A|    5| 0.23809523809523808|
|   B|    5| 0.23809523809523808|
|   C|    4| 0.19047619047619047|
|   K|    2| 0.09523809523809523|
|   G|    2| 0.09523809523809523|
|   L|    1|0.047619047619047616|
|   D|    1|0.047619047619047616|
|   F|    1|0.047619047619047616|
+----+-----+--------------------+

I want to select only the top 75% of records from the given dataframe in PySpark, i.e.

+----+-----+--------------------+
|test|count|             support|
+----+-----+--------------------+
|   A|    5| 0.23809523809523808|
|   B|    5| 0.23809523809523808|
|   C|    4| 0.19047619047619047|
|   K|    2| 0.09523809523809523|
|   G|    2| 0.09523809523809523|
|   L|    1|0.047619047619047616|
+----+-----+--------------------+

Solution

  • You could calculate the size of the dataframe, multiply it by 0.75, and use the limit function. It would look like this:

    df75 = df.limit(int(df.count() * 0.75))
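
For reference, here is a minimal end-to-end sketch of this approach using the sample data from the question. The explicit orderBy on support is an assumption added here so that "top" is well defined, since limit by itself just takes the first rows in whatever order the dataframe happens to be in:

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("top75").getOrCreate()

    # Sample data from the question
    data = [
        ("A", 5, 0.23809523809523808), ("B", 5, 0.23809523809523808),
        ("C", 4, 0.19047619047619047), ("K", 2, 0.09523809523809523),
        ("G", 2, 0.09523809523809523), ("L", 1, 0.047619047619047616),
        ("D", 1, 0.047619047619047616), ("F", 1, 0.047619047619047616),
    ]
    df = spark.createDataFrame(data, ["test", "count", "support"])

    # Sort by support (descending) so "top" is well defined, then keep the
    # first 75% of the rows: int(8 * 0.75) == 6 rows for this dataframe.
    df75 = df.orderBy(df["support"].desc()).limit(int(df.count() * 0.75))
    df75.show()

Note that df.count() triggers a full pass over the data, so for large dataframes this costs an extra job before the limit is applied.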