apache-spark · pyspark · rdd

pyspark - retrieve first element of rdd - top(1) vs. first()


I have to retrieve the element that satisfies condition 1. from my RDD:

[((4, 2), (6, 3), (2, 1)),
((-3, 4), (2, 1)),
((4, 2), (-3, 4)),
((2, 1), (-3, 4)),
((6, 3), (-3, 4)),
((2, 1), (6, 3), (4, 2)),
((-3, 4), (4, 2)),
((4, 2), (2, 1), (6, 3)),
((-3, 4), (6, 3))]

The result needs to match

[((2,1),(6,3),(4,2))]

I thought I could use rdd.top(1) or rdd.first(), but since top(n) sorts the elements before retrieving, top(1) will not give me the element I want.

rdd.first() gives me ...

((4, 2), (6, 3), (2, 1))

(A. Can you explain the reason for the different results?) Solved.

B. Can you help me retrieve the needed result? It needs to be an RDD; the order of the elements does not matter, but the brackets/structure do.
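
For reference on A.: top(n) sorts the elements by their natural (descending) ordering and returns the n largest in a list, while first() simply returns the first element of the first non-empty partition, unwrapped. A minimal sketch reproducing both results, assuming the RDD is built with parallelize (the session setup and variable names are illustrative):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.master("local[*]").getOrCreate()

    data = [((4, 2), (6, 3), (2, 1)),
            ((-3, 4), (2, 1)),
            ((4, 2), (-3, 4)),
            ((2, 1), (-3, 4)),
            ((6, 3), (-3, 4)),
            ((2, 1), (6, 3), (4, 2)),
            ((-3, 4), (4, 2)),
            ((4, 2), (2, 1), (6, 3)),
            ((-3, 4), (6, 3))]
    rdd = spark.sparkContext.parallelize(data)

    rdd.first()  # ((4, 2), (6, 3), (2, 1)) -- a bare tuple, no surrounding list
    rdd.top(1)   # [((6, 3), (-3, 4))]      -- largest element by tuple ordering, in a list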


Solution

  • As far as I can tell, you just need the first element from the RDD.

    This can be achieved using rdd.take(1), but note that it returns a list, not an RDD:

    rdd.take(1)
    # [((4, 2), (6, 3), (2, 1))]
    

    However, if you want the first element as an RDD, you can parallelize it:

    first_element_rdd = spark.sparkContext.parallelize(rdd.take(1))
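
    To confirm the round trip gives back an RDD with the required structure (names follow the snippet above):

    type(first_element_rdd)
    # <class 'pyspark.rdd.RDD'>
    first_element_rdd.collect()
    # [((4, 2), (6, 3), (2, 1))]

    Note that take(1) pulls the element to the driver and parallelize redistributes it; for a single element that round trip is cheap.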
    

    I haven't seen anything else that does this directly, so let me know if someone has something better.
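
    For completeness, a hypothetical alternative: if the condition from the question can be expressed as a predicate, filter keeps the result as an RDD without the take/parallelize round trip. A sketch under that assumption (the predicate below is a stand-in, since the actual "condition 1." is not shown):

    # Stand-in predicate: exact equality with the target element.
    # The asker's real condition 1. would replace this.
    target = ((2, 1), (6, 3), (4, 2))
    result_rdd = rdd.filter(lambda x: x == target)
    result_rdd.collect()
    # [((2, 1), (6, 3), (4, 2))]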