python apache-spark pyspark spark-graphx

How to create pair RDD with elements that share keys in source RDD?

I have a key-value RDD in PySpark and would like to get an RDD of pairs of values that share a key in the source RDD.

#input rdd of id and user
rdd1 = sc.parallelize([(1, "user1"), (1, "user2"), (2, "user1"), (2, "user3"), (3,"user2"), (3,"user4"), (3,"user1")])

#desired output
[("user1","user2"),("user1","user3"),("user1","user4"),("user2","user4")]

So far I have been unable to find the right combination of functions to do this. The goal is to build an edge list of users who share a common key.

Solution

As far as I understand your description, something like this should work:

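output = (rdd1
   .groupByKey()        # collect all users for each id
   .mapValues(set)      # drop duplicate users within an id
   # emit each unordered pair once, ordered so that x < y
   .flatMap(lambda kvs: [(x, y) for x in kvs[1] for y in kvs[1] if x < y])
   .distinct())         # drop pairs that occur under more than one id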

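Unfortunately, this is a rather expensive operation: groupByKey shuffles every value across the cluster, and the number of pairs generated for each key grows quadratically with the number of users sharing that key.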
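To check the pairing logic without a Spark context, the same steps can be mirrored in plain Python. This is only a sketch under the assumption that the sample data fits in memory; the names `pairs`, `users_by_key`, `edges` and `edge_list` are made up for illustration and are not part of the Spark code above.

```python
from collections import defaultdict
from itertools import combinations

# Same sample data as rdd1 in the question, as a plain list
pairs = [(1, "user1"), (1, "user2"), (2, "user1"), (2, "user3"),
         (3, "user2"), (3, "user4"), (3, "user1")]

# Plain-Python counterpart of groupByKey().mapValues(set)
users_by_key = defaultdict(set)
for key, user in pairs:
    users_by_key[key].add(user)

# Counterpart of flatMap(...).distinct(): every unordered pair of users
# under a key, collected into a set so repeated pairs are kept once
edges = set()
for users in users_by_key.values():
    # sorting first means each emitted tuple (x, y) satisfies x < y
    edges.update(combinations(sorted(users), 2))

edge_list = sorted(edges)
print(edge_list)
```

For the sample data this yields the four edges listed as the desired output in the question. If it reads better, `combinations(sorted(kvs[1]), 2)` could probably stand in for the nested comprehension inside the `flatMap` as well, since it yields the same ordered pairs.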
