Tags: python, apache-spark, pyspark, rdd

Get the sum and count of RDD values by key using groupBy?


I have the following RDD:

[(1, 300), (4, 60), (4, 20), (2, 2), (2, 3), (2, 5)]

My expected RDD is:

[(1, [300, 1]), (2, [10, 3]), (4, [80, 2])]

The first value in the list within each tuple is the sum of the values for that key (e.g. for key 2 it's 2 + 3 + 5 = 10), and the second value is the number of occurrences of that key (e.g. key 1 occurs once). Can the expected RDD be achieved using the groupBy function?


Solution

  • You can map each value to a list [x, 1], then sum the lists element-wise for each key with reduceByKey:

    # Assumes an active SparkContext `sc` (e.g. from the pyspark shell).
    rdd = sc.parallelize([(1, 300), (4, 60), (4, 20), (2, 2), (2, 3), (2, 5)])
    
    # Turn each value into [value, 1]; as values with the same key are
    # reduced, the sums and counts are added pairwise.
    result = (rdd.mapValues(lambda x: [x, 1])
                 .reduceByKey(lambda a, b: [a[0] + b[0], a[1] + b[1]]))
    
    result.collect()
    # [(1, [300, 1]), (2, [10, 3]), (4, [80, 2])]
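
  • To answer the question as asked: yes, grouping works too. Below is a minimal sketch using groupByKey (the pair-RDD counterpart of groupBy). Note that reduceByKey is usually preferred, since it combines values on each partition before the shuffle, whereas groupByKey ships every individual value across the network. Ordering of the collect() output may vary:

    # Group all values per key, then compute [sum, count] in one pass.
    grouped = rdd.groupByKey().mapValues(lambda vals: [sum(vals), len(vals)])
    
    grouped.collect()
    # [(1, [300, 1]), (2, [10, 3]), (4, [80, 2])]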