Tags: python-3.x, pyspark, rdd

A quick way to get the mean of each position in a large RDD


I have a large RDD (more than 1,000,000 rows), where each row is a tuple of four elements A, B, C, D. The first few rows of the RDD look like:

[(492,3440,4215,794),
(6507,6163,2196,1332),
(7561,124,8558,3975),
(423,1190,2619,9823)]
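
(A small RDD with this shape can be built for local testing with sc.parallelize; an existing SparkContext sc, e.g. from a pyspark shell, is assumed:)

# toy RDD of 4-tuples matching the sample above
rdd = sc.parallelize([
    (492, 3440, 4215, 794),
    (6507, 6163, 2196, 1332),
    (7561, 124, 8558, 3975),
    (423, 1190, 2619, 9823),
])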

Now I want to find the mean of each position in this RDD. For example, for the data above I need an output list with the values:

(492+6507+7561+423)/4
(3440+6163+124+1190)/4
(4215+2196+8558+2619)/4
(794+1332+3975+9823)/4

which is:

[(3745.75,2729.25,4397.0,3981.0)]

Since the RDD is very large, it is not convenient to calculate the sum of each position and then divide by the length of the RDD. Is there a quick way to get the output? Thank you very much.


Solution

  • I don't think there is anything faster than calculating the mean (or sum) for each column.
    If you are using the DataFrame API you can simply aggregate multiple columns:

    import os
    
    from pyspark.sql import functions as f
    from pyspark.sql import SparkSession
    
    # start local spark session
    spark = SparkSession.builder.getOrCreate()
    
    # helper to turn a relative path into a file:// URL
    def localpath(path):
        return 'file://' + os.path.join(os.path.abspath(os.path.curdir), path)
    
    # load as rdd of lines, then parse each line into a tuple of ints
    # (assumes one record per line in the format shown above, e.g. "(492,3440,4215,794)")
    rdd = spark.sparkContext.textFile(localpath('myPosts/'))
    rdd = rdd.map(lambda line: tuple(int(x) for x in line.strip('()').split(',')))
    
    # create data frame from rdd and average every column in one aggregation
    df = spark.createDataFrame(rdd)
    means_df = df.agg(*[f.avg(c) for c in df.columns])
    means_dict = means_df.first().asDict()
    print(means_dict)
    

    Note that the dictionary keys will be based on Spark's default column names ('_1', '_2', ...), so they come out as 'avg(_1)', 'avg(_2)', and so on. If you want more meaningful column names you can pass them as an argument to the createDataFrame command:
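
    For example, with hypothetical column names matching the question:
    
    df = spark.createDataFrame(rdd, ['A', 'B', 'C', 'D'])
    
    which makes the result keys 'avg(A)', 'avg(B)', and so on.

  • If you want to stay at the RDD level instead, a single aggregate pass can accumulate the per-position sums and the row count at the same time, so the data is scanned only once. This is a minimal sketch assuming an RDD of equal-length numeric 4-tuples as in the question:

    # accumulator layout: (per-position sums, row count); four positions as in the question
    zero = ((0, 0, 0, 0), 0)
    
    def seq_op(acc, row):
        sums, count = acc
        return tuple(s + x for s, x in zip(sums, row)), count + 1
    
    def comb_op(a, b):
        return tuple(x + y for x, y in zip(a[0], b[0])), a[1] + b[1]
    
    sums, count = rdd.aggregate(zero, seq_op, comb_op)
    means = tuple(s / count for s in sums)
    print(means)  # (3745.75, 2729.25, 4397.0, 3981.0) for the sample data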