Tags: pyspark, apache-spark-sql, rdd

RDD with tuples of different sizes to DataFrame


Using PySpark map-reduce methods I created an RDD. I now want to create a DataFrame from this RDD. The RDD looks like this:

(491023, ((9,), (0.07971896408231094,), 'Debt collection'))
(491023, ((2, 14, 77, 22, 6, 3, 39, 7, 0, 1, 35, 84, 10, 8, 32, 13), (0.017180308460902963, 0.02751921818456658, 0.011887861159888378, 0.00859908577494079, 0.007521091815230704, 0.006522044953782423, 0.01032297079810829, 0.018976833302472455, 0.007634289723749076, 0.003033975857850723, 0.018805184361326378, 0.011217892399539534, 0.05106916198426676, 0.007901136066759178, 0.008895262042995653, 0.006665649645210911), 'Debt collection'))
(491023, ((36, 12, 50, 40, 5, 23, 58, 76, 11, 7, 65, 0, 1, 66, 16, 99, 98, 45, 13), (0.007528732561416072, 0.017248902490279026, 0.008083896178333739, 0.008274896865005982, 0.0210032206108319, 0.02048387345320946, 0.010225319903418824, 0.017842961406992965, 0.012026753813481164, 0.005154201637708568, 0.008274127579967948, 0.0168843021403551, 0.007416385430301767, 0.009257236955148311, 0.00590385362565239, 0.011031745337733267, 0.011076277004617665, 0.01575522984526745, 0.005431270081282964), 'Vehicle loan or lease'))

As you can see, my DataFrame must have 4 different columns. The first one should be the integer 491023, the second a tuple (I think DataFrames don't have a tuple type, so an array also works), the third another tuple, and the fourth a string. As you can see, my tuples have different sizes. The simplest command, rdd.toDF(), doesn't work for me. Any ideas how I can achieve that?


Solution

  • You can create your dataframe as shown below; for the variable-length columns you can pass an array (ArrayType()) / Python list

    # Passing a Python list makes val1 an array<string> column under schema inference
    df_a = spark.createDataFrame([('N110WA', ['12', '34'], 1590038340000)],
                                 ["reg", "val1", "val2"])
    

    Output

    +------+--------+-------------+
    |   reg|    val1|         val2|
    +------+--------+-------------+
    |N110WA|[12, 34]|1590038340000|
    +------+--------+-------------+
    

    Schema

    df_a.printSchema()
    root
     |-- reg: string (nullable = true)
     |-- val1: array (nullable = true)
     |    |-- element: string (containsNull = true)
     |-- val2: long (nullable = true)
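
  • For the RDD shown in the question, a minimal sketch could flatten each record into a 4-tuple and supply an explicit schema. This assumes each record has the shape (id, ((ints, ...), (floats, ...), label)) as in the sample output; the column names id / indices / weights / label are placeholders, not taken from the question. Converting the inner tuples to lists lets each row keep its own array length:

    from pyspark.sql.types import (StructType, StructField, LongType, StringType,
                                   ArrayType, IntegerType, DoubleType)

    # Placeholder column names; adjust to your own data
    schema = StructType([
        StructField("id", LongType(), True),
        StructField("indices", ArrayType(IntegerType()), True),
        StructField("weights", ArrayType(DoubleType()), True),
        StructField("label", StringType(), True),
    ])

    # Flatten (id, ((ints, ...), (floats, ...), label)) into a 4-tuple and
    # convert the variable-length tuples to lists so they become array columns
    rows = rdd.map(lambda r: (r[0], list(r[1][0]), list(r[1][1]), r[1][2]))

    df = spark.createDataFrame(rows, schema)
    df.printSchema()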