Using pyspark map-reduce methos i created an rdd. I now want to create a dataframe from this rdd. The rdd looking like this:
(491023, ((9,), (0.07971896408231094,), 'Debt collection'))
(491023, ((2, 14, 77, 22, 6, 3, 39, 7, 0, 1, 35, 84, 10, 8, 32, 13), (0.017180308460902963, 0.02751921818456658, 0.011887861159888378, 0.00859908577494079, 0.007521091815230704, 0.006522044953782423, 0.01032297079810829, 0.018976833302472455, 0.007634289723749076, 0.003033975857850723, 0.018805184361326378, 0.011217892399539534, 0.05106916198426676, 0.007901136066759178, 0.008895262042995653, 0.006665649645210911), 'Debt collection'))
(491023, ((36, 12, 50, 40, 5, 23, 58, 76, 11, 7, 65, 0, 1, 66, 16, 99, 98, 45, 13), (0.007528732561416072, 0.017248902490279026, 0.008083896178333739, 0.008274896865005982, 0.0210032206108319, 0.02048387345320946, 0.010225319903418824, 0.017842961406992965, 0.012026753813481164, 0.005154201637708568, 0.008274127579967948, 0.0168843021403551, 0.007416385430301767, 0.009257236955148311, 0.00590385362565239, 0.011031745337733267, 0.011076277004617665, 0.01575522984526745, 0.005431270081282964), 'Vehicle loan or lease'))
As you can see in my dataframe i will must have 4 different columns. The first one should be the Int 491023, the second a tuple (i think dataframes don't have tuple type, so array also works), third another tuple and fourth a string. As you can see my tuples have different sizes.
The simplest command rdd.toDF()
don't work for me. Any ideas how can i achieve that?
You can create your dataframe like below , eventually you can pass an array(ArrayType())/list
from pyspark.sql import functions as F
df_a = spark.createDataFrame([('N110WA',['12','34'],1590038340000)],[ "reg","val1","val2"])
Output
+------+--------+-------------+
| reg| val1| val2|
+------+--------+-------------+
|N110WA|[12, 34]|1590038340000|
+------+--------+-------------+
Schema
df_a.printSchema()
root
|-- reg: string (nullable = true)
|-- val1: struct (nullable = true)
| |-- _1: string (nullable = true)
| |-- _2: string (nullable = true)
|-- val2: long (nullable = true)