Tags: apache-spark, dataframe, pyspark, rdd

Convert RDD to DataFrame without schema in PySpark


I'm trying to convert an RDD to a DataFrame without any schema. I tried the code below. It works, but the DataFrame columns come out shuffled.

from pyspark.sql import Row

def f(x):
    # Map each positional value to a string column index: {"0": x[0], "1": x[1], ...}
    d = {}
    for i in range(len(x)):
        d[str(i)] = x[i]
    return d

rdd = sc.textFile("test")
df = rdd.map(lambda x: x.split(",")).map(lambda x: Row(**f(x))).toDF()
df.show()

Solution

  • If you don't want to specify a schema, do not use Row in the RDD. In Spark versions before 3.0, a Row built from keyword arguments sorts its fields alphabetically by name, which is exactly why your columns come out shuffled (there is a short demo of this at the end of the answer). If you simply have a normal RDD (not an RDD[Row]), you can use toDF() directly.

    df = rdd.map(lambda x: x.split(",")).toDF()
    

    You can give names to the columns using toDF() as well; in PySpark, pass them as a list (a fully runnable version appears at the end of this answer):

    df = rdd.map(lambda x: x.split(",")).toDF(["col1_name", ..., "colN_name"])
    

    If what you have is actually an RDD[Row], you need to know the type of each column. This can be done by specifying a schema, or (in Scala) by pattern matching on each Row as follows; a PySpark equivalent is sketched right after the snippet:

    val df = rdd.map({ 
      case Row(val1: String, ..., valN: Long) => (val1, ..., valN)
    }).toDF("col1_name", ..., "colN_name")
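
    In PySpark, a minimal sketch of the same idea might look like the following. The field names, types, and sample rows are assumptions for illustration, and sc/spark are the usual SparkContext/SparkSession. You can either pull the typed values out of each Row by name, or hand createDataFrame an explicit schema.

    from pyspark.sql import Row
    from pyspark.sql.types import StructType, StructField, StringType, LongType

    # Hypothetical RDD[Row]; the field names and types are made up for this sketch.
    row_rdd = sc.parallelize([Row(name="alice", age=34), Row(name="bob", age=36)])

    # Option 1: extract the typed values by field name, then name the columns.
    # This mirrors the Scala pattern match above.
    df1 = row_rdd.map(lambda r: (r["name"], r["age"])).toDF(["name", "age"])

    # Option 2: give createDataFrame an explicit schema instead.
    schema = StructType([
        StructField("name", StringType(), True),
        StructField("age", LongType(), True),
    ])
    df2 = spark.createDataFrame(row_rdd.map(lambda r: (r["name"], r["age"])), schema)
    df2.printSchema()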
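
    For completeness, here is a self-contained, runnable version of the toDF() approach. The input lines and column names are made up, and an in-memory RDD stands in for sc.textFile("test"):

    from pyspark.sql import SparkSession

    spark = SparkSession.builder.appName("rdd-to-df").getOrCreate()
    sc = spark.sparkContext

    # Stand-in for sc.textFile("test"): a small RDD of comma-separated lines.
    rdd = sc.parallelize(["alice,34,london", "bob,36,paris"])

    # No Row, no schema: the columns keep the order they have in the data.
    # Note every column is a string here, because split() returns strings.
    df = rdd.map(lambda line: line.split(",")).toDF(["name", "age", "city"])
    df.show()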
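
    Finally, to see why your original code shuffles the columns: in Spark versions before 3.0, a Row built from keyword arguments sorts its fields alphabetically by name (Spark 3.0+ preserves the order you pass). A tiny demo, assuming a pre-3.0 Spark:

    from pyspark.sql import Row

    # String indices sort lexicographically, so "10" lands between "1" and "2".
    r = Row(**{"0": "a", "1": "b", "2": "c", "10": "d"})
    print(r)  # on Spark < 3.0: Row(0='a', 1='b', 10='d', 2='c')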