
Convert RDD to DataFrame using pyspark

I have a file in Spark with the following data:

Property ID|Location|Price|Bedrooms|Bathrooms|Size|Price SQ Ft|Status

I have read this file as an RDD (the variable a below).


Now I need to convert this RDD into a DataFrame. I am using the following command:

d=spark.createDataFrame(a).toDF("Property ID","Location","Price","Bedrooms","Bathrooms","Size","Price SQ Ft","Status")

But I am getting this error: TypeError: Can not infer schema for type: <class 'str'>


  • Each element of the RDD is a single string, so Spark cannot infer a row schema from it. You can split each line into fields first:

    d = spark.createDataFrame(a.map(lambda x: x.split('|'))).toDF("Property ID","Location","Price","Bedrooms","Bathrooms","Size","Price SQ Ft","Status")

    Or equivalently, call toDF on the RDD directly:

    d = a.map(lambda x: x.split('|')).toDF(["Property ID","Location","Price","Bedrooms","Bathrooms","Size","Price SQ Ft","Status"])

    In fact, I'd recommend using the Spark CSV reader for this purpose, which could handle the header appropriately too:

    df = spark.read.csv('/FileStore/tables/realestate.txt', header=True, inferSchema=True, sep='|')
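
If you stay with the RDD route, note that the header line is still an ordinary element of the RDD and would end up as a data row. A minimal sketch of one way to handle that is below. The helper names parse_line and lines_to_dataframe are made up for illustration, and the sketch assumes the header is the first element of the RDD and that a SparkSession is already active.

```python
# Column names taken from the header line shown in the question.
COLUMNS = ["Property ID", "Location", "Price", "Bedrooms",
           "Bathrooms", "Size", "Price SQ Ft", "Status"]


def parse_line(line, sep="|"):
    """Split one raw text line into a list of string fields."""
    return line.split(sep)


def lines_to_dataframe(rdd, columns=COLUMNS, sep="|"):
    """Turn an RDD of raw text lines into a DataFrame of string columns.

    Assumption: the first element of the RDD is the header line.
    rdd.toDF is only available once a SparkSession has been created.
    """
    header = rdd.first()
    rows = (
        rdd.filter(lambda line: line != header)  # drops every line equal to the header
           .map(lambda line: parse_line(line, sep))
    )
    return rows.toDF(columns)
```

With the RDD from the question this would be called as d = lines_to_dataframe(a). Every column comes out as a string, so numeric columns such as Price would still need a cast afterwards. The CSV reader with inferSchema=True avoids that extra step.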