python, apache-spark, pyspark, rdd, apache-spark-sql

Convert RDD of LabeledPoint to DataFrame toDF() Error


I have a DataFrame df in which each row is a string of 13 comma-separated values. I want to build a DataFrame df2 of LabeledPoint objects, where the first value is the label and the other twelve are the features. I use split and select to turn each string into an array of 13 values, then map to create the LabeledPoint. The error occurs when I call toDF() to convert the RDD to a DataFrame:

df2 = df.select(split(df[0], ',')).map(lambda x: LabeledPoint(float(x[0]),x[-12:])).toDF()

org.apache.spark.SparkException: Job aborted due to stage failure:

When I look at the stack trace, I find:

IndexError: tuple index out of range.

To test this, I ran:

display(df.select(split(df[0], ',')))

I get my 13 values in an array for each row:

["2001.0","0.884123733793","0.610454259079","0.600498416968","0.474669212493","0.247232680947","0.357306088914","0.344136412234","0.339641227335","0.600858840135","0.425704689024","0.60491501652","0.419193351817"]

Any idea?


Solution

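  • The error comes from the indexing. Each row returned by select has a single column that holds the whole array, so x[0] is the array itself, not its first value. x[0] should be replaced by x[0][0], and x[-12:] by x[0][-12:]. So: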

    df2 = df.select(split(df[0], ',')).map(lambda x: LabeledPoint(float(x[0][0]), x[0][-12:])).toDF()
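The indexing can be checked without a cluster. The sketch below is only an illustration: it uses a plain one-element tuple as a stand-in for the single-column Row that select(split(...)) returns, and parse_row is a hypothetical helper name, not part of Spark. It shows that the label and features have to be read from x[0], the array inside the row.

```python
# Stand-in for one Row produced by df.select(split(df[0], ',')).
# Assumption: a plain tuple with a single field mimics the one-column Row,
# so this sketch runs without Spark installed.
values = [
    "2001.0", "0.884123733793", "0.610454259079", "0.600498416968",
    "0.474669212493", "0.247232680947", "0.357306088914", "0.344136412234",
    "0.339641227335", "0.600858840135", "0.425704689024", "0.60491501652",
    "0.419193351817",
]
row = (values,)  # one column, whose value is the whole array


def parse_row(x):
    """Hypothetical helper: return (label, features) from a one-column row."""
    array = x[0]                                 # the array of 13 strings
    label = float(array[0])                      # x[0][0]: first value is the label
    features = [float(v) for v in array[-12:]]   # x[0][-12:]: the other twelve values
    return label, features


label, features = parse_row(row)
print(len(row))        # the row itself has only one field
print(label)
print(len(features))
```

In the real job, the same pair would be passed to LabeledPoint inside the map call. Note that on Spark 2.x and later a DataFrame no longer exposes map directly, so the call would likely need to go through df.rdd.map(...) instead.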