Tags: python, apache-spark, machine-learning, pyspark, xgboost

How to get `train_test_split` to work with a dataframe?


I have a complex data frame with 10,999 rows.

I am trying to run xgboost for machine learning.

I load the data and try to split it as shown in tutorials and in Stack Overflow answers such as "How do I create test and train samples from one dataframe with pandas?":

X_train, X_test = train_test_split(df, test_size=0.2)

but this fails:

TypeError: Expected sequence or array-like, got <class 'pyspark.sql.dataframe.DataFrame'>

But this doesn't make sense. How can I possibly put a dataframe into an array without losing lots of valuable information?

So I was advised to try pandas:

pandasDF = df.toPandas
X_train, X_test = train_test_split(pandasDF, test_size=0.2)

but this also fails:

TypeError: Singleton array array(<bound method PandasConversionMixin.toPandas of DataFrame

How can I split this dataframe into training and test sets?


Solution

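  • `toPandas` is a method, so it has to be called with parentheses. Without them, `pandasDF` holds the bound method itself rather than a pandas DataFrame, which is what the second error reports. Use this instead: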

    pandasDF = df.toPandas()
    

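    If the conversion is slow, enable Arrow with this configuration before converting (since Spark 3.0 this key is deprecated in favor of `spark.sql.execution.arrow.pyspark.enabled`):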

    spark.conf.set("spark.sql.execution.arrow.enabled", "true")
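
Putting the two steps together, a minimal sketch of the whole split might look like the following. The helper name `split_spark_dataframe` and the default seed are illustrative and not part of any library. It assumes `spark_df` is a PySpark DataFrame small enough to fit in driver memory, which should hold for roughly 11,000 rows.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_spark_dataframe(spark_df, test_size=0.2, seed=42):
    """Collect a Spark DataFrame to pandas, then split it into train and test sets.

    Hypothetical helper: `spark_df` is assumed to be a pyspark.sql.DataFrame
    (anything exposing a toPandas() method) that fits in driver memory.
    """
    # Note the parentheses: toPandas() runs the conversion, while toPandas
    # on its own is only a reference to the method.
    pandas_df = spark_df.toPandas()
    if not isinstance(pandas_df, pd.DataFrame):
        raise TypeError(f"expected a pandas DataFrame, got {type(pandas_df)!r}")
    # train_test_split accepts a pandas DataFrame and keeps every column,
    # so nothing is lost by the split itself.
    return train_test_split(pandas_df, test_size=test_size, random_state=seed)
```

With the `df` from the question, the call would be `X_train, X_test = split_spark_dataframe(df)`. If the data were too large to collect on the driver, Spark's own `df.randomSplit([0.8, 0.2], seed=42)` returns two Spark DataFrames without converting to pandas, although the resulting sizes are only approximately 80/20.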