I have a complex data frame with 10,999 rows and I am trying to run XGBoost for machine learning.
I load the data and attempt to split it the way tutorials and StackOverflow answers (e.g. "How do I create test and train samples from one dataframe with pandas?") suggest:
X_train, X_test = train_test_split(df, test_size=0.2)
but this fails:
TypeError: Expected sequence or array-like, got <class 'pyspark.sql.dataframe.DataFrame'>
This confused me: how can I put a dataframe into an array without losing lots of valuable information?
So I was advised to try pandas:
pandasDF = df.toPandas
X_train, X_test = train_test_split(pandasDF, test_size=0.2)
but this also fails:
TypeError: Singleton array array(<bound method PandasConversionMixin.toPandas of DataFrame
How can I split this dataframe into training and test sets?
Use this instead, with parentheses:
pandasDF = df.toPandas()
toPandas is a method, so df.toPandas without the parentheses returns the bound method object itself rather than a pandas DataFrame, which is exactly what your second TypeError is complaining about.
If the conversion is slow, enable Arrow before calling it:
spark.conf.set("spark.sql.execution.arrow.enabled", "true")
(On Spark 3.x this key was renamed to spark.sql.execution.arrow.pyspark.enabled.)
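Once the conversion succeeds, the split itself works exactly as in the tutorials. Here is a minimal sketch of the full flow; the small pandas DataFrame and its column names below are made up as a stand-in for the converted Spark DataFrame:

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Stand-in for pandasDF = df.toPandas(); the column names are
# hypothetical -- substitute your real converted DataFrame.
pandasDF = pd.DataFrame({
    "feature_a": range(100),
    "feature_b": range(100, 200),
    "label": [i % 2 for i in range(100)],
})

# The 80/20 split from the question; random_state makes it reproducible.
train_df, test_df = train_test_split(pandasDF, test_size=0.2, random_state=42)

print(len(train_df), len(test_df))  # 80 20
```

Note that toPandas() collects the whole dataset to the driver, so if the data ever grows too large for that, PySpark can split without any conversion via df.randomSplit([0.8, 0.2]).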