Tags: python, dataframe, pyspark, apache-spark-mllib

Is there any train_test_split in pyspark or MLLib?


Is there a PySpark / MLlib equivalent of this classic scikit-learn train_test_split code below?

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(featuresonly, 
                                                    target, 
                                                    test_size = 0.2, 
                                                    random_state = 123)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
print("Training set has {} good samples.".format(len(y_train) - y_train.sum()))
print("Testing set has {} good samples.".format(len(y_test) - y_test.sum()))

Solution

  • randomSplit - as mentioned above - is the way to go:

    train, test = final_data.randomSplit([0.7, 0.3], seed=4000)
    

    Then you can count the labels in the train set:

    dataset_size = float(train.select("label").count())
    positives = train.select("label").where('label == 1').count()
    percentage_ones = (float(positives) / dataset_size) * 100
    negatives = dataset_size - positives

    print('The number of ones is {}'.format(positives))
    print('The percentage of ones is {}'.format(percentage_ones))
    print('The number of zeroes is {}'.format(negatives))
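As a quick sanity check of the counting arithmetic above without needing a Spark session, the same computation can be sketched over a plain Python list; the `labels` values below are made-up toy data standing in for the `label` column of the train split:

```python
# Toy label column standing in for train.select("label") - made-up data.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]

dataset_size = float(len(labels))
positives = sum(1 for label in labels if label == 1)  # count of ones
percentage_ones = (positives / dataset_size) * 100    # share of positives
negatives = dataset_size - positives                  # count of zeroes

print('The number of ones is {}'.format(positives))        # 4
print('The percentage of ones is {}'.format(percentage_ones))  # 40.0
print('The number of zeroes is {}'.format(negatives))      # 6.0
```

The same class-balance check is worth running on the test split too, since randomSplit samples each row independently and the per-class proportions can drift between the two splits.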