Is there a PySpark / MLlib equivalent of this classic sklearn train_test_split code below?
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(featuresonly,
                                                    target,
                                                    test_size=0.2,
                                                    random_state=123)
# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
print("Training set has {} good samples.".format(len(y_train) - y_train.sum()))
print("Testing set has {} good samples.".format(len(y_test) - y_test.sum()))
randomSplit, as mentioned above, is the way to go:
train, test = final_data.randomSplit([0.7,0.3], seed=4000)
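One caveat worth knowing: randomSplit assigns each row independently at random, so the resulting split sizes are only approximately 70/30, unlike sklearn's train_test_split, which is exact. A minimal sketch in plain Python (no Spark needed, the row count is made up for illustration) shows the effect:

```python
import random

# Each "row" is sent to train with probability 0.7, independently,
# which is analogous to how randomSplit partitions a DataFrame.
random.seed(4000)
n_rows = 10_000
train_rows = sum(1 for _ in range(n_rows) if random.random() < 0.7)
test_rows = n_rows - train_rows

# The counts hover around 7000/3000 but are rarely exactly that.
print(train_rows, test_rows)
```

If you need exact split sizes or stratification by label, you have to do extra work (e.g. sampling each class separately) rather than relying on randomSplit alone.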
Then you can count the labels in the train set:
dataset_size = float(train.select("label").count())
positives = train.select("label").where('label == 1').count()
percentage_ones = (positives / dataset_size) * 100
negatives = dataset_size - positives
print('The number of ones is {}'.format(positives))
print('The percentage of ones is {}'.format(percentage_ones))
print('The number of zeroes is {}'.format(negatives))
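As a quick sanity check of the counting arithmetic above, the same logic over a plain Python list behaves as expected (the label values below are made up for illustration, standing in for train.select("label")):

```python
# Hypothetical binary labels, 3 ones out of 10 rows.
labels = [1, 0, 0, 1, 0, 0, 0, 1, 0, 0]

dataset_size = float(len(labels))
positives = sum(1 for l in labels if l == 1)
percentage_ones = (positives / dataset_size) * 100
negatives = dataset_size - positives

print(positives, percentage_ones, negatives)  # 3 30.0 7.0
```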