Tags: python, dataframe, pyspark, apache-spark-mllib

Is there any train_test_split in pyspark or MLLib?


Is there a PySpark / MLlib equivalent of this classic scikit-learn train_test_split code below?

from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(featuresonly, 
                                                    target, 
                                                    test_size = 0.2, 
                                                    random_state = 123)

# Show the results of the split
print("Training set has {} samples.".format(X_train.shape[0]))
print("Testing set has {} samples.".format(X_test.shape[0]))
print("Training set has {} good samples.".format(len(y_train) - y_train.sum()))
print("Testing set has {} good samples.".format(len(y_test) - y_test.sum()))

Solution

  • randomSplit - as mentioned above - is the way to go:

    train, test = final_data.randomSplit([0.7, 0.3], seed=4000)
    

    Then you can count the labels in the train set:

    dataset_size = float(train.select("label").count())
    positives = train.select("label").where('label == 1').count()
    percentage_ones = (float(positives) / dataset_size) * 100
    negatives = dataset_size - positives

    print('The number of ones is {}'.format(positives))
    print('The percentage of ones is {}'.format(percentage_ones))
    print('The number of zeroes is {}'.format(negatives))
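As a quick sanity check of the counting arithmetic above without needing a Spark session, the same computation can be sketched over a plain Python list; the `labels` values below are made-up toy data standing in for the `label` column of the train split:

```python
# Toy label column standing in for train.select("label") - made-up data.
labels = [1, 0, 1, 0, 0, 1, 0, 0, 0, 1]

dataset_size = float(len(labels))
positives = sum(1 for label in labels if label == 1)  # count of ones
percentage_ones = (positives / dataset_size) * 100    # share of positives
negatives = dataset_size - positives                  # count of zeroes

print('The number of ones is {}'.format(positives))        # 4
print('The percentage of ones is {}'.format(percentage_ones))  # 40.0
print('The number of zeroes is {}'.format(negatives))      # 6.0
```

The same class-balance check is worth running on the test split too, since randomSplit samples each row independently and the per-class proportions can drift between the two splits.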