Tags: python, pandas, scikit-learn, classification, training-data

How can I specify a training set and test set from separate dataframes?


I have a dataframe containing a mixture of news articles and Facebook posts (full texts), each with a corresponding label. The same set of labels is used for all the texts, both articles and posts. I want to train my classifier on both types of text, but have only Facebook posts in my test set. Is there any way to specify a group of rows (grouped by a 'source' column) from which to extract the test set?

I'm using

from sklearn.model_selection import train_test_split

and simpletransformers for the classification model.

Thanks!


Solution

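  • A standard train/test split is done as follows:

    # features: the column(s) the model is trained on
    X = df[<columns>]
    # target: the label column
    y = df[<one column>]
    # hold out 30% of the rows as the test set; stratify=y keeps the label proportions similar in both parts
    X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)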
    

    If you have two dataframes, combine them into one first. Use pd.concat for this, since DataFrame.append was removed in pandas 2.0:

    df = pd.concat([df1, df2], ignore_index=True)
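
Note that a plain train_test_split over the whole dataframe draws test rows from every source. To get a test set made up only of Facebook posts, as the question asks, one option is to split just the Facebook rows and then add every other row to the training part. The following is a minimal sketch of that idea, not a definitive implementation. The toy data, the column names 'text' and 'label', and the source values 'facebook' and 'news' are assumptions for illustration; only the 'source' column name comes from the question.

```python
import pandas as pd
from sklearn.model_selection import train_test_split

# Toy data standing in for the real dataframe (assumed column names and values).
df = pd.DataFrame({
    "text": [f"doc {i}" for i in range(20)],
    "label": [0, 1] * 10,
    "source": ["facebook"] * 10 + ["news"] * 10,
})

# Separate the rows that are allowed in the test set from all the others.
is_fb = df["source"] == "facebook"
fb_df = df[is_fb]
other_df = df[~is_fb]

# Split only the Facebook posts; stratify keeps label proportions similar.
fb_train, test_df = train_test_split(
    fb_df, test_size=0.3, random_state=123, stratify=fb_df["label"]
)

# Training set: all news articles plus the Facebook posts not held out,
# shuffled so the two sources are interleaved.
train_df = pd.concat([other_df, fb_train]).sample(frac=1, random_state=123)
```

Here test_size is the fraction of Facebook posts that are held out, not a fraction of the whole dataframe, so it may need adjusting to reach the desired test set size. Both train_df and test_df keep the original row index, which makes it easy to check that no row appears in both.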