I have a dataframe containing a mixture of news articles and Facebook posts (full texts), each with a corresponding label (a single set of labels covers all the texts, both articles and posts). I want to train my classifier on both types of text, but have only Facebook posts in my test set. Is there any way to specify a group of rows (grouped by a 'source' column) from which to extract the test set?
I'm using
from sklearn.model_selection import train_test_split
and simpletransformers for the classification model.
Thanks!
Splitting is done the following way:
# create X
X = df[<columns>]
# create y
y = df[<one column>]
# split to train and test
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.3, random_state=123, stratify=y)
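If your data is in two separate dataframes, concatenate them first (DataFrame.append was removed in pandas 2.0, so use pd.concat):
import pandas as pd
df = pd.concat([df1, df2], ignore_index=True)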
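To draw the test set only from Facebook posts, one option is to split just those rows and then add all the articles to the training portion. Below is a minimal sketch of that idea, not a definitive recipe. The helper name split_with_test_from_source, the column names 'text' and 'label', the source values 'facebook' and 'news', and the toy data are all assumptions for illustration; only the 'source' column comes from the question. Note that test_size here is a fraction of the Facebook posts, not of the whole dataframe.

```python
import pandas as pd
from sklearn.model_selection import train_test_split


def split_with_test_from_source(df, source_col="source", test_source="facebook",
                                label_col="label", test_size=0.3, random_state=123):
    """Hypothetical helper: the test set is drawn only from rows of one source."""
    is_test_source = df[source_col] == test_source
    posts = df[is_test_source]    # rows allowed to end up in the test set
    others = df[~is_test_source]  # rows that only ever go into training

    # Stratified split of the posts alone, so test labels mirror the posts' labels.
    posts_train, posts_test = train_test_split(
        posts,
        test_size=test_size,
        random_state=random_state,
        stratify=posts[label_col],
    )

    # Training data = all other sources + the remaining posts, shuffled.
    train = pd.concat([others, posts_train]).sample(frac=1, random_state=random_state)
    return train, posts_test


# Toy data (made up): 10 Facebook posts and 6 news articles with binary labels.
df = pd.DataFrame({
    "text": [f"post {i}" for i in range(10)] + [f"article {i}" for i in range(6)],
    "label": [0, 1] * 5 + [0, 1] * 3,
    "source": ["facebook"] * 10 + ["news"] * 6,
})

train_df, test_df = split_with_test_from_source(df)
```

Since simpletransformers classification models are trained and evaluated on dataframes, train_df and test_df can then be reduced to the text and label columns and passed on in whatever format your model expects.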