While working on the credit card fraud dataset on Kaggle (link), I found that I get a better model if I reduce the size of the dataset used for training. For context, the dataset is composed of 284807 records of 31 features, and only 492 of those records are frauds (about 0.17%).
I tried running a PCA on the full dataset, keeping only the 3 most important components so that I could plot it. This is the result:
In this plot, it's impossible to find a pattern that tells whether a point is a fraud or not.
If I downsample only the non-fraud records to increase the fraud/non-fraud ratio, this is what I get with the same plot:
Now, I don't know whether it makes sense to fit a PCA on a reduced dataset in order to get a better decomposition. For example, if I fit the PCA with 100000 points, we could say that all entries with PCA1 > 5 are frauds.
This is the code if you want to try it:
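import pandas as pd
import matplotlib.pyplot as plt
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
dataset = pd.read_csv("creditcard.csv")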
sample_size = 284807-492 # between 1 and 284807-492
a = dataset[dataset["Class"] == 1] # always keep all frauds
b = dataset[dataset["Class"] == 0].sample(sample_size) # reduce non fraud qty
dataset = pd.concat([a, b]).sample(frac=1) # concat with a shuffle
# Scaling of features for the PCA
y = dataset["Class"]
X = dataset.drop("Class", axis=1)
X_scale = StandardScaler().fit_transform(X)
# Doing PCA on the dataset
pca = PCA(n_components=3)
X_pca = pca.fit_transform(X_scale)
pca1, pca2, pca3, c = X_pca[:, 0], X_pca[:, 1], X_pca[:, 2], y
plt.scatter(pca1, pca2, s=pca3, c=y)
plt.xlabel("PCA1")
plt.ylabel("PCA2")
plt.title("{}-points".format(sample_size))
# plt.savefig("{}-points".format(sample_size), dpi=600)
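One way to check whether a rule such as "PCA1 > 5" still holds outside the reduced sample is to fit the scaler and the PCA on the reduced data, then project the full dataset with those same fitted objects. Below is a minimal sketch of that idea. It uses synthetic data rather than creditcard.csv, and the class sizes, the shift of the fraud cluster, and the `threshold` value are all assumptions chosen for illustration only.

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler

rng = np.random.default_rng(0)

# Synthetic stand-in for the real data (assumption): 20,000 "normal" rows
# and 40 "fraud" rows shifted away from the bulk, 5 features each.
X_normal = rng.normal(0.0, 1.0, size=(20_000, 5))
X_fraud = rng.normal(4.0, 1.0, size=(40, 5))
X_full = np.vstack([X_normal, X_fraud])
y_full = np.r_[np.zeros(len(X_normal), dtype=int), np.ones(len(X_fraud), dtype=int)]

# Undersample the majority class, keeping every fraud row.
keep_normal = rng.choice(len(X_normal), size=400, replace=False)
X_small = np.vstack([X_normal[keep_normal], X_fraud])

# Fit the scaler and the PCA on the reduced data only...
scaler = StandardScaler().fit(X_small)
pca = PCA(n_components=3).fit(scaler.transform(X_small))

# ...then project the FULL dataset with the same fitted objects.
pc1_full = pca.transform(scaler.transform(X_full))[:, 0]

# The sign of a principal component is arbitrary, so orient PC1
# such that frauds have the larger mean before thresholding.
if pc1_full[y_full == 1].mean() < pc1_full[y_full == 0].mean():
    pc1_full = -pc1_full

threshold = 2.5  # hypothetical cut-off, picked by eye for this toy data
flagged = pc1_full > threshold
true_positives = (flagged & (y_full == 1)).sum()
precision = true_positives / max(flagged.sum(), 1)
recall = true_positives / (y_full == 1).sum()
print(f"precision={precision:.3f} recall={recall:.3f}")
```

Measuring precision and recall on the full data, instead of reading the rule off the reduced plot, shows how many of the discarded non-fraud rows would also cross the threshold.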
Thanks for your help,
It makes sense, definitely.
The technique you are using is commonly known as random undersampling, and in ML it is generally useful when dealing with imbalanced data problems (such as the one you describe). You can read more about it on this Wikipedia page.
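As a rough illustration of random undersampling, here is a minimal sketch with pandas. The helper name `undersample_majority` and its `ratio` parameter are made up for this example (they are not part of any library), and the toy DataFrame merely stands in for the real dataset.

```python
import pandas as pd


def undersample_majority(df, label_col="Class", ratio=1.0, random_state=0):
    """Keep every minority-class row and a random subset of the other rows.

    ratio is the desired number of majority rows per minority row.
    """
    counts = df[label_col].value_counts()
    minority_label = counts.idxmin()
    minority = df[df[label_col] == minority_label]
    majority = df[df[label_col] != minority_label]

    # Never ask for more majority rows than actually exist.
    n_keep = min(len(majority), int(round(ratio * len(minority))))
    majority_kept = majority.sample(n=n_keep, random_state=random_state)

    # Concatenate and shuffle so the classes are interleaved.
    combined = pd.concat([minority, majority_kept])
    return combined.sample(frac=1, random_state=random_state).reset_index(drop=True)


# Toy data (assumption): 10 frauds among 1000 rows.
toy = pd.DataFrame({"Amount": range(1000), "Class": [1] * 10 + [0] * 990})
balanced = undersample_majority(toy, ratio=5.0)
print(balanced["Class"].value_counts())
```

If you prefer a ready-made implementation, the imbalanced-learn package provides a `RandomUnderSampler` class for the same purpose.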
There are, of course, many other methods to deal with class imbalance, but the beauty of this one is that it is quite simple and, sometimes, really effective.
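As one example of those other methods, class weighting keeps all the rows and instead penalises mistakes on the rare class more heavily. The sketch below uses scikit-learn's `class_weight="balanced"` option on synthetic data. The dataset parameters and the choice of logistic regression are assumptions for illustration, not a recommendation for the credit card data specifically.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import recall_score
from sklearn.model_selection import train_test_split

# Synthetic imbalanced data (assumption): roughly 1% positives.
X, y = make_classification(
    n_samples=20_000,
    n_features=10,
    n_informative=5,
    weights=[0.99, 0.01],
    random_state=0,
)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, stratify=y, random_state=0
)

# Same model twice: once unweighted, once with weights inversely
# proportional to class frequency.
plain = LogisticRegression(max_iter=1000).fit(X_train, y_train)
weighted = LogisticRegression(max_iter=1000, class_weight="balanced").fit(
    X_train, y_train
)

recall_plain = recall_score(y_test, plain.predict(X_test))
recall_weighted = recall_score(y_test, weighted.predict(X_test))
print(f"recall without weights: {recall_plain:.3f}")
print(f"recall with balanced weights: {recall_weighted:.3f}")
```

Comparing recall on the rare class (together with precision) is usually more informative than accuracy here, since a model that never predicts fraud is already more than 99% accurate.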