machine-learning, logistic-regression, supervised-learning, overfitting-underfitting

Overfitting in a data frame where some rows are repeated


I have a machine learning problem that uses logistic regression. In my data frame, some rows and features are repeated, as in the table below:

feature 1  feature 2  feature 3  ...  feature n-1  feature n  Target
a1         a2         a3         ..   an           1          1
b1         b2         b3         ..   bn           1          0
c1         c2         c3         ..   cn           1          1
..         ..         ..         ..   ..           1          ..
a1         a2         a3         ..   an           2          ..
b1         b2         b3         ..   bn           2          ..
c1         c2         c3         ..   cn           2          ..
..         ..         ..         ..   ..           2          ..
a1         a2         a3         ..   an           3          ..
b1         b2         b3         ..   bn           3          ..
c1         c2         c3         ..   cn           3          ..
..         ..         ..         ..   ..           ..         ..


Can overfitting or underfitting occur with this data frame?
And what about a data frame that has 6 to 8 features and about 500 rows?
I should add that the rows repeated in features 1 to n-1 differ in feature n.


Solution

  • Whether you overfit or not depends on:

    • the complexity of the model
    • the available data.

    But what matters is the actual data. If you double the data by repeating it, you add no new information, so you don't effectively change the data you have. In fact, many algorithms randomly sample from the dataset, so having duplicates changes nothing (unless the duplicated data has a different distribution than the non-duplicated data).

    As such, removing the duplicates will not affect whether you overfit or not.
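The claim about exact duplicates can be checked numerically. Below is a minimal sketch, not the asker's setup: it assumes synthetic data (500 rows and 6 features, echoing the sizes in the question), an unregularized logistic regression, and a hypothetical helper `fit_logistic` that fits it with Newton's method. Repeating every row doubles both the gradient and the Hessian of the loss, so each Newton step, and therefore the fitted coefficients, stays the same up to floating-point error.

```python
import numpy as np

def fit_logistic(X, y, n_iter=25):
    """Hypothetical helper: unregularized logistic regression via Newton's method."""
    Xb = np.hstack([np.ones((X.shape[0], 1)), X])  # prepend an intercept column
    w = np.zeros(Xb.shape[1])
    for _ in range(n_iter):
        p = 1.0 / (1.0 + np.exp(-(Xb @ w)))             # predicted probabilities
        grad = Xb.T @ (p - y)                           # gradient of the log-loss
        hess = Xb.T @ (Xb * (p * (1.0 - p))[:, None])   # Hessian of the log-loss
        w -= np.linalg.solve(hess, grad)                # Newton update
    return w

# Synthetic data (an assumption for illustration only).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 6))
true_w = np.array([1.0, -0.5, 0.0, 0.8, 0.0, -1.2])
y = (rng.random(500) < 1.0 / (1.0 + np.exp(-(X @ true_w)))).astype(float)

w_once = fit_logistic(X, y)
# Same data with every row repeated once.
w_twice = fit_logistic(np.vstack([X, X]), np.concatenate([y, y]))

# Largest coefficient difference between the two fits.
print(np.max(np.abs(w_once - w_twice)))
```

One caveat to this sketch: with a penalized model whose penalty strength is held fixed, repeating the data increases the weight of the data term relative to the penalty, so the coefficients can shift somewhat.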

    Edit: Now, if the data is not duplicated but rather modified, it is a different story. The question says:

    "where some rows and features are repeated"

    If the rows are exact repeats, then there is no effect.

    But if the data is modified, as the table shows, then you need to explain why: Are these actual noisy measurements? Is this some random transcription or data collection error?

    However, if these are not errors in the dataset but actual data, then it is important to keep them. This is not about overfitting; it is about training on the actual data.
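To tell the two cases apart before deciding what to keep, one could separate exact duplicates from rows that only share features 1 to n-1. The following is a rough sketch using pandas; the toy frame, its values, and the column names `f1`, `f2`, `f3`, `fn`, and `target` are hypothetical stand-ins for the table in the question.

```python
import pandas as pd

# Hypothetical toy frame: f1..f3 stand in for features 1..n-1, fn for feature n.
df = pd.DataFrame({
    "f1": ["a1", "b1", "c1", "a1", "b1", "c1", "d1"],
    "f2": ["a2", "b2", "c2", "a2", "b2", "c2", "d2"],
    "f3": ["a3", "b3", "c3", "a3", "b3", "c3", "d3"],
    "fn": [1, 1, 1, 2, 2, 2, 1],
    "target": [1, 0, 1, 1, 0, 0, 1],
})
base_cols = ["f1", "f2", "f3"]

# Rows identical in every column, target included: exact duplicates.
exact_dupes = df[df.duplicated(keep=False)]

# Rows that share features 1..n-1 with at least one other row.
repeated_base = df[df.duplicated(subset=base_cols, keep=False)]

# For each repeated base, count the distinct values of feature n and the target.
summary = repeated_base.groupby(base_cols)[["fn", "target"]].nunique()

print(len(exact_dupes), len(repeated_base))
print(summary)
```

A group whose `target` count is above 1 has the same base features but different labels. That is the kind of row worth checking for measurement noise or data entry errors before deciding whether it is genuine data.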