python machine-learning feature-selection kaggle

Does the test set need data cleaning in machine learning?

I am working on an interesting machine learning project with the NYC taxi data (https://s3.amazonaws.com/nyc-tlc/trip+data/green_tripdata_2017-04.csv). The goal is to predict the tip amount. The raw data looks like this (two samples):

   VendorID lpep_pickup_datetime lpep_dropoff_datetime store_and_fwd_flag  \
0         2  2017-04-01 00:03:54   2017-04-01 00:20:51                  N   
1         2  2017-04-01 00:00:29   2017-04-01 00:02:44                  N   

   RatecodeID  PULocationID  DOLocationID  passenger_count  trip_distance  \
0           1            25            14                1           5.29   
1           1           263            75                1           0.76   

   fare_amount  extra  mta_tax  tip_amount  tolls_amount  ehail_fee  \
0         18.5    0.5      0.5        1.00           0.0        NaN   
1          4.5    0.5      0.5        1.45           0.0        NaN   

   improvement_surcharge  total_amount  payment_type  trip_type  
0                    0.3         20.80             1        1.0  
1                    0.3          7.25             1        1.0

There are five different values of 'payment_type', encoded as the numbers 1, 2, 3, 4, 5.

I find that 'tip_amount' is only meaningful when 'payment_type' is 1; 'payment_type' 2, 3, 4, 5 all have zero tip:

for i in range(1,6):
    print(raw[raw["payment_type"] == i][['tip_amount', 'payment_type']].head(2))

gives:

   tip_amount  payment_type
0        1.00             1
1        1.45             1
   tip_amount  payment_type
5         0.0             2
8         0.0             2
     tip_amount  payment_type
100         0.0             3
513         0.0             3
     tip_amount  payment_type
59          0.0             4
102         0.0             4
       tip_amount  payment_type
46656         0.0             5
53090         0.0             5

First question: I want to build a regression model for 'tip_amount'. If I use 'payment_type' as a feature, can the model automatically handle this kind of behavior?

Second question: We know that 'tip_amount' is actually not zero for 'payment_type' 2, 3, 4, 5; it is just not recorded correctly. If I drop these samples and keep only 'payment_type' == 1, then the model cannot predict a zero tip for 'payment_type' 2, 3, 4, 5 on an unseen test dataset. So I have to keep 'payment_type' as an important feature, right?

Third question: Say I keep the samples for all 'payment_type' values and the model is able to predict a zero tip for 'payment_type' 2, 3, 4, 5. Is this what we really want? The underlying true tip should not be zero; that is just how the data looks.

Solution

A common saying in machine learning is "garbage in, garbage out". Feature selection and data preprocessing are often more important than your model architecture.

First question:

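Yes, in general: with payment_type as a feature, a model can learn that payment types 2, 3, 4, 5 go with a zero tip. This tends to work best when the column is treated as a category rather than as a numeric magnitude.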

Second question:

Since payment_type values 2, 3, 4, 5 all result in a tip of 0, why not keep it simple? Replace every payment type that is not 1 with 0. This lets your model easily associate 1 with a tip being paid and 0 with no tip being paid. It also reduces the number of things your model has to learn.
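
A minimal sketch of that recoding with pandas, assuming a small made-up frame in place of the real file. The column name tip_recorded is hypothetical and only used here for illustration:

```python
import pandas as pd

# Toy stand-in for the real data; the rows are invented for illustration.
raw = pd.DataFrame({
    "payment_type": [1, 2, 3, 1, 5, 4],
    "tip_amount": [1.00, 0.0, 0.0, 1.45, 0.0, 0.0],
})

# 1 where payment_type == 1, 0 for every other payment type.
# "tip_recorded" is a hypothetical column name, not part of the dataset.
raw["tip_recorded"] = (raw["payment_type"] == 1).astype(int)

print(raw[["payment_type", "tip_recorded"]])
```

The same expression can be applied unchanged to the test set, so both sets end up with an identical feature.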

Third question:

If the "underlying true tip" is not reflected in the data, then it is simply impossible for your model to learn it. Whether this inaccurate representation of the truth is acceptable is a decision for you to make. Ideally, you would have data that shows the actual tip.
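
If you decide that the zeros for payment types 2, 3, 4, 5 are missing values rather than true zeros, one possible option (a sketch, not the only reasonable choice) is to fit the model only on rows where the tip is actually recorded. The frame below is invented for illustration, and the variable name recorded is hypothetical:

```python
import pandas as pd

# Toy stand-in for the real data; the rows are invented for illustration.
raw = pd.DataFrame({
    "payment_type": [1, 2, 1, 3, 1],
    "trip_distance": [5.29, 1.10, 0.76, 2.40, 3.00],
    "tip_amount": [1.00, 0.0, 1.45, 0.0, 2.00],
})

# Keep only the rows whose tip is recorded (payment_type == 1).
# .copy() avoids modifying a view of the original frame later on.
recorded = raw[raw["payment_type"] == 1].copy()

print(len(raw), "rows in total,", len(recorded), "rows with a recorded tip")
```

A model trained this way estimates the tip for payment type 1 only; it says nothing reliable about the other payment types.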

Preprocessing your data is very important and will help your model tremendously. Besides changing your payment_type feature, you should also look into normalizing your data, that is, rescaling the numeric features to a comparable range, which helps many machine learning algorithms train more reliably.
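
As a rough sketch of normalization, the snippet below standardizes two numeric columns to zero mean and unit variance. The mean and standard deviation are computed on the training rows only and then reused for the test rows, so the test set gets the same preprocessing as the training set. Apart from the two sample rows shown in the question, the values are invented, and the train/test split is an assumption:

```python
import pandas as pd

# First two rows come from the sample in the question; the rest are invented.
train = pd.DataFrame({
    "trip_distance": [5.29, 0.76, 2.10, 3.45],
    "fare_amount": [18.5, 4.5, 9.0, 13.0],
})
test = pd.DataFrame({
    "trip_distance": [1.20, 4.00],
    "fare_amount": [6.5, 15.0],
})

# Statistics come from the training set only.
mean = train.mean()
std = train.std(ddof=0)  # population standard deviation

# Apply the same shift and scale to both sets.
train_scaled = (train - mean) / std
test_scaled = (test - mean) / std

print(train_scaled)
print(test_scaled)
```

Reusing the training statistics keeps information about the test set out of the preprocessing step.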