Understanding the PercentTrue primitive output in featuretools


I've been playing with the predict-appointment-noshow notebook tutorial and I'm confused by the output of the PERCENT_TRUE primitive.

My understanding is that after feature generation, a column like locations.PERCENT_TRUE(appointments.sms_received) gives the percent of rows for which sms_received is True, given a single location, which was defined as its own Entity earlier. I'd expect that column to be the same for all rows of a single location, because that's what it was conditioned on, but I'm not finding that to be the case. Any ideas why?

Here's an example from that notebook data to demonstrate:

>>> fm.loc[fm.neighborhood == 'HORTO', 'locations.PERCENT_TRUE(appointments.sms_received)'].describe()

count 144.00
mean 0.20
std 0.09
min 0.00
25% 0.20
50% 0.23
75% 0.26
max 0.31
Name: locations.PERCENT_TRUE(appointments.sms_received), dtype: float64

Even though the location is restricted to just 'HORTO', the column ranges from 0.00 to 0.31. How is this being calculated?


Solution

  • This is a result of using cutoff times when calculating this feature matrix.

    In this example, we are making predictions for every appointment at the time the appointment is scheduled. The feature locations.PERCENT_TRUE(appointments.sms_received) is therefore calculated at a specific time given by the cutoff times. For each appointment, it calculates "the percentage of appointments at this location that received an SMS prior to the scheduled_time".

    That construction is necessary to prevent the leakage of future information into the prediction for that row at that time. If we calculated PERCENT_TRUE using the whole dataset, we'd necessarily be using information from appointments that hadn't yet happened, which isn't valid for predictive modeling.

    If you instead want to make the predictions after all of the data is known, all you have to do is remove the cutoff_time argument from the ft.dfs call:

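    import featuretools as ft

    # `es` and `cutoff_times` are the EntitySet and cutoff times built earlier in the notebook.
    fm, features = ft.dfs(entityset=es,
                          target_entity='appointments',
                          agg_primitives=['count', 'percent_true'],
                          trans_primitives=['weekend', 'weekday', 'day', 'month', 'year'],
                          max_depth=3,
                          approximate='6h',
                          # cutoff_time is commented out, so every row is computed from the full dataset
                          # cutoff_time=cutoff_times[20000:],
                          verbose=True)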
    

    Now you can see that the feature is the same when we condition on a specific location:

    fm.loc[fm.neighborhood == 'HORTO', 'locations.PERCENT_TRUE(appointments.sms_received)'].describe()
    count   175.00
    mean      0.32
    std       0.00
    min       0.32
    25%       0.32
    50%       0.32
    75%       0.32
    max       0.32
    

    You can read more about how Featuretools handles time in the documentation.
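
To make the effect of cutoff times described above more concrete, here is a minimal sketch in plain pandas. It does not call Featuretools and is not how the library is implemented. The toy data, the helper name percent_true_at_cutoff, and the choice to count a row that falls exactly at its own cutoff are all assumptions made for illustration.

```python
import pandas as pd

# Hypothetical toy data (not from the notebook): five appointments at one location.
appointments = pd.DataFrame({
    "appointment_id": [1, 2, 3, 4, 5],
    "neighborhood": ["HORTO"] * 5,
    "scheduled_time": pd.to_datetime([
        "2016-05-01", "2016-05-02", "2016-05-03", "2016-05-04", "2016-05-05",
    ]),
    "sms_received": [True, False, True, True, False],
})


def percent_true_at_cutoff(df, neighborhood, cutoff):
    """Fraction of True sms_received values among the rows at `neighborhood`
    whose scheduled_time is at or before `cutoff` (inclusive by assumption)."""
    known = df[(df["neighborhood"] == neighborhood) & (df["scheduled_time"] <= cutoff)]
    return known["sms_received"].mean()


# With cutoff times: each appointment only sees appointments scheduled up to its own time,
# so the value differs from row to row even within a single location.
with_cutoff = pd.Series(
    [
        percent_true_at_cutoff(appointments, row.neighborhood, row.scheduled_time)
        for row in appointments.itertuples()
    ],
    index=appointments["appointment_id"],
)

# Without cutoff times: every appointment sees the whole dataset,
# so the value is identical for all rows of a location.
without_cutoff = (
    appointments["sms_received"]
    .astype(float)
    .groupby(appointments["neighborhood"])
    .transform("mean")
)

print(with_cutoff.describe())
print(without_cutoff.describe())
```

In this sketch the first series varies across rows while the second has zero spread, which mirrors the difference between the two describe() outputs above.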