Tags: python, scikit-learn, classification, rules, interaction

How do I extract meaningful simple rules from this classification problem?


I have a problem of this type: A customer creates an order by hand, which might be erroneous. Submitting a wrong order is costly, which is why we try to reduce the error rate.

I need to detect what factors cause an error, so that a new rule can be created, such as Product "A" and type "B" must not go together. All explanatory variables are categorical.

I have 2 questions:

  1. What approach do I take to extract simple but useful rules to give to a human expert for further review? A useful rule should cover as many errors as possible while covering as few non-errors as possible.
  2. How do I make sure variable interactions are taken into account?

Below is a sample dataset and the simple approach I took: finding variable values with a high proportion of errors and proposing them as rules. I created a single interaction term by hand (based on prior knowledge, but I might be missing others).

I also tried classification models (LASSO, decision tree, random forest), but ran into two issues: (1) high dimensionality, especially when including many interactions, and (2) difficulty extracting simple rules, since the models use many coefficients even with regularization.

import pandas as pd

# Create sample dataset for task
df = pd.DataFrame(data={'error':[0,1,0,0,0,0,0,1,1,1],
                        'product':[1,2,1,2,2,3,4,2,2,2],
                        'type':[1,1,2,3,3,1,2,1,4,4],
                        'discount_level':[5,3,3,4,1,2,2,1,4,5],
                        'extra1':[1,1,1,2,2,2,3,3,3,3,],
                        'extra2':[1,2,3,1,2,3,1,2,3,1],
                        'extra3':[6,6,9,9,8,8,7,7,6,6]
                        })

# Variable interaction based on prior knowledge
df['product_type'] = df['product'].astype(str) + '_' + df['type'].astype(str)
X = df.drop('error', axis=1)

# Find groups with a high proportion of errors
high_error_groups = []
for col in X.columns:
    groups = df.groupby(col).agg(count_all=('error', 'count'),
                                 count_error=('error', 'sum'))
    groups['portion_error'] = groups['count_error'] / groups['count_all']
    groups['column'] = col

    # Keep only groups with a high proportion of errors
    high_error_groups.append(groups.loc[groups['portion_error'] > 0.8])

# Combine the per-column results and record which value each row refers to
groups_expl = pd.concat(high_error_groups, axis=0)
groups_expl['col_val'] = groups_expl.index

print(groups_expl)

Thank you for your help!


Solution

  • What approach do I take to extract simple but useful rules to give to a human expert for further review?

    You could experiment with a shallow boosted-tree ensemble, for example XGBClassifier(n_estimators = 100, max_depth = 2).

    The idea is that each tree in the ensemble is shallow enough (at most two splits deep) to be read as a rule, and comes to represent some feature combination that corresponds to elevated risk.
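A minimal sketch of how the individual shallow trees might be inspected. It swaps in scikit-learn's GradientBoostingClassifier as a stand-in for XGBClassifier and uses only the `product` and `type` columns from the question's sample data. The one-hot encoding step and the small `n_estimators` value are assumptions made to keep the printed trees readable.

```python
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.tree import export_text

# Subset of the sample data from the question
df = pd.DataFrame({
    'error':   [0, 1, 0, 0, 0, 0, 0, 1, 1, 1],
    'product': [1, 2, 1, 2, 2, 3, 4, 2, 2, 2],
    'type':    [1, 1, 2, 3, 3, 1, 2, 1, 4, 4],
})

# One-hot encode the categorical columns so that each split tests a single
# category (e.g. product_2) instead of an ordering of the category codes
X = pd.get_dummies(df[['product', 'type']], columns=['product', 'type']).astype(int)
y = df['error']

# Shallow ensemble: each member has at most two levels of splits
model = GradientBoostingClassifier(n_estimators=5, max_depth=2, random_state=0)
model.fit(X, y)

# Each ensemble member is a small regression tree; print the first few as text
for i, stage in enumerate(model.estimators_[:3]):
    print(f"tree {i}")
    print(export_text(stage[0], feature_names=list(X.columns)))
```

Branches whose leaf value is clearly positive are candidates to hand to the expert as rules. The printed leaves are contributions to the model's score, not error probabilities, so any candidate should still be checked against the raw counts.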

  • How do I make sure variable interactions are taken into account?

    Decision tree models are easy to visualize and interpret, and they capture feature interactions automatically.

    Imagine the following split logic:

    def risk(product, extra):
        # Only the combination product == 1 and extra == 3 is flagged
        if product == 1:
            if extra == 3:
                return "high risk"
            else:
                return "no risk"
        else:
            return "no risk"
    

    As you can see, this decision tree only contributes towards the total risk score when product == 1 and extra == 3. That's a feature interaction.
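As a complement to the tree-based view, pairwise interactions can also be checked by brute force on a dataset this small. The sketch below is not part of the approach described above. It counts, for every pair of (column, value) conditions in the question's sample data, how many orders the pair covers and how many of those are errors. The thresholds (at least 2 covered orders, 100% of them errors) are arbitrary assumptions chosen for illustration.

```python
from itertools import combinations

# Sample data from the question, stored column-wise
data = {
    'error':          [0, 1, 0, 0, 0, 0, 0, 1, 1, 1],
    'product':        [1, 2, 1, 2, 2, 3, 4, 2, 2, 2],
    'type':           [1, 1, 2, 3, 3, 1, 2, 1, 4, 4],
    'discount_level': [5, 3, 3, 4, 1, 2, 2, 1, 4, 5],
    'extra1':         [1, 1, 1, 2, 2, 2, 3, 3, 3, 3],
    'extra2':         [1, 2, 3, 1, 2, 3, 1, 2, 3, 1],
    'extra3':         [6, 6, 9, 9, 8, 8, 7, 7, 6, 6],
}
n_rows = len(data['error'])
features = [c for c in data if c != 'error']
total_errors = sum(data['error'])

# (col_a, val_a, col_b, val_b) -> (orders covered, errors among them)
stats = {}
for a, b in combinations(features, 2):
    for i in range(n_rows):
        key = (a, data[a][i], b, data[b][i])
        covered, errors = stats.get(key, (0, 0))
        stats[key] = (covered + 1, errors + data['error'][i])

# Keep pairs that cover only errors (precision 1.0) and at least 2 orders
candidates = []
for (a, va, b, vb), (covered, errors) in stats.items():
    precision = errors / covered      # share of covered orders that are errors
    recall = errors / total_errors    # share of all errors the pair covers
    if covered >= 2 and precision == 1.0:
        candidates.append((recall, f"{a} == {va} and {b} == {vb}", covered))

# Rules covering the largest share of errors come first
candidates.sort(reverse=True)
for recall, rule, covered in candidates:
    print(f"{rule}: covers {covered} orders, recall {recall:.2f}")
```

The number of pairs grows quickly with the number of columns and categories, so on real data a minimum-coverage threshold matters. Rules backed by only a handful of orders should be treated as hypotheses for the expert, not as findings.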