Tags: python, pandas, statistics

Python: 3-sigma "anomaly" detector for order quantities and line counts


I want to analyze transactional data in an e-commerce context, with a focus on detecting atypical activity in order patterns. The data is grouped by a customer identifier, SoldTo, and for each group, we apply a straightforward statistical technique to detect anomalies based on the order quantities and line counts. Specifically, the steps are:

  1. Data Preparation: Ensuring the date column (Created_on) is in the correct format.
  2. Grouping Data: Grouping the data by the SoldTo field to isolate the transactions for each customer.
  3. Rolling Window Calculations: For each group, applying rolling window calculations to compute rolling averages and standard deviations for order quantities and line counts.
  4. 3-Sigma Rule Application: Using the 3-sigma rule to flag transactions that deviate significantly from the norm as atypical or suspicious (see the toy sketch after this list).
  5. Independent Processing: Processing each customer group (SoldTo) independently so that the detection mechanism is not influenced by data from other customers.
  6. Combining Results: After processing, writing the combined results to a .csv file.
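
To make the rolling 3-sigma idea concrete, here is a tiny self-contained sketch with made-up numbers (not from my data); a value is flagged when it exceeds the previous row's rolling mean plus three rolling standard deviations:

import pandas as pd

# Toy illustration of the rolling 3-sigma rule (hypothetical values)
qty = pd.Series([5, 5, 4, 6, 5, 40])             # the last value is the planted outlier
mean = qty.rolling(3, min_periods=1).mean()      # rolling average
std = qty.rolling(3, min_periods=1).std(ddof=0)  # rolling population std dev
trigger = mean + 3 * std                         # per-row 3-sigma threshold
print(qty.gt(trigger.shift()))                   # compare each value to the PRIOR row's
                                                 # threshold; only index 5 comes back True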

My issue: As a test, I feed in ONE known individual SoldTo that contains anomalous lines, and the code detects them, as intended! But when I introduce more than a single SoldTo, including the one with the known anomalous lines, those lines are no longer detected. Why would this be?

Here's what I've tried. Below is my code alongside (what I hope are) two convenient-to-load data sets: one with a single SoldTo, where my code detects the known anomalous transactions, and another with two SoldTo's, where my code no longer detects those known anomalous lines.

import pandas as pd

# Create DataFrame; 'fraud' is assumed to be one of the list-of-lists samples
# shown below, whose first row holds the column names
df = pd.DataFrame(fraud[1:], columns=fraud[0])

df['Created_on'] = pd.to_datetime(df['Created_on'])

# Group by 'SoldTo', 'Created_on' and 'Sales_Doc'
grouped = df.groupby(['SoldTo', 'Created_on', 'Sales_Doc']).agg(
    total_quantity=('Order_Quantity', 'sum'),
    line_count=('Sales_Doc', 'count')    # Modified this line so the provided data sets can be used.   Thanks @Timus
).reset_index()

# Compute rolling statistics and 3-sigma for each SoldTo group
grouped['avg_line'] = grouped.groupby('SoldTo')['line_count'].transform(lambda x: x.rolling(3, min_periods=1).mean())
grouped['ma_qty'] = grouped.groupby('SoldTo')['total_quantity'].transform(lambda x: x.rolling(3, min_periods=1).mean())
grouped['stDev_of_qty'] = grouped.groupby('SoldTo')['total_quantity'].transform(lambda x: x.rolling(3, min_periods=1).std(ddof=0))
grouped['stDev_of_lines'] = grouped.groupby('SoldTo')['line_count'].transform(lambda x: x.rolling(3, min_periods=1).std(ddof=0))

# Compute the 3-sigma thresholds
grouped['avg_qty_sigma_trigger'] = ((3 * grouped['stDev_of_qty']) + grouped['ma_qty'])
grouped['avg_line_sigma_trigger'] = ((3 * grouped['stDev_of_lines']) + grouped['avg_line'])

# Function to identify atypical rows based on 3-sigma rule within each SoldTo group
def identify_atypical(df):
    atypical_indices = []

    for sold_to, group in df.groupby('SoldTo'):
#        group = group.reset_index(drop=True) # Removed this line. Thx @Timus
        
        for i in range(len(group) - 1):
            current_row = group.iloc[i]
            next_row = group.iloc[i + 1]

            if (next_row['line_count'] > current_row['avg_line_sigma_trigger'] or
                next_row['total_quantity'] > current_row['avg_qty_sigma_trigger']):
                atypical_indices.append(group.index[i + 1])

    # Mark atypical rows in the dataframe
    df['is_atypical'] = False
    df.loc[atypical_indices, 'is_atypical'] = True

    return df, atypical_indices


# Identify atypical rows
grouped, atypical_indices = identify_atypical(grouped)

# Print the dataframe and indices of atypical rows
print("Atypical rows indices:", atypical_indices)
print("")

print(grouped)

# Filter atypical rows; a date-range variant is left commented out for reference
#check_these = grouped[(grouped['is_atypical'] == True) & (grouped['Created_on'] >= '2024-06-01')]
check_these = grouped[(grouped['is_atypical'] == True) & (grouped['total_quantity'] != 1) & (grouped['line_count'] != 1) ]
#check_these = grouped[(grouped['is_atypical'] == True)]

# Save the cleaned dataframe to a CSV file
check_these.sort_values(by='SoldTo', ascending=True).to_csv('order_behavior_analysis_3.csv', index=False)

When using this data, which has only a single SoldTo, the code returns results as needed: Atypical rows indices: [5, 6, 11]

[['SoldTo', 'Created_on', 'Sales_Doc', 'Order_Quantity'],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 2],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 2],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 17],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 4],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 2],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 11],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 6],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 4],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110893468, 11],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110893468, 10],
 ['59908158', Timestamp('2023-11-07 00:00:00'), 110902368, 33],
 ['59908158', Timestamp('2023-11-07 00:00:00'), 110902525, 10],
 ['59908158', Timestamp('2023-11-07 00:00:00'), 110902525, 4],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 8],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 6],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 16],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 4],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 1],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 8],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 3],
 ['59908158', Timestamp('2023-11-13 00:00:00'), 110966070, 52],
 ['59908158', Timestamp('2023-11-15 00:00:00'), 111035845, 15],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177113, 18],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177113, 5],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177887, 20],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177887, 11],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 4],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 8],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 22],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 4],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 16],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 12],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 18],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 3],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 18],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 7],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 22],
 ['59908158', Timestamp('2023-11-21 00:00:00'), 111446837, 9],
 ['59908158', Timestamp('2023-11-21 00:00:00'), 111446837, 13]]

But when using this data, which has two SoldTo values, the code returns additional rows yet no longer "detects" the known atypical rows, reporting back different indices: Atypical rows indices: [1, 6, 12, 14, 17, 5, 6, 11]

[['SoldTo', 'Created_on', 'Sales_Doc', 'Order_Quantity'],
 ['56619720', Timestamp('2023-01-13 00:00:00'), 108036530, 10],
 ['56619720', Timestamp('2023-01-13 00:00:00'), 108036530, 1],
 ['56619720', Timestamp('2023-03-03 00:00:00'), 108391209, 20],
 ['56619720', Timestamp('2023-03-03 00:00:00'), 108391209, 2],
 ['56619720', Timestamp('2023-04-13 00:00:00'), 108738953, 30],
 ['56619720', Timestamp('2023-07-24 00:00:00'), 109827151, 20],
 ['56619720', Timestamp('2023-09-20 00:00:00'), 110467726, 30],
 ['56619720', Timestamp('2023-10-11 00:00:00'), 110658107, 10],
 ['56619720', Timestamp('2023-11-10 00:00:00'), 110946376, 2],
 ['56619720', Timestamp('2023-11-10 00:00:00'), 110946376, 3],
 ['56619720', Timestamp('2023-11-10 00:00:00'), 110946376, 5],
 ['56619720', Timestamp('2023-12-13 00:00:00'), 111681360, 5],
 ['56619720', Timestamp('2023-12-19 00:00:00'), 111739909, 6],
 ['56619720', Timestamp('2023-12-19 00:00:00'), 111739909, 4],
 ['56619720', Timestamp('2023-12-19 00:00:00'), 111739909, 2],
 ['56619720', Timestamp('2023-12-19 00:00:00'), 111739909, 2],
 ['56619720', Timestamp('2024-01-25 00:00:00'), 112057996, 5],
 ['56619720', Timestamp('2024-02-23 00:00:00'), 112322261, 12],
 ['56619720', Timestamp('2024-03-07 00:00:00'), 112453024, 5],
 ['56619720', Timestamp('2024-03-25 00:00:00'), 112625572, 5],
 ['56619720', Timestamp('2024-03-25 00:00:00'), 112625572, 3],
 ['56619720', Timestamp('2024-03-27 00:00:00'), 112651496, 2],
 ['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 5],
 ['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 5],
 ['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 2],
 ['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 2],
 ['56619720', Timestamp('2024-04-26 00:00:00'), 112942567, 2],
 ['56619720', Timestamp('2024-05-09 00:00:00'), 113200232, 2],
 ['56619720', Timestamp('2024-05-22 00:00:00'), 113359192, 2],
 ['56619720', Timestamp('2024-06-10 00:00:00'), 113534221, 1],
 ['56619720', Timestamp('2024-06-10 00:00:00'), 113534221, 34],
 ['56619720', Timestamp('2024-06-10 00:00:00'), 113534221, 20],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 4],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 2],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 2],
 ['59908158', Timestamp('2023-11-02 00:00:00'), 110866572, 17],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 4],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 2],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 11],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 6],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110884032, 4],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110893468, 11],
 ['59908158', Timestamp('2023-11-06 00:00:00'), 110893468, 10],
 ['59908158', Timestamp('2023-11-07 00:00:00'), 110902368, 33],
 ['59908158', Timestamp('2023-11-07 00:00:00'), 110902525, 10],
 ['59908158', Timestamp('2023-11-07 00:00:00'), 110902525, 4],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 8],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110929917, 6],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 16],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 10],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 4],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 1],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 20],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 8],
 ['59908158', Timestamp('2023-11-09 00:00:00'), 110930046, 3],
 ['59908158', Timestamp('2023-11-13 00:00:00'), 110966070, 52],
 ['59908158', Timestamp('2023-11-15 00:00:00'), 111035845, 15],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177113, 18],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177113, 5],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177887, 20],
 ['59908158', Timestamp('2023-11-16 00:00:00'), 111177887, 11],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 4],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 8],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 22],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 4],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 16],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 12],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 18],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 3],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 18],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 20],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 7],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 10],
 ['59908158', Timestamp('2023-11-20 00:00:00'), 111430236, 22],
 ['59908158', Timestamp('2023-11-21 00:00:00'), 111446837, 9],
 ['59908158', Timestamp('2023-11-21 00:00:00'), 111446837, 13]]

Thanks for any help interpreting my path forward.


Solution

  • As I have written in the comments, I am pretty sure that the resetting of the group indices in the function identify_atypical is messing up the final update of the original dataframe: after reset_index(drop=True), group.index[i + 1] yields group-local labels 0..n-1, so df.loc[atypical_indices, ...] marks those labels in the combined frame rather than the intended rows. With a single SoldTo, group-local and global labels coincide, which is why your one-customer test passes. So I'd suggest that you try something like the following:

    def identify_atypical(df):
        atypical_idxs = set()
        for _, group in df.groupby('SoldTo'):
            m = (
                group['line_count'].gt(group['avg_line_sigma_trigger'].shift())
                | group['total_quantity'].gt(group['avg_qty_sigma_trigger'].shift())
            )
            atypical_idxs.update(group[m].index)
        return df.assign(is_atypical=df.index.isin(atypical_idxs)), atypical_idxs
    

    (I can't run a full test because the sample is incomplete, but on a reduced dataframe it looked fine.)
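
    To see why the single-customer test passed while the combined run failed, here is a minimal demonstration (a toy frame with hypothetical values) of how reset_index(drop=True) produces group-local labels that point df.loc at the wrong rows of the combined frame:

        import pandas as pd

        df = pd.DataFrame({'SoldTo': ['A', 'A', 'B', 'B'], 'x': [1, 9, 1, 9]})
        hits = []
        for _, g in df.groupby('SoldTo'):
            g = g.reset_index(drop=True)       # group-local labels 0..n-1
            hits.extend(g.index[g['x'] > 5])   # collects local labels, not global ones
        print(hits)                            # [1, 1] -- group B's hit collapses onto
                                               # row 1, so df.loc[hits, ...] never marks row 3

    With a single group, local and global labels coincide, which is exactly why the one-SoldTo test looked correct. For what it's worth, the per-group comparison can also be done without the explicit loop via GroupBy.shift, which shifts within each SoldTo group (a sketch under the same assumptions, not tested on the full sample):

        def identify_atypical(df):
            # Shift each trigger down one row WITHIN its SoldTo group, so every
            # row is compared against the previous row's 3-sigma thresholds.
            prev_line = df.groupby('SoldTo')['avg_line_sigma_trigger'].shift()
            prev_qty = df.groupby('SoldTo')['avg_qty_sigma_trigger'].shift()
            m = df['line_count'].gt(prev_line) | df['total_quantity'].gt(prev_qty)
            return df.assign(is_atypical=m), list(df.index[m])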