python dataframe machine-learning insert outliers

How to manually insert outliers in the dummy dataset

I have created a dummy dataset with the code below:

from datetime import datetime
import numpy as np
import pandas as pd
from faker import Faker

fake = Faker()


def make_workers() -> list:
    status_list = ['in', 'out']
    room_list = ['FL1_RM1', 'FL1_RM2', 'FL1_RM3', 'FL1_RM4', 'FL2_RM1', 'FL2_RM2', 'FL2_RM3', 'FL2_RM4', 'FL3_RM1',
                 'FL3_RM2', 'FL3_RM3', 'FL3_RM4', 'FL4_RM1', 'FL4_RM2', 'FL4_RM3', 'FL4_RM4']
    Property = ['B1', 'B2', 'B3', 'B4']
    d1 = datetime.strptime('03/01/2022', '%m/%d/%Y')
    d2 = datetime.strptime('08/08/2022', '%m/%d/%Y')
    timestamps = pd.date_range(d1, d2, freq="1min")
    return [{**elem, **{"Floor_Number": elem.get("room_id")[2]}} for elem in [
        {'ID'          : fake.random_number(digits=6),
         'Property num': np.random.choice(Property, p=[0.25, 0.25, 0.25, 0.25]),
         'room_id'     : np.random.choice(room_list),
         'Temp'        : np.random.randint(low=35, high=50),
         'noted Date'  : timestamps[x],
         'Status'      : np.random.choice(status_list),
         'Humidity'    : np.random.uniform(low=-35.09, high=70.00),
         'Dust'        : np.random.randint(low=2, high=5),
         'CO2 level'   : np.random.uniform(low=350.09, high=450.00)
         } for x in range(len(timestamps))]]


worker_df = pd.DataFrame(make_workers())
worker_df.head(30)

Now I want to insert some outliers into the Temp and Humidity columns, i.e. values that fall outside the ranges specified in the code. For example, according to my initial specs the Temp column can only take values in the range 35 to 50, so its outliers should have values >50 or <35. The same idea applies to Humidity.

Solution

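How about something like this? d_min and d_max bound the random offset that is added to or subtracted from a value, and mutate_chance is the probability that any given value is mutated. All three can be set arbitrarily to get whatever data looks good for what you're doing. Inserting random values directly may be quicker, since it would avoid apply (which can be slow for large datasets), but this ran in 0.4s for me with your dataset.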

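def mutate(n, d_min, d_max, mutate_chance=0.05, round=True):
    # Note: the `round` flag shadows the built-in round() inside this function.
    r = np.random.random()  # decides whether, and in which direction, to mutate
    d = d_min + np.random.random() * (d_max - d_min)  # offset in [d_min, d_max)
    if r < mutate_chance / 2:
        # mutate high (int() truncates toward zero when round=True)
        return int(n + d) if round else n + d
    elif r < mutate_chance:
        # mutate low
        return int(n - d) if round else n - d

    # no mutation: keep the original value
    return n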

oldTemp = worker_df['Temp']
oldHumidity = worker_df['Humidity']
worker_df['Temp'] = oldTemp.apply(lambda n: mutate(n, 20, 30))
worker_df['Humidity'] = oldHumidity.apply(lambda n: mutate(n, 34, 72, round=False))

Now if we run

print((oldTemp == worker_df['Temp']).value_counts())

we will get a table showing how many values have stayed the same (True) or changed (False). When I ran it, 11419 values became outliers and 218982 remained the same.
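Comparing against the old column shows what changed, but not whether a new value actually lies outside the intended range. A minimal sketch of a range-based check follows, assuming the bounds from the question (np.random.randint(low=35, high=50) produces 35 to 49 inclusive). The small frame `df` is a hypothetical stand-in for worker_df after mutation.

```python
import pandas as pd

# Hypothetical miniature frame standing in for worker_df after mutation.
df = pd.DataFrame({
    "Temp": [36, 72, 49, 12, 41],
    "Humidity": [10.5, -60.2, 69.9, 101.3, 0.0],
})

# Series.between is inclusive on both ends by default, so negating it
# flags everything outside the original generation range.
temp_out = ~df["Temp"].between(35, 49)
hum_out = ~df["Humidity"].between(-35.09, 70.0)

# Rows where at least one of the two columns is out of range.
print(df[temp_out | hum_out])

# Number of out-of-range values per column.
print(temp_out.sum(), hum_out.sum())
```

The same masks can be applied to worker_df directly by replacing `df` with it.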

To see specifically which values have changed, we can run

print(oldTemp[oldTemp != worker_df['Temp']])
print(worker_df['Temp'][oldTemp != worker_df['Temp']])
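One caveat with offset-based mutation: the result is only guaranteed to leave the original range when d_min exceeds the width of that range. That holds for Temp (values 35 to 49, offsets 20 to 30) but not for Humidity (width of about 105, offsets 34 to 72), so some mutated humidity values can still land between -35.09 and 70. As an alternative, here is a rough vectorised sketch that avoids apply and draws outliers directly from beyond the bounds. The function name `insert_outliers` and its parameters `spread` and `frac` are made up for this illustration, and `demo` is a small stand-in for worker_df.

```python
import numpy as np
import pandas as pd


def insert_outliers(values, low, high, spread, frac=0.05, rng=None):
    """Replace roughly `frac` of `values` with numbers outside [low, high].

    Hypothetical helper: outliers are placed between 1 and `spread` units
    beyond the nearest bound, so `spread` must be greater than 1.
    Returns the new Series (as floats) and the boolean mask of replaced rows.
    """
    rng = np.random.default_rng() if rng is None else rng
    out = values.to_numpy(dtype=float, copy=True)
    n = len(out)
    mask = rng.random(n) < frac            # rows that become outliers
    go_high = rng.random(n) < 0.5          # about half above, half below
    offset = rng.uniform(1.0, spread, n)   # distance beyond the bound
    hi_rows = mask & go_high
    lo_rows = mask & ~go_high
    out[hi_rows] = high + offset[hi_rows]
    out[lo_rows] = low - offset[lo_rows]
    return pd.Series(out, index=values.index, name=values.name), mask


rng = np.random.default_rng(0)
demo = pd.DataFrame({
    "Temp": rng.integers(35, 50, size=1000),          # 35..49, as in the question
    "Humidity": rng.uniform(-35.09, 70.0, size=1000),
})

temp, temp_mask = insert_outliers(demo["Temp"], 35, 49, spread=30, rng=rng)
hum, hum_mask = insert_outliers(demo["Humidity"], -35.09, 70.0, spread=40, rng=rng)

demo["Temp"] = temp.round().astype(int)   # keep Temp as integers
demo["Humidity"] = hum

print(temp_mask.sum(), hum_mask.sum())    # how many outliers were inserted
```

Because the mask is returned alongside the data, the inserted outliers can later be used as ground-truth labels when evaluating an outlier detector.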