I have created a dummy dataset with the code below:
from datetime import datetime
import numpy as np
import pandas as pd
from faker import Faker
fake = Faker()
def make_workers() -> list:
    status_list = ['in', 'out']
    room_list = ['FL1_RM1', 'FL1_RM2', 'FL1_RM3', 'FL1_RM4', 'FL2_RM1', 'FL2_RM2', 'FL2_RM3', 'FL2_RM4', 'FL3_RM1',
                 'FL3_RM2', 'FL3_RM3', 'FL3_RM4', 'FL4_RM1', 'FL4_RM2', 'FL4_RM3', 'FL4_RM4']
    Property = ['B1', 'B2', 'B3', 'B4']
    d1 = datetime.strptime('03/01/2022', '%m/%d/%Y')
    d2 = datetime.strptime('08/08/2022', '%m/%d/%Y')
    timestamps = pd.date_range(d1, d2, freq="1min")
    return [{**elem, **{"Floor_Number": elem.get("room_id")[2]}} for elem in [
        {'ID' : fake.random_number(digits=6),
         'Property num': np.random.choice(Property, p=[0.25, 0.25, 0.25, 0.25]),
         'room_id' : np.random.choice(room_list),
         'Temp' : np.random.randint(low=35, high=50),
         'noted Date' : timestamps[x],
         'Status' : np.random.choice(status_list),
         'Humidity' : np.random.uniform(low=-35.09, high=70.00),
         'Dust' : np.random.randint(low=2, high=5),
         'CO2 level' : np.random.uniform(low=350.09, high=450.00)
         } for x in range(len(timestamps))]]
worker_df = pd.DataFrame(make_workers())
worker_df.head(30)
A sample of the dataset is shown in the picture below. Now I want to insert some outliers into the Temp and Humidity columns, i.e. values that fall outside the ranges specified in the code. For example, per my initial specs the Temp column can only take values in the range 35 to 50, so its outliers should have values >50 or <35. The same idea applies to Humidity.
How about something like this? d_min, d_max and mutate_chance can be set arbitrarily to get whatever data looks good for what you're doing. Inserting random values directly may be quicker, as it could avoid apply (which can be slow for large datasets), but this ran in 0.4s for me with your dataset.
def mutate(n, d_min, d_max, mutate_chance=0.05, round=True):
    r = np.random.random()
    d = d_min + np.random.random() * (d_max - d_min)
    if r < mutate_chance / 2:
        # mutate high
        return int(n + d) if round else n + d
    elif r < mutate_chance:
        # mutate low
        return int(n - d) if round else n - d
    return n
oldTemp = worker_df['Temp']
oldHumidity = worker_df['Humidity']
worker_df['Temp'] = oldTemp.apply(lambda n: mutate(n, 20, 30))
worker_df['Humidity'] = oldHumidity.apply(lambda n: mutate(n, 34, 72, round=False))
Now if we run
print((oldTemp == worker_df['Temp']).value_counts())
we will get a table showing how many values have stayed the same (True) or changed (False). When I ran it I had 11419 become outliers and 218982 remained the same.
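One caveat: a changed value is not necessarily outside the original range. With the Humidity settings above, a low reading such as -30 shifted up by 40 lands at 10, which is still between -35.09 and 70. A minimal sketch for counting values that are genuinely out of range, using a hypothetical helper count_outliers and a few made-up readings rather than the real worker_df:

```python
import pandas as pd


def count_outliers(values: pd.Series, low: float, high: float) -> int:
    """Count entries strictly below `low` or strictly above `high`."""
    return int(((values < low) | (values > high)).sum())


# Made-up humidity readings (not taken from worker_df).
# 10.0 stands for a mutated value that still sits inside the range.
humidity = pd.Series([-30.0, 10.0, 95.5, -80.2, 69.9])

# Only 95.5 and -80.2 fall outside [-35.09, 70.0].
print(count_outliers(humidity, -35.09, 70.0))  # prints 2
```

Calling the same helper as count_outliers(worker_df['Humidity'], -35.09, 70.0) after the mutation should show how many of the changed rows really are outliers.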
To see specifically which ones have changed, we can do:
print(oldTemp[oldTemp != worker_df['Temp']])
print(worker_df['Temp'][oldTemp != worker_df['Temp']])
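The alternative mentioned earlier, inserting random values directly instead of going through apply, could be sketched roughly as follows. This is not the mutate function above but a different approach: it picks rows with a boolean mask and overwrites them with values placed beyond the lower or upper bound, so every selected row is guaranteed to leave the range. The function name insert_outliers and the parameters frac, max_offset and seed are assumptions made for illustration, and the result is returned as floats even for an integer column.

```python
import numpy as np
import pandas as pd


def insert_outliers(values, low, high, frac=0.05, max_offset=30.0, seed=None):
    """Return a float copy of `values` with roughly `frac` of the entries
    replaced by numbers outside the closed range [low, high]."""
    rng = np.random.default_rng(seed)
    out = values.to_numpy(dtype=float, copy=True)
    n = len(out)

    mask = rng.random(n) < frac                     # rows that become outliers
    go_high = rng.random(n) < 0.5                   # direction of each outlier
    offset = rng.uniform(1.0, max_offset, size=n)   # distance beyond the bound

    high_rows = mask & go_high
    low_rows = mask & ~go_high
    out[high_rows] = high + offset[high_rows]       # strictly above `high`
    out[low_rows] = low - offset[low_rows]          # strictly below `low`
    return pd.Series(out, index=values.index, name=values.name)


# Stand-in for worker_df['Temp']: integers from 35 to 49, like np.random.randint(35, 50).
temp = pd.Series(np.random.default_rng(0).integers(35, 50, size=10_000), name="Temp")
temp_out = insert_outliers(temp, 35, 50, seed=1)

changed = temp_out != temp
outside = (temp_out < 35) | (temp_out > 50)
print(changed.sum(), outside.sum())  # the two counts match
```

Because the replacement values are built from the bounds rather than from the existing reading, the same call should work for Humidity with low=-35.09 and high=70.0 without any tuning of the offsets.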