I have a simple python code for data cleaning. This code imports data from Excel that has the format like this:
product cost used_by prime
name price gender yes or no
name price gender yes or no
... and so on
afterward I will get a mylist
using the class function that looks something like this:
mylist = [Item(comic,20.0,male,yes),
Item(paint,14.0,male,no),
Item(pen,5.0,female,nan),
Item(phone case,9.0,nan,no),
Item(headphone,40.0,male,yes),
Item(coat,nan,male,no),
Item(comic,15.0,male,yes),
Item(nan,15.0,male,no)
... and so on]
and after all the filter and cleaning I will get a result that looks like this:
result = [Item(comic,20.0,male,yes),
Item(underwear,15.0,male,yes),
Item(comic,15.0,male,yes)
...
Item(underwear,15.0,male,yes),
...and so on]
Here is the code I got so far:
import os
import pandas as pd
import math
cwd = os.path.abspath('')
files = os.listdir(cwd)
df = pd.DataFrame()
for file in files:
if file.endswith('.XLSX'):
df = df.append(pd.read_excel(file), ignore_index=True)
df = df.where(df.notnull(), None)
array = df.values.tolist()
print(array)
class Item():
def has_all_properties(self):
return bool(self.__name and not math.isnan(self.__cost) and self.__gender and self.__prime)
def clean(self):
return bool(self.__name and self.__cost <=20 and self.__gender == "male" and self.__prime == "yes")
def __init__(self, name, cost, gender, prime):
self.__name = name
self.__cost = cost
self.__gender = gender
self.__prime = prime
def __repr__(self):
return f"Item({self.__name},{self.__cost},{self.__gender},{self.__prime})"
def __tuple__(self):
return self.__name, self.__cost, self.__gender, self.__prime
mylist = [Item(*k) for k in array]
filtered = filter(Item.has_all_properties, mylist)
clean = filter(Item.clean, filtered)
result = list(clean)
t_list = [obj.__tuple__() for obj in result]
print(t_list)
output = pd.DataFrame(t_list, columns =['name', 'cost', 'gender', 'prime'])
print(output)
output.to_excel('clean_data.xlsx', index = False, header = True)
In the result, there are two type of repetitive data, one is like the underwear
which have two exact same lines; and the other one is like the comic
, with different cost values.
So what I want to do is remove one of the line that are exact the same for case one and keep the line that has the smaller cost value for case 2.
So for case two, I am think of reading the product to identify if they are the same and if they are I then compare their cost and keep the one with smaller value. But I am not sure if that is the right way of doing it.
I am aware that using pandas all the way work but I wish to explore the class function and use the customized data frame.
Can you suggest how to do this?
You could use a Set
instead of a list, i.e., changing myList = [Item(*k) for k in array]
to mySet = {Item(*k) for k in array}
.
Sets do not allow duplicates.
EDIT
Make sure to include implementations of __hash__
and __eq__
in your Item
class so that set can know how to determine whether an item has a duplicate.
In your case, the __eq__
would look something like the following:
def __eq__(self, other):
self.__name == other.__name
self.__cost == other.__cost
self.__gender == other.__gender
self.__prime == other.__prime