Search code examples
pythondataframedata-cleaningclass-method

How do I delete repeated elements in a dataframe made by a class-method function?


I have a simple python code for data cleaning. This code imports data from Excel that has the format like this:

    product cost   used_by prime
    name    price  gender  yes or no
    name    price  gender  yes or no
    ... and so on 

afterward I will get a mylist using the class function that looks something like this:

mylist =    [Item(comic,20.0,male,yes), 
             Item(paint,14.0,male,no), 
             Item(pen,5.0,female,nan), 
             Item(phone case,9.0,nan,no), 
             Item(headphone,40.0,male,yes), 
             Item(coat,nan,male,no),
             Item(comic,15.0,male,yes), 
             Item(nan,15.0,male,no)
             ... and so on]

and after all the filter and cleaning I will get a result that looks like this:

result = [Item(comic,20.0,male,yes), 
          Item(underwear,15.0,male,yes), 
          Item(comic,15.0,male,yes)
          ... 
          Item(underwear,15.0,male,yes), 
          ...and so on]

Here is the code I got so far:

import os
import pandas as pd
import math
cwd = os.path.abspath('') 
files = os.listdir(cwd)  
df = pd.DataFrame()
for file in files:
    if file.endswith('.XLSX'):
        df = df.append(pd.read_excel(file), ignore_index=True)

df = df.where(df.notnull(), None)
array = df.values.tolist()
print(array)

class Item():
    def has_all_properties(self):
        return bool(self.__name and not math.isnan(self.__cost) and self.__gender and self.__prime)
    def clean(self):
        return bool(self.__name and self.__cost <=20 and self.__gender == "male" and self.__prime == "yes")
    
    def __init__(self, name, cost, gender, prime):
        self.__name = name
        self.__cost = cost
        self.__gender = gender
        self.__prime = prime

    def __repr__(self):
        return f"Item({self.__name},{self.__cost},{self.__gender},{self.__prime})"

    def __tuple__(self): 
         return self.__name, self.__cost, self.__gender, self.__prime

mylist = [Item(*k) for k in array]

filtered = filter(Item.has_all_properties, mylist)
clean = filter(Item.clean, filtered)
result = list(clean)

t_list = [obj.__tuple__() for obj in result]
print(t_list)

output = pd.DataFrame(t_list, columns =['name', 'cost', 'gender', 'prime'])
print(output)
output.to_excel('clean_data.xlsx', index = False, header = True)

In the result, there are two type of repetitive data, one is like the underwear which have two exact same lines; and the other one is like the comic, with different cost values.

So what I want to do is remove one of the line that are exact the same for case one and keep the line that has the smaller cost value for case 2.

So for case two, I am think of reading the product to identify if they are the same and if they are I then compare their cost and keep the one with smaller value. But I am not sure if that is the right way of doing it.

I am aware that using pandas all the way work but I wish to explore the class function and use the customized data frame.

Can you suggest how to do this?


Solution

  • You could use a Set instead of a list, i.e., changing myList = [Item(*k) for k in array] to mySet = {Item(*k) for k in array}.

    Sets do not allow duplicates.

    EDIT Make sure to include implementations of __hash__ and __eq__ in your Item class so that set can know how to determine whether an item has a duplicate.

    In your case, the __eq__ would look something like the following:

    def __eq__(self, other):
      self.__name == other.__name
      self.__cost == other.__cost
      self.__gender == other.__gender
      self.__prime == other.__prime