I have the following dataset:
date | name
2018-06-01 | John
2018-06-01 | Jake
2018-06-01 | Allan
2018-07-01 | Kate
2018-07-01 | Jake
2018-07-01 | Joe
(The date is set as index)
What I'm trying to achieve is: - for each period group (for instance a month) - identify elements (names) that were introduced or removed
So the resulting dataset as I imagine would look as:
date | action | name
2018-06-01 | Joiner | John
2018-06-01 | Joiner | Jake
2018-06-01 | Joiner | Allan
2018-07-01 | Joiner | Kate
2018-07-01 | Joiner | Joe
2018-07-01 | Withdrawal | John
2018-07-01 | Withdrawal | Allan
I am using Pandas dataframes. Semi-logical explanation of what i think i need to do is:
set = [] # resulting data set
for date, new_df in df.groupby(level=0): # breaking down the original set based the data into subsets
for i in new_df: # iterating through each subset
if (i in new_df-1): # comparing elements of each subset to the previous one
set_element = {definition of a "joiner element"}
set.append(set_element) # adding an element to a resulting data set
else:
if (i not in new_df-1):
set_element = {definition of a "withdrawal element"}
else:
pass
at the moment I'm getting
"unsupported operand type(s) for -: 'str' and 'int'"
So something tells me I can't use sets for string comparison... or i'm comparing wrong elements...
I think I was able to solve it somewhat on my own:
b = []
c = []
result = []
row = []
for date, new_df in df.groupby(level=0):
a = set(new_df['name']).difference(b)
for i in a:
row = ({'date':date, 'action':'Joiner', 'name':i})
result.append(row)
c = set(b).difference(set(new_df['name']))
for i in c:
row = ({'date':date, 'action':'Withdrawal', 'name':i})
result.append(row)
b = set(new_df['name'])
res_df = pd.DataFrame(data=result).set_index('date')