Search code examples
pythonstringdataframesetcomparison

Comparing string elements within a time series in Python


I have the following dataset:

date       | name
2018-06-01 | John
2018-06-01 | Jake
2018-06-01 | Allan
2018-07-01 | Kate
2018-07-01 | Jake
2018-07-01 | Joe

(The date is set as index)

What I'm trying to achieve is: - for each period group (for instance a month) - identify elements (names) that were introduced or removed

So the resulting dataset as I imagine would look as:

date       | action | name
2018-06-01 | Joiner | John
2018-06-01 | Joiner | Jake
2018-06-01 | Joiner | Allan
2018-07-01 | Joiner | Kate
2018-07-01 | Joiner | Joe
2018-07-01 | Withdrawal | John
2018-07-01 | Withdrawal | Allan

I am using Pandas dataframes. Semi-logical explanation of what i think i need to do is:

set = [] # resulting data set
for date, new_df in df.groupby(level=0): # breaking down the original set based the data into subsets
    for i in new_df: # iterating through each subset
        if (i in new_df-1): # comparing elements of each subset to the previous one
           set_element = {definition of a "joiner element"} 
           set.append(set_element) # adding an element to a resulting data set
        else:
            if (i not in new_df-1):
               set_element = {definition of a "withdrawal element"}
            else:
                pass

at the moment I'm getting

"unsupported operand type(s) for -: 'str' and 'int'" 

So something tells me I can't use sets for string comparison... or i'm comparing wrong elements...


Solution

  • I think I was able to solve it somewhat on my own:

    b = []
    c = []
    result = []
    row = []
    for date, new_df in df.groupby(level=0):
        a = set(new_df['name']).difference(b)
        for i in a:
            row = ({'date':date, 'action':'Joiner', 'name':i})
            result.append(row)  
        c = set(b).difference(set(new_df['name']))
        for i in c:
            row = ({'date':date, 'action':'Withdrawal', 'name':i})
            result.append(row)
        b = set(new_df['name'])
    
    res_df = pd.DataFrame(data=result).set_index('date')