Search code examples
pythonpandasdataframepandas-groupby

Pandas dataframe intersection with varying groups


I have a large pandas dataframe with varying rows and columns but looks more or less like:

time   id    angle  ...
0.0    a1    33.67  ...
0.0    b2    35.90  ...
0.0    c3    42.01  ...
0.0    d4    45.00  ...
0.1    a1    12.15  ...
0.1    b2    15.35  ...
0.1    c3    33.12  ...
0.2    a1    65.28  ...
0.2    c3    87.43  ...
0.3    a1    98.85  ...
0.3    c3    100.12 ...
0.4    a1    11.11  ...
0.4    c3    83.22  ...
...

I am trying to aggregate the id's together and then find id's that have in common time-intervals. I have tried using pandas groupby and can easily group them by id and get their respective groups with information. How can I then take it a step further to find id's that also have the same time stamps?

Ideally I'd like to return intersection of certain fixed time intervals (2-3 seconds) for similar id's with the fixed time interval overlap:

time  id  angle  ...
0.0   a1  33.67  ...
0.1   a1  12.15  ...
0.2   a1  65.28  ...
0.3   a1  98.85  ...

0.0   c3  42.01  ...
0.1   c3  33.12  ...
0.2   c3  87.43  ...
0.3   c3  100.12 ...

Code tried so far:

#create pandas grouped by id
df1 = df.groupby(['id'], as_index=False)

Which outputs:

time  id  angle ...
(0.0   a1  33.67
...
0.4   a1  11.11)

(0.0  b2  35.90
0.1   b2  15.35)

(0.0  c3  42.01
...
0.4   c3  83.22)

(0.0  d4  45.00)

But I'd like to return only a dataframe where id and time are the same for a fixed interval, in the above example .4 seconds.

Any ideas on a fairly simple way to achieve this with pandas dataframes?


Solution

  • If need filter rows by some intervals - e.g. here between 0 and 0.4 and get all id which overlap use boolean indexing with Series.between first, then DataFrame.pivot:

    df1 = df[df['time'].between(0, 0.4)].pivot('time','id','angle')
    print (df1)
    id       a1     b2      c3    d4
    time                            
    0.0   33.67  35.90   42.01  45.0
    0.1   12.15  15.35   33.12   NaN
    0.2   65.28    NaN   87.43   NaN
    0.3   98.85    NaN  100.12   NaN
    0.4   11.11    NaN   83.22   NaN
    

    There are missing values for non overlap id, so remove columns with any NaNs by DataFrame.any and reshape to 3 columns by DataFrame.unstack and Series.reset_index:

    print (df1.dropna(axis=1))
    id       a1      c3
    time               
    0.0   33.67   42.01
    0.1   12.15   33.12
    0.2   65.28   87.43
    0.3   98.85  100.12
    0.4   11.11   83.22
    
    df2 = df1.dropna(axis=1).unstack().reset_index(name='angle')
    print (df2)
       id  time   angle
    0  a1   0.0   33.67
    1  a1   0.1   12.15
    2  a1   0.2   65.28
    3  a1   0.3   98.85
    4  a1   0.4   11.11
    5  c3   0.0   42.01
    6  c3   0.1   33.12
    7  c3   0.2   87.43
    8  c3   0.3  100.12
    9  c3   0.4   83.22