I'm now using Pandas in Python to handle some data.
Simplified DataFrame is
[ID, TimeDiff] and some other not-important columns.
For example:
73 1.166667
74 1.166667
75 2.183333
76 3.466667
77 2.666667
78 Na
TimeDiff means the difference value of time in data ID. 3.466667 means the time between ID 76 and ID 77 is 3.466667 hours. I want to split vessel data to make sure the difference value of time is within 2 hours so I need to split DataFrame into N different groups (in this example, N = 4).
I need to get results like this: whenever TimeDiff >= 2 create another group
--------------
73 1.166667
74 1.166667
-------------
75 2.183333
--------------
76 3.466667
--------------
77 NaN
I have tried to use Groupby in pandas.
df.groupby('TimeDiff')
But obviously, this is not what I want. I'm now trying to split DataFrame step by step like this: From
73 1.166667
74 1.166667
--------------
75 2.183333
76 3.466667
77 2.666667
78 Na
To
73 1.166667
74 1.166667
--------------
75 2.183333
--------------
76 3.466667
77 2.666667
78 Na
Then To
73 1.166667
74 1.166667
--------------
75 2.183333
--------------
76 3.466667
77 2.666667
78 Na
......
Finally to what I want:
--------------
73 1.166667
74 1.166667
-------------
75 2.183333
--------------
76 3.466667
--------------
77 NaN
4 groups Data. But after searching Google and StackOverflow I didn't find a proper way to handle it. Can somebody help me?
You can use cumsum
to create your groups:
df['Group'] = df['TimeDiff'].fillna(np.inf).ge(2).cumsum()
print(df)
# Output
ID TimeDiff Group
0 73 1.166667 0
1 74 1.166667 0
2 75 2.183333 1
3 76 3.466667 2
4 77 2.666667 3
5 78 NaN 4
Using groupby
:
>>> list(df.groupby(df['TimeDiff'].fillna(np.inf).ge(2).cumsum()))
[(0,
ID TimeDiff Group
0 73 1.166667 0
1 74 1.166667 0),
(1,
ID TimeDiff Group
2 75 2.183333 1),
(2,
ID TimeDiff Group
3 76 3.466667 2),
(3,
ID TimeDiff Group
4 77 2.666667 3),
(4,
ID TimeDiff Group
5 78 NaN 4)]