I’m running into a memory error that I think has a workaround, but I can’t figure it out. (MemoryError: Unable to allocate 39.6 GiB for an array with shape (5309349043,) and data type int64)
I’m using the answer in this post to count, for each row, the observations between two dates. The method works, but it makes the dataframe explode in size at the point where the intermediate dataframe “m” is created.
To put this in perspective, I have about 12,000 IDs, each with over 1,000 observations (and the real date offset is 6 months, as opposed to the 14 days in the working example below).
Some example data and the code I used are below, though they aren’t strictly necessary to understand the issue.
Would doing this one ID at a time in a loop work? If so, how could I do that?
import pandas as pd
import numpy as np

# Start of each row's window: 14 days before its date
df['Begin'] = df.Date + pd.DateOffset(days=-14)
# Self-merge on ID: one row per within-ID pair of observations
m = df.reset_index().merge(df[['ID', 'Date']].reset_index(), on='ID')
# Flag pairs whose right-hand date falls inside the left-hand row's window
m['to_count'] = m.Date_x.ge(m.Date_y) & m.Begin.le(m.Date_y)
df['dates_between'] = m.groupby('index_x').to_count.sum()
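For scale: the self-merge on ID creates one row per within-ID pair, so the length of m is the sum of the squared group sizes. A quick sketch to estimate that before merging (assuming df already holds the full dataset; pair_rows is just a name I made up):

# One merged row per within-ID pair of observations
pair_rows = df.groupby('ID').size().pow(2).sum()
print(pair_rows)  # ~5.3 billion here; at 8 bytes per int64 column, ~39.6 GiB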
ID Date Close
0 10000 2022-03-14 157.61
1 10000 2022-03-17 168.93
2 10000 2022-03-21 167.83
3 10000 2022-03-25 166.74
4 10000 2022-03-29 169.34
5 10000 2022-04-05 159.82
6 10000 2022-04-06 156.50
7 10000 2022-04-08 158.13
8 10000 2022-04-13 165.80
9 10000 2022-04-19 163.21
10 10000 2022-04-20 164.21
11 10000 2022-04-25 156.44
12 10000 2022-04-28 153.67
13 10000 2022-05-03 153.55
14 10000 2022-05-04 156.50
15 10000 2022-05-06 144.91
16 10000 2022-05-09 144.23
17 10000 2022-05-10 142.67
18 10000 2022-05-11 139.53
19 10000 2022-05-12 140.01
20 10000 2022-05-13 141.42
21 10000 2022-05-16 138.89
22 20000 2022-02-01 159.69
23 20000 2022-02-10 179.01
24 20000 2022-02-14 170.50
25 20000 2022-02-18 176.59
26 20000 2022-02-22 179.37
27 20000 2022-03-01 172.46
28 20000 2022-03-02 162.33
29 20000 2022-03-04 169.97
30 20000 2022-03-09 171.11
31 20000 2022-03-15 163.97
32 20000 2022-03-16 165.49
33 20000 2022-03-21 157.87
34 20000 2022-03-24 167.81
35 20000 2022-03-29 165.61
36 20000 2022-03-30 157.24
37 20000 2022-04-01 148.78
38 20000 2022-04-04 146.12
39 20000 2022-04-05 143.21
40 20000 2022-04-06 141.20
41 20000 2022-04-07 148.61
42 20000 2022-04-08 151.88
43 20000 2022-04-11 150.98
44 20000 2022-04-15 146.13
You definitely don't need to merge here; you can take advantage of numpy broadcasting to compare each date to all dates of the same group in a vectorized way:
def to_count(sr):
    date = sr.values
    # Pairwise differences within the group: diff[i, j] = date[i] - date[j]
    diff = date[:, None] - date
    # Keep pairs where date[j] falls within the 14 days up to and including date[i]
    mask = (pd.Timedelta('0D') <= diff) & (diff <= pd.Timedelta('14D'))
    return np.sum(mask, axis=1)

df['dates_between'] = df.groupby('ID')['Date'].transform(to_count)
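Memory-wise, this only ever materializes one group's pairwise mask at a time: an ID with 1,000 observations needs a 1,000 × 1,000 boolean array (about 1 MB), instead of a single pairwise table across all IDs at once.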
If you prefer to reuse your code with pd.DateOffset instead of pd.Timedelta, you can also use:
def to_count(sr):
    date = sr.values
    # Start of each row's window: 14 days before its date
    begin = sr.sub(pd.DateOffset(days=14)).values
    # Keep pairs where date[j] lies between begin[i] and date[i]
    mask = (begin[:, None] <= date) & (date <= date[:, None])
    return np.sum(mask, axis=1)

df['dates_between'] = df.groupby('ID')['Date'].transform(to_count)
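Note that since your real window is 6 months rather than 14 days, the pd.DateOffset form is the one to adapt: pd.Timedelta has no calendar-month unit, while pd.DateOffset(months=6) respects month lengths. A sketch of that substitution (the example output below still uses the 14-day window):

def to_count(sr):
    date = sr.values
    # Calendar-aware lower bound: 6 months before each observation
    begin = sr.sub(pd.DateOffset(months=6)).values
    mask = (begin[:, None] <= date) & (date <= date[:, None])
    return np.sum(mask, axis=1)

df['dates_between'] = df.groupby('ID')['Date'].transform(to_count)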
Output:
>>> df
ID Date Close dates_between
0 10000 2022-03-14 157.61 1
1 10000 2022-03-17 168.93 2
2 10000 2022-03-21 167.83 3
3 10000 2022-03-25 166.74 4
4 10000 2022-03-29 169.34 4
5 10000 2022-04-05 159.82 3
6 10000 2022-04-06 156.50 4
7 10000 2022-04-08 158.13 5
8 10000 2022-04-13 165.80 4
9 10000 2022-04-19 163.21 5
10 10000 2022-04-20 164.21 5
...
34 20000 2022-03-24 167.81 4
35 20000 2022-03-29 165.61 5
36 20000 2022-03-30 157.24 5
37 20000 2022-04-01 148.78 5
38 20000 2022-04-04 146.12 6
39 20000 2022-04-05 143.21 6
40 20000 2022-04-06 141.20 7
41 20000 2022-04-07 148.61 8
42 20000 2022-04-08 151.88 8
43 20000 2022-04-11 150.98 9
44 20000 2022-04-15 146.13 8
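If any single ID ever grows large enough that even its own n × n mask is a problem, a sort-based variant avoids broadcasting entirely. A minimal sketch (the helper name to_count_sorted is mine), assuming Date is sorted ascending within each ID, as in the sample data:

import numpy as np

def to_count_sorted(sr):
    date = sr.values
    # First index still inside each row's 14-day window; everything from
    # there up to the row itself is counted.
    start = np.searchsorted(date, date - np.timedelta64(14, 'D'), side='left')
    return np.arange(len(date)) - start + 1

df['dates_between'] = (
    df.sort_values(['ID', 'Date'])
      .groupby('ID')['Date']
      .transform(to_count_sorted)
)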