Search code examples
pythonpandasdataframemergeout-of-memory

How to avoid memory issue when merging pandas dataframe on itself?


I’m running into a memory error that I think may have a workaround, but I can’t figure it out. (MemoryError: Unable to allocate 39.6 GiB for an array with shape (5309349043, ) and data type int64)

I’m using the answer in this post to count observations between two dates. The method, however, makes the size of the dataframe explode at the point of creating the dataframe “m.”

To put this in perspective, I have about 12,000 IDs, each with over 1,000 observations (and the date offset is 6 months, as opposed to 14 days in the working example below).

Some example data and the code I used is below, though this isn’t necessary to understand the issue.

Would doing this on each ID and looping through work? If so, how could I do that?

import pandas as pd
import numpy as np

df['Begin'] = df.Date + pd.DateOffset(days=-14)
m = df.reset_index().merge(df[['ID', 'Date']].reset_index(), on='ID')
m['to_count'] = m.Date_x.ge(m.Date_y) & m.Begin.le(m.Date_y) 
df['dates_between'] = m.groupby('index_x').to_count.sum() 

    ID      Date        Close
0   10000   2022-03-14  157.61
1   10000   2022-03-17  168.93
2   10000   2022-03-21  167.83
3   10000   2022-03-25  166.74
4   10000   2022-03-29  169.34
5   10000   2022-04-05  159.82
6   10000   2022-04-06  156.50
7   10000   2022-04-08  158.13
8   10000   2022-04-13  165.80
9   10000   2022-04-19  163.21
10  10000   2022-04-20  164.21
11  10000   2022-04-25  156.44
12  10000   2022-04-28  153.67
13  10000   2022-05-03  153.55
14  10000   2022-05-04  156.50
15  10000   2022-05-06  144.91
16  10000   2022-05-09  144.23
17  10000   2022-05-10  142.67
18  10000   2022-05-11  139.53
19  10000   2022-05-12  140.01
20  10000   2022-05-13  141.42
21  10000   2022-05-16  138.89
22  20000   2022-02-01  159.69
23  20000   2022-02-10  179.01
24  20000   2022-02-14  170.50
25  20000   2022-02-18  176.59
26  20000   2022-02-22  179.37
27  20000   2022-03-01  172.46
28  20000   2022-03-02  162.33
29  20000   2022-03-04  169.97
30  20000   2022-03-09  171.11
31  20000   2022-03-15  163.97
32  20000   2022-03-16  165.49
33  20000   2022-03-21  157.87
34  20000   2022-03-24  167.81
35  20000   2022-03-29  165.61
36  20000   2022-03-30  157.24
37  20000   2022-04-01  148.78
38  20000   2022-04-04  146.12
39  20000   2022-04-05  143.21
40  20000   2022-04-06  141.20
41  20000   2022-04-07  148.61
42  20000   2022-04-08  151.88
43  20000   2022-04-11  150.98
44  20000   2022-04-15  146.13

Solution

  • You definitely don't need to merge here, you can take advantage of numpy broadcasting to compare each date to all dates of the same group in a vectorized way:

    def to_count(sr):
        date = sr.values
        diff = date[:, None] - date
        mask = (pd.Timedelta('0D') <= diff) & (diff <= pd.Timedelta('14D'))
        return np.sum(mask, axis=1)
    
    df['dates_between'] = df.groupby('ID')['Date'].transform(to_count)
    

    If you prefer to reuse your code with pd.DateOffset instead of pd.Timedelta, you can also use:

    def to_count(sr):
        date = sr.values
        begin = sr.sub(pd.DateOffset(days=14)).values
        mask = (begin[:, None] <= date) & (date <= date[:, None])
        return np.sum(mask, axis=1)
    
    df['dates_between'] = df.groupby('ID')['Date'].transform(to_count)
    

    Output:

    >>> df
           ID       Date   Close  dates_between
    0   10000 2022-03-14  157.61              1
    1   10000 2022-03-17  168.93              2
    2   10000 2022-03-21  167.83              3
    3   10000 2022-03-25  166.74              4
    4   10000 2022-03-29  169.34              4
    5   10000 2022-04-05  159.82              3
    6   10000 2022-04-06  156.50              4
    7   10000 2022-04-08  158.13              5
    8   10000 2022-04-13  165.80              4
    9   10000 2022-04-19  163.21              5
    10  10000 2022-04-20  164.21              5
    ...
    34  20000 2022-03-24  167.81              4
    35  20000 2022-03-29  165.61              5
    36  20000 2022-03-30  157.24              5
    37  20000 2022-04-01  148.78              5
    38  20000 2022-04-04  146.12              6
    39  20000 2022-04-05  143.21              6
    40  20000 2022-04-06  141.20              7
    41  20000 2022-04-07  148.61              8
    42  20000 2022-04-08  151.88              8
    43  20000 2022-04-11  150.98              9
    44  20000 2022-04-15  146.13              8