I have a dataframe (data) which contains a few dates (loss_date, report_date, good_date), and I'm trying to count certain rows of the dataframe. The following code works perfectly the first time I run it:
import numpy as np
from datetime import timedelta
# Set up bins
BUCKET_SIZE = 30
min_date = np.min(data.loss_date)
max_date = np.max(data.report_date)
num_days = (max_date - min_date).days
num_buckets = int(np.ceil(num_days / BUCKET_SIZE))
bounds = [min_date + timedelta(days=BUCKET_SIZE * i)
          for i in range(num_buckets + 1)]
starts = bounds[:-1]
ends = bounds[1:]
buckets = zip(starts, ends)
# Get data subset
l_data = data[data.good_date.notna()]
before = l_data[l_data.loss_date < l_data.good_date]
after = l_data[l_data.loss_date >= l_data.good_date]
# Define count function
def count_loss(df, start, end):
    is_start = df.loss_date >= start
    is_end = df.loss_date < end
    count = len(df[is_start & is_end].index)
    return count
# FIRST_TIME
count_before = [count_loss(before, s, e) for s,e in buckets]
But now when I run it again, e.g.
# CODE_AGAIN
count_after = [count_loss(after, s, e) for s,e in buckets]
I get the list [] as output. However, if I run the following:
# CODE_AGAIN (but redefining buckets)
buckets = zip(starts, ends)
count_after = [count_loss(after, s, e) for s,e in buckets]
I get a non-empty list. After running FIRST_TIME, the buckets zip becomes empty, and repeating buckets = zip(starts, ends) fixes the problem; i.e. CODE_AGAIN works as it should. I can't understand why!
Many thanks.
In short, the problem is your title concept: "zip variable". In Python 3, zip does not return a static list; it returns a lazy, single-use iterator.
buckets = zip(starts, ends)
buckets is an iterator object: it produces each (start, end) pair on demand and keeps no copy of the pairs it has already handed out. Once you've iterated through it, the iterator is exhausted; any further iteration produces no items at all, which is why your second list comprehension returns [].
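Here is a minimal sketch of the exhaustion. The two short integer lists are hypothetical stand-ins for the starts and ends lists in the question.

```python
# Hypothetical stand-ins for the question's starts and ends lists.
starts = [0, 30, 60]
ends = [30, 60, 90]

buckets = zip(starts, ends)

first_pass = [(s, e) for s, e in buckets]   # consumes every pair
second_pass = [(s, e) for s, e in buckets]  # nothing left to consume

print(first_pass)   # [(0, 30), (30, 60), (60, 90)]
print(second_pass)  # []

# Asking an exhausted iterator for one more item returns the supplied default.
leftover = next(buckets, "exhausted")
print(leftover)     # exhausted
```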
If you want to iterate multiple times, either re-create the zip expression on each use, or store it as a list:
buckets = list(zip(starts, ends))
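As a rough illustration of the list-based fix, the sketch below reuses one buckets list for two separate counts. The dates and the count_in_bucket helper are made up for the example, and plain Python lists stand in for the dataframe columns, so this mirrors the question's logic rather than reproducing it.

```python
from datetime import date, timedelta

BUCKET_SIZE = 30
min_date = date(2020, 1, 1)  # made-up start date for illustration

# Four boundaries give three 30-day buckets.
bounds = [min_date + timedelta(days=BUCKET_SIZE * i) for i in range(4)]

# Materialise the pairs once so they can be iterated any number of times.
buckets = list(zip(bounds[:-1], bounds[1:]))


def count_in_bucket(dates, start, end):
    """Count the dates d with start <= d < end (hypothetical helper)."""
    return sum(start <= d < end for d in dates)


# Made-up loss dates standing in for the before/after subsets.
before_dates = [date(2020, 1, 5), date(2020, 1, 20), date(2020, 2, 10)]
after_dates = [date(2020, 2, 15), date(2020, 3, 20)]

count_before = [count_in_bucket(before_dates, s, e) for s, e in buckets]
count_after = [count_in_bucket(after_dates, s, e) for s, e in buckets]

print(count_before)  # [2, 1, 0]
print(count_after)   # [0, 1, 1]
```

The list costs memory proportional to the number of buckets, which is negligible here; re-creating the zip each time avoids even that, at the price of having to remember to do it.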