
Zip variable becomes empty after running subsequent code


I have a dataframe (data) which contains a few dates (loss_date, report_date, good_date), and I'm trying to count certain rows of the dataframe. The following code works perfectly the first time I run it:

import numpy as np
from datetime import timedelta

# Set up bins
BUCKET_SIZE = 30
min_date = np.min(data.loss_date)
max_date = np.max(data.report_date)
num_days = (max_date - min_date).days
num_buckets = int(np.ceil(num_days/BUCKET_SIZE))

bounds = [min_date + timedelta(days=BUCKET_SIZE * i)
          for i in range(num_buckets + 1)]

starts = bounds[:-1]
ends = bounds[1:]
buckets = zip(starts, ends)

# Get data subset
l_data = data[data.good_date.notna()]
before = l_data[l_data.loss_date < l_data.good_date]
after = l_data[l_data.loss_date >= l_data.good_date]


# Define count function
def count_loss(df, start, end):
    is_start = df.loss_date >= start
    is_end = df.loss_date < end
    count = len(df[is_start & is_end].index)
    return count

# FIRST_TIME
count_before = [count_loss(before, s, e) for s,e in buckets]

But now when I run it again, e.g.

# CODE_AGAIN
count_after = [count_loss(after, s, e) for s,e in buckets]

I get the list [] as output. However, if I run the following:

# CODE_AGAIN (but redefining buckets)
buckets = zip(starts, ends)
count_after = [count_loss(after, s, e) for s,e in buckets]

I get a non-empty list. So after running FIRST_TIME, buckets seems to become empty, and repeating buckets = zip(starts, ends) fixes the problem; i.e. CODE_AGAIN then works as it should. I can't understand why!

Many thanks.


Solution

  • In short, the problem is the concept in your title: a "zip variable". In Python 3, zip does not return a static list; it returns a lazy, single-use iterator (a zip object).

    buckets = zip(starts, ends)
    

    buckets is an iterator: it produces the (start, end) pairs one at a time as you loop over it and keeps no copy of them. Once you've iterated through it, the iterator is exhausted; any further loop over it produces no items, which is why your second list comprehension returns [].

    If you want to iterate multiple times, either re-create the zip expression on each use, or store it as a list:

    buckets = list(zip(starts, ends))
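
As a minimal sketch of the behaviour described above: the snippet below uses small made-up starts and ends lists (hypothetical stand-ins for the ones in the question, not the asker's actual data) to show a zip object being exhausted after one pass, and a list built from it surviving repeated passes.

```python
from datetime import date, timedelta

# Hypothetical stand-ins for the question's starts/ends lists
starts = [date(2020, 1, 1) + timedelta(days=30 * i) for i in range(3)]
ends = [s + timedelta(days=30) for s in starts]

# zip returns a single-use iterator in Python 3
buckets = zip(starts, ends)
first_pass = [(s, e) for s, e in buckets]
second_pass = [(s, e) for s, e in buckets]  # iterator is already exhausted

# Materialising the pairs in a list allows repeated iteration
bucket_list = list(zip(starts, ends))
first_list_pass = [(s, e) for s, e in bucket_list]
second_list_pass = [(s, e) for s, e in bucket_list]

print(len(first_pass), len(second_pass))            # 3 0
print(len(first_list_pass), len(second_list_pass))  # 3 3
```

The list version holds all pairs in memory at once, which should be fine for a modest number of buckets; re-creating the zip on each use avoids that at the cost of repeating the expression.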