Tags: python, pandas, integer-division

OverflowError with pandas Series.apply


I have a function that works fine with individual values, but when I pass it to pandas Series.apply(), it raises an OverflowError.

from __future__ import division
import pandas as pd
import numpy as np

birthdays = pd.DataFrame(np.empty([365,2]), columns = ['k','probability'], index = range(1,366))
birthdays['k'] = birthdays.index

I make a function:

def probability_of_shared_bday(k):
    end_point = 366 - k
    numerator = 1
    for i in range(end_point, 366):
        numerator = numerator*i
    denominator = 365**k
    probability_of_no_match = (1 - numerator/denominator)
    return probability_of_no_match

When I try this out with individual integers, it works fine:

 probability_of_shared_bday(1)

0.0

 probability_of_shared_bday(100)

0.9999996927510721

But when I try and use this function with apply:

birthdays['probability'] = birthdays['k'].apply(probability_of_shared_bday, convert_dtype=False)

OverflowError: integer division result too large for a float

This happens regardless of whether convert_dtype is True or False.

Checking birthdays['k'].dtypes gives dtype('int64').
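
The int64 dtype may be the clue. What follows is a guess, not something established in this thread: if apply hands the function NumPy int64 scalars rather than plain Python ints (which of the two it passes may depend on the pandas version), then 365**k is evaluated in fixed-width 64-bit arithmetic and wraps around, while numerator, built from range(), stays an arbitrary-precision Python int. Dividing a huge numerator by a wrapped denominator could then produce a result too large for a float. A minimal sketch of the effect, where the int(k) conversion is a suggested workaround and not part of the original code:

```python
import numpy as np

k_python = 100            # plain Python int: arbitrary precision
k_numpy = np.int64(100)   # the kind of scalar an int64 column holds

exact = 365 ** k_python   # exact integer, hundreds of digits long

# With a NumPy scalar as the exponent, the power is computed as a 64-bit
# integer, so the true value cannot be represented and the result wraps.
with np.errstate(over="ignore"):
    wrapped = 365 ** k_numpy

print(type(wrapped).__name__)   # a NumPy integer type, not a Python int
print(int(wrapped) == exact)    # the wrapped value is not the true power


def probability_of_shared_bday(k):
    # Assumed workaround: force a Python int so 365**k stays exact.
    k = int(k)
    numerator = 1
    for i in range(366 - k, 366):
        numerator *= i
    # Python 3 true division of two big ints is fine here because the
    # quotient is at most 1.
    return 1 - numerator / 365**k


print(probability_of_shared_bday(np.int64(100)))
```

If this reading is right, the function works when called by hand only because the literal 100 is a Python int, not because of anything apply does to the arithmetic itself.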


Solution

  • I'm not sure why this fails only under apply, but the function is better written differently in any case. Here is a version that avoids dividing one huge integer by another:

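    def probability_of_shared_bday(k):
        end_point = 366 - k
        ratio = 1.0
        # Multiply k factors i/365, each at most 1, so the running
        # product stays an ordinary float instead of a huge integer.
        for i in range(end_point, 366):
            ratio *= i / 365.0  # 365.0 keeps this true division on Python 2 as well
        # ratio is the probability that all k birthdays are distinct.
        probability_of_match = 1 - ratio
        return probability_of_match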
    

    And the problem goes away!
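
Since the DataFrame needs the value for every k from 1 to 365, the same running-product idea can also be computed for all rows at once, without apply. This is a sketch rather than part of the original answer; it uses np.cumprod, and the names days, factors and probability_no_match are introduced here purely for illustration:

```python
import numpy as np
import pandas as pd

days = 365
k = np.arange(1, days + 1)

# Factor j (0-based) is (365 - j) / 365: the chance that the next person
# avoids the j birthdays already taken.
factors = (days - np.arange(days)) / days

# Entry k - 1 of the cumulative product is the probability that
# k people all have distinct birthdays.
probability_no_match = np.cumprod(factors)

birthdays = pd.DataFrame(
    {"k": k, "probability": 1 - probability_no_match},
    index=k,
)

print(birthdays.loc[[1, 23, 100]])
```

Every intermediate value is a float between 0 and 1, so no large integers are ever formed. The row for k = 23 comes out just above 0.5, the familiar birthday-problem threshold.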