Search code examples
pythonrpandasscipystatsmodels

Using python to compute relative risk (risk ratio) from a dataframe with support of the zepid package (simulate the riskratio from epitools r pack.)


I'm coming from this post and this website of the zEpid package. My goal is to compute the relative risk (aka risk ratio) from a pd.DataFrame. The independent variable has three levels (1,2, and 3) and my dependent variable (target) has two levels (1 and 0).

When I adapt the following code

import numpy as np
import pandas as pd
from scipy.stats import norm
from zepid import RiskRatio


# calculating p-value
est= rr.results['RiskRatio'][1]
std = rr.results['SD(RR)'][1]
z_score = np.log(est)/std
p_value = norm.sf(abs(z_score))*2

Python is returning:

    106         # Getting unique values and dropping reference
    107         vals = set(df[exposure].dropna().unique())
--> 108         vals.remove(self.reference)
    109         self._c = df.loc[(df[exposure] == self.reference) & (df[outcome] == 1)].shape[0]
    110         self._d = df.loc[(df[exposure] == self.reference) & (df[outcome] == 0)].shape[0]

KeyError: 0

Any help is appreciated.

The dataframe is below

df = pd.DataFrame(
{'age_group': {2759: 1, 3852: 1, 631: 1, 5152: 2, 5334: 2, 6364: 2, 5652: 2, 1636: 1, 2869: 1, 4654: 2, 1888: 1, 247: 1, 6699: 1, 6471: 2, 1760: 1, 4182: 1, 6095: 1, 48: 1, 6348: 1, 5129: 2, 4920: 2, 4590: 1, 5892: 2, 5131: 2, 1649: 2, 5940: 2, 3960: 1, 3060: 2, 4852: 1, 4605: 1, 3475: 1, 4406: 1, 1958: 1, 2170: 2, 6478: 1, 5328: 2, 4063: 3, 6827: 1, 5085: 1, 5155: 1, 4879: 2, 3185: 1, 32: 1, 4690: 1, 4109: 1, 4617: 1, 1048: 2, 747: 1, 995: 1, 6454: 1, 3302: 3, 5984: 2, 1127: 2, 2165: 2, 2025: 1, 4985: 2, 227: 3, 5802: 1, 4623: 1, 438: 1, 4401: 3, 7099: 1, 1149: 1, 6772: 1, 5567: 1, 873: 2, 2957: 1, 7060: 1, 4206: 2, 5239: 1, 1557: 1, 6080: 1, 411: 2, 2139: 1, 2408: 2, 1189: 2, 3295: 3, 4728: 3, 2490: 1, 4147: 1, 6768: 1, 6810: 1, 2901: 1, 3981: 2, 4941: 1, 3879: 2, 5819: 1, 6662: 2, 1589: 2, 6170: 1, 4522: 1, 552: 2, 5270: 1, 2722: 2, 34: 1, 5193: 1, 5767: 1, 2670: 1, 3298: 1, 5542: 1}, 'adhd_parent': {2759: 0, 3852: 0, 631: 0, 5152: 1, 5334: 1, 6364: 1, 5652: 1, 1636: 0, 2869: 0, 4654: 1, 1888: 0, 247: 0, 6699: 0, 6471: 1, 1760: 0, 4182: 0, 6095: 0, 48: 0, 6348: 0, 5129: 1, 4920: 1, 4590: 0, 5892: 1, 5131: 1, 1649: 1, 5940: 1, 3960: 0, 3060: 1, 4852: 0, 4605: 0, 3475: 0, 4406: 0, 1958: 0, 2170: 1, 6478: 0, 5328: 1, 4063: 1, 6827: 0, 5085: 0, 5155: 0, 4879: 1, 3185: 0, 32: 0, 4690: 0, 4109: 0, 4617: 0, 1048: 1, 747: 0, 995: 0, 6454: 0, 3302: 1, 5984: 1, 1127: 1, 2165: 1, 2025: 0, 4985: 1, 227: 1, 5802: 0, 4623: 0, 438: 0, 4401: 1, 7099: 0, 1149: 0, 6772: 0, 5567: 0, 873: 1, 2957: 0, 7060: 0, 4206: 1, 5239: 0, 1557: 0, 6080: 0, 411: 1, 2139: 0, 2408: 1, 1189: 1, 3295: 1, 4728: 1, 2490: 0, 4147: 0, 6768: 0, 6810: 0, 2901: 0, 3981: 1, 4941: 0, 3879: 1, 5819: 0, 6662: 1, 1589: 1, 6170: 0, 4522: 0, 552: 1, 5270: 0, 2722: 1, 34: 0, 5193: 0, 5767: 0, 2670: 0, 3298: 0, 5542: 0}}
)

Solution

  • The error is a result of how RiskRatio is parsing your input data set behind the scenes.

    When using RiskRatio, the default reference category is set to 0. So, when you independent variable is being processed internally, zEpid is looking for age_group=0. However, there are no instances of 0 in your data set.

    To fix this, you can specify the optional argument reference. By default reference=0 but you can set it to 1, which will set age_group=1 as the reference risk for the risk ratio.

    The following is a simple example with some simulated data with 'A' and 'Y'

    import numpy as np
    import pandas as pd
    from scipy.stats import norm
    from zepid import RiskRatio
    
    np.random.seed(20220120)
    df = pd.DataFrame()
    df['A'] = np.random.randint(1, 4, size=100)
    df['Y'] = np.random.binomial(n=1, p=0.25, size=100)
    
    # Generating some generic data
    np.random.seed(20220120)
    df = pd.DataFrame()
    df['A'] = np.random.randint(1, 4, size=80)           # Note: A \in {1,2,3}
    df['Y'] = np.random.binomial(n=1, p=0.25, size=80)   # Note: Y \in {0,1}
    
    # Estimating Risk Ratios with zEpid
    rr = RiskRatio(reference=1)
    rr.fit(df, exposure='A', outcome='Y')
    
    # Calculating P-values
    est = rr.results['RiskRatio'][1:]
    std = rr.results['SD(RR)'][1:]
    z_score = np.log(est)/std
    p_value = norm.sf(abs(z_score))*2
    
    # Displaying results
    print("RR:     ", list(est))
    print("P-value:", p_value)
    
    

    Which should output the following

    RR:      [1.0266666666666666, 0.7636363636363636]
    P-value: [0.93990517 0.5312407 ]
    

    I generated some generic data rather than use the example data set provided because there is another issue in that data that will result in an error. Below is a 2-by-3 table of the data set

    adhd_parent   0   1
    age_group          
    1            62   0
    2             0  32
    3             0   6
    

    These structural zeroes in the data will through a PositivityError in zEpid. Basically, you can't calculate the risk due to a division by zero (the risk in the referent is 0).