Using python to compute relative risk (risk ratio) from a dataframe with support of the zepid package (simulate the riskratio from epitools r pack.)

I'm coming from this post and this website of the zEpid package. My goal is to compute the relative risk (aka risk ratio) from a pd.DataFrame. The independent variable has three levels (1,2, and 3) and my dependent variable (target) has two levels (1 and 0).

When I adapt the following code

import numpy as np
import pandas as pd
from scipy.stats import norm
from zepid import RiskRatio


# calculating p-value
est= rr.results['RiskRatio'][1]
std = rr.results['SD(RR)'][1]
z_score = np.log(est)/std
p_value = norm.sf(abs(z_score))*2

Python is returning:

    106         # Getting unique values and dropping reference
    107         vals = set(df[exposure].dropna().unique())
--> 108         vals.remove(self.reference)
    109         self._c = df.loc[(df[exposure] == self.reference) & (df[outcome] == 1)].shape[0]
    110         self._d = df.loc[(df[exposure] == self.reference) & (df[outcome] == 0)].shape[0]

KeyError: 0

Any help is appreciated.

The dataframe is below

df = pd.DataFrame(
{'age_group': {2759: 1, 3852: 1, 631: 1, 5152: 2, 5334: 2, 6364: 2, 5652: 2, 1636: 1, 2869: 1, 4654: 2, 1888: 1, 247: 1, 6699: 1, 6471: 2, 1760: 1, 4182: 1, 6095: 1, 48: 1, 6348: 1, 5129: 2, 4920: 2, 4590: 1, 5892: 2, 5131: 2, 1649: 2, 5940: 2, 3960: 1, 3060: 2, 4852: 1, 4605: 1, 3475: 1, 4406: 1, 1958: 1, 2170: 2, 6478: 1, 5328: 2, 4063: 3, 6827: 1, 5085: 1, 5155: 1, 4879: 2, 3185: 1, 32: 1, 4690: 1, 4109: 1, 4617: 1, 1048: 2, 747: 1, 995: 1, 6454: 1, 3302: 3, 5984: 2, 1127: 2, 2165: 2, 2025: 1, 4985: 2, 227: 3, 5802: 1, 4623: 1, 438: 1, 4401: 3, 7099: 1, 1149: 1, 6772: 1, 5567: 1, 873: 2, 2957: 1, 7060: 1, 4206: 2, 5239: 1, 1557: 1, 6080: 1, 411: 2, 2139: 1, 2408: 2, 1189: 2, 3295: 3, 4728: 3, 2490: 1, 4147: 1, 6768: 1, 6810: 1, 2901: 1, 3981: 2, 4941: 1, 3879: 2, 5819: 1, 6662: 2, 1589: 2, 6170: 1, 4522: 1, 552: 2, 5270: 1, 2722: 2, 34: 1, 5193: 1, 5767: 1, 2670: 1, 3298: 1, 5542: 1}, 'adhd_parent': {2759: 0, 3852: 0, 631: 0, 5152: 1, 5334: 1, 6364: 1, 5652: 1, 1636: 0, 2869: 0, 4654: 1, 1888: 0, 247: 0, 6699: 0, 6471: 1, 1760: 0, 4182: 0, 6095: 0, 48: 0, 6348: 0, 5129: 1, 4920: 1, 4590: 0, 5892: 1, 5131: 1, 1649: 1, 5940: 1, 3960: 0, 3060: 1, 4852: 0, 4605: 0, 3475: 0, 4406: 0, 1958: 0, 2170: 1, 6478: 0, 5328: 1, 4063: 1, 6827: 0, 5085: 0, 5155: 0, 4879: 1, 3185: 0, 32: 0, 4690: 0, 4109: 0, 4617: 0, 1048: 1, 747: 0, 995: 0, 6454: 0, 3302: 1, 5984: 1, 1127: 1, 2165: 1, 2025: 0, 4985: 1, 227: 1, 5802: 0, 4623: 0, 438: 0, 4401: 1, 7099: 0, 1149: 0, 6772: 0, 5567: 0, 873: 1, 2957: 0, 7060: 0, 4206: 1, 5239: 0, 1557: 0, 6080: 0, 411: 1, 2139: 0, 2408: 1, 1189: 1, 3295: 1, 4728: 1, 2490: 0, 4147: 0, 6768: 0, 6810: 0, 2901: 0, 3981: 1, 4941: 0, 3879: 1, 5819: 0, 6662: 1, 1589: 1, 6170: 0, 4522: 0, 552: 1, 5270: 0, 2722: 1, 34: 0, 5193: 0, 5767: 0, 2670: 0, 3298: 0, 5542: 0}}
)

Solution

The error is a result of how RiskRatio is parsing your input data set behind the scenes.

When using RiskRatio, the default reference category is set to 0. So, when you independent variable is being processed internally, zEpid is looking for age_group=0. However, there are no instances of 0 in your data set.

To fix this, you can specify the optional argument reference. By default reference=0 but you can set it to 1, which will set age_group=1 as the reference risk for the risk ratio.

The following is a simple example with some simulated data with 'A' and 'Y'

import numpy as np
import pandas as pd
from scipy.stats import norm
from zepid import RiskRatio

np.random.seed(20220120)
df = pd.DataFrame()
df['A'] = np.random.randint(1, 4, size=100)
df['Y'] = np.random.binomial(n=1, p=0.25, size=100)

# Generating some generic data
np.random.seed(20220120)
df = pd.DataFrame()
df['A'] = np.random.randint(1, 4, size=80)           # Note: A \in {1,2,3}
df['Y'] = np.random.binomial(n=1, p=0.25, size=80)   # Note: Y \in {0,1}

# Estimating Risk Ratios with zEpid
rr = RiskRatio(reference=1)
rr.fit(df, exposure='A', outcome='Y')

# Calculating P-values
est = rr.results['RiskRatio'][1:]
std = rr.results['SD(RR)'][1:]
z_score = np.log(est)/std
p_value = norm.sf(abs(z_score))*2

# Displaying results
print("RR:     ", list(est))
print("P-value:", p_value)

Which should output the following

RR:      [1.0266666666666666, 0.7636363636363636]
P-value: [0.93990517 0.5312407 ]

I generated some generic data rather than use the example data set provided because there is another issue in that data that will result in an error. Below is a 2-by-3 table of the data set

adhd_parent   0   1
age_group          
1            62   0
2             0  32
3             0   6

These structural zeroes in the data will through a PositivityError in zEpid. Basically, you can't calculate the risk due to a division by zero (the risk in the referent is 0).