I'm coming from this post and this website of the zEpid package. My goal is to compute the relative risk (aka risk ratio) from a pd.DataFrame. The independent variable has three levels (1,2, and 3) and my dependent variable (target) has two levels (1 and 0).
When I adapt the following code
import numpy as np
import pandas as pd
from scipy.stats import norm
from zepid import RiskRatio
# calculating p-value
est= rr.results['RiskRatio'][1]
std = rr.results['SD(RR)'][1]
z_score = np.log(est)/std
p_value = norm.sf(abs(z_score))*2
Python is returning:
106 # Getting unique values and dropping reference
107 vals = set(df[exposure].dropna().unique())
--> 108 vals.remove(self.reference)
109 self._c = df.loc[(df[exposure] == self.reference) & (df[outcome] == 1)].shape[0]
110 self._d = df.loc[(df[exposure] == self.reference) & (df[outcome] == 0)].shape[0]
KeyError: 0
Any help is appreciated.
The dataframe is below
df = pd.DataFrame(
{'age_group': {2759: 1, 3852: 1, 631: 1, 5152: 2, 5334: 2, 6364: 2, 5652: 2, 1636: 1, 2869: 1, 4654: 2, 1888: 1, 247: 1, 6699: 1, 6471: 2, 1760: 1, 4182: 1, 6095: 1, 48: 1, 6348: 1, 5129: 2, 4920: 2, 4590: 1, 5892: 2, 5131: 2, 1649: 2, 5940: 2, 3960: 1, 3060: 2, 4852: 1, 4605: 1, 3475: 1, 4406: 1, 1958: 1, 2170: 2, 6478: 1, 5328: 2, 4063: 3, 6827: 1, 5085: 1, 5155: 1, 4879: 2, 3185: 1, 32: 1, 4690: 1, 4109: 1, 4617: 1, 1048: 2, 747: 1, 995: 1, 6454: 1, 3302: 3, 5984: 2, 1127: 2, 2165: 2, 2025: 1, 4985: 2, 227: 3, 5802: 1, 4623: 1, 438: 1, 4401: 3, 7099: 1, 1149: 1, 6772: 1, 5567: 1, 873: 2, 2957: 1, 7060: 1, 4206: 2, 5239: 1, 1557: 1, 6080: 1, 411: 2, 2139: 1, 2408: 2, 1189: 2, 3295: 3, 4728: 3, 2490: 1, 4147: 1, 6768: 1, 6810: 1, 2901: 1, 3981: 2, 4941: 1, 3879: 2, 5819: 1, 6662: 2, 1589: 2, 6170: 1, 4522: 1, 552: 2, 5270: 1, 2722: 2, 34: 1, 5193: 1, 5767: 1, 2670: 1, 3298: 1, 5542: 1}, 'adhd_parent': {2759: 0, 3852: 0, 631: 0, 5152: 1, 5334: 1, 6364: 1, 5652: 1, 1636: 0, 2869: 0, 4654: 1, 1888: 0, 247: 0, 6699: 0, 6471: 1, 1760: 0, 4182: 0, 6095: 0, 48: 0, 6348: 0, 5129: 1, 4920: 1, 4590: 0, 5892: 1, 5131: 1, 1649: 1, 5940: 1, 3960: 0, 3060: 1, 4852: 0, 4605: 0, 3475: 0, 4406: 0, 1958: 0, 2170: 1, 6478: 0, 5328: 1, 4063: 1, 6827: 0, 5085: 0, 5155: 0, 4879: 1, 3185: 0, 32: 0, 4690: 0, 4109: 0, 4617: 0, 1048: 1, 747: 0, 995: 0, 6454: 0, 3302: 1, 5984: 1, 1127: 1, 2165: 1, 2025: 0, 4985: 1, 227: 1, 5802: 0, 4623: 0, 438: 0, 4401: 1, 7099: 0, 1149: 0, 6772: 0, 5567: 0, 873: 1, 2957: 0, 7060: 0, 4206: 1, 5239: 0, 1557: 0, 6080: 0, 411: 1, 2139: 0, 2408: 1, 1189: 1, 3295: 1, 4728: 1, 2490: 0, 4147: 0, 6768: 0, 6810: 0, 2901: 0, 3981: 1, 4941: 0, 3879: 1, 5819: 0, 6662: 1, 1589: 1, 6170: 0, 4522: 0, 552: 1, 5270: 0, 2722: 1, 34: 0, 5193: 0, 5767: 0, 2670: 0, 3298: 0, 5542: 0}}
)
The error is a result of how RiskRatio
is parsing your input data set behind the scenes.
When using RiskRatio
, the default reference category is set to 0
. So, when you independent variable is being processed internally, zEpid is looking for age_group=0
. However, there are no instances of 0
in your data set.
To fix this, you can specify the optional argument reference
. By default reference=0
but you can set it to 1
, which will set age_group=1
as the reference risk for the risk ratio.
The following is a simple example with some simulated data with 'A'
and 'Y'
import numpy as np
import pandas as pd
from scipy.stats import norm
from zepid import RiskRatio
np.random.seed(20220120)
df = pd.DataFrame()
df['A'] = np.random.randint(1, 4, size=100)
df['Y'] = np.random.binomial(n=1, p=0.25, size=100)
# Generating some generic data
np.random.seed(20220120)
df = pd.DataFrame()
df['A'] = np.random.randint(1, 4, size=80) # Note: A \in {1,2,3}
df['Y'] = np.random.binomial(n=1, p=0.25, size=80) # Note: Y \in {0,1}
# Estimating Risk Ratios with zEpid
rr = RiskRatio(reference=1)
rr.fit(df, exposure='A', outcome='Y')
# Calculating P-values
est = rr.results['RiskRatio'][1:]
std = rr.results['SD(RR)'][1:]
z_score = np.log(est)/std
p_value = norm.sf(abs(z_score))*2
# Displaying results
print("RR: ", list(est))
print("P-value:", p_value)
Which should output the following
RR: [1.0266666666666666, 0.7636363636363636]
P-value: [0.93990517 0.5312407 ]
I generated some generic data rather than use the example data set provided because there is another issue in that data that will result in an error. Below is a 2-by-3 table of the data set
adhd_parent 0 1
age_group
1 62 0
2 0 32
3 0 6
These structural zeroes in the data will through a PositivityError
in zEpid. Basically, you can't calculate the risk due to a division by zero (the risk in the referent is 0).