Tags: python, statistics, statsmodels

Inter-rater reliability calculation for multi-raters data


I have the following list of lists:

[[1, 1, 1, 1, 3, 0, 0, 1],
 [1, 1, 1, 1, 3, 0, 0, 1],
 [1, 1, 1, 1, 2, 0, 0, 1],
 [1, 1, 0, 2, 3, 1, 0, 1]]

I want to calculate an inter-rater reliability score for this data; there are multiple raters (rows). I cannot use Fleiss' kappa, since the rows do not sum to the same number. What is a good approach in this case?


Solution

  • Yes, data preparation is key here. Let's walk through it together.

    While Krippendorff's alpha may be superior for any number of reasons, numpy and statsmodels provide everything you need to get Fleiss' kappa from the table above. Fleiss' kappa is more prevalent in medical research, even though Krippendorff's alpha delivers mostly the same result when used correctly. If the two deliver substantially different results, this is usually down to user error, most importantly the format of the input data and the level of measurement (e.g. ordinal vs. nominal). Skip ahead for the solution (transpose & aggregate): Fleiss' kappa = 0.845.

    Pay close attention to which axis represents subject, rater, or category!
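
    In code, the whole transpose-and-aggregate solution boils down to a few lines. Here is a rough sketch using the three-rater subset that the walkthrough below works with (the variable names are just illustrative, and every step is explained afterwards):

    import numpy as np
    from statsmodels.stats import inter_rater as irr

    orig = [[1, 1, 1, 1, 3, 0, 0, 1],      # raters as rows, subjects as columns
            [1, 1, 1, 1, 3, 0, 0, 1],
            [1, 1, 1, 1, 2, 0, 0, 1]]

    table, _ = irr.aggregate_raters(np.array(orig).T)   # -> (subject, category_counts)
    irr.fleiss_kappa(table, method='fleiss')            # ~0.845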

    Fleiss' kappa

    from statsmodels.stats import inter_rater as irr
    import numpy as np
    

    The original data had raters as rows and subjects as columns with the integers representing the assigned categories (if I'm not mistaken).

    I removed one row because there were 4 rows and 4 categories, which could confuse the situation; so now we have 4 categories ([0,1,2,3]) and 3 rows.

    orig = [[1, 1, 1, 1, 3, 0, 0, 1],
            [1, 1, 1, 1, 3, 0, 0, 1],
            [1, 1, 1, 1, 2, 0, 0, 1]] 
    
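    Just to be explicit about the orientation at this point (a quick aside using numpy, not required for the calculation):

    np.array(orig).shape   # (3, 8): 3 raters as rows, 8 subjects as columns
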

    From the documentation of the aggregate_raters() function

    "convert raw data with shape (subject, rater) to (subject, cat_counts)"

    irr.aggregate_raters(orig)
    

    This returns:

    (array([[2, 5, 0, 1],
            [2, 5, 0, 1],
            [2, 5, 1, 0]]),
    array([0, 1, 2, 3]))
    

    Now the number of rows in the orig array is equal to the number of rows in the first of the returned arrays (3). The number of columns is now equal to the number of categories ([0,1,2,3] -> 4). The contents of each row add up to 8, which equals the number of columns in the orig input data, assuming every rater rated every subject. This aggregation shows how the raters are distributed across the categories (columns) for each subject (row). (If agreement were perfect on category 2 we would see [0,0,8,0]; on category 0, [8,0,0,0].)
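
    To see that perfect-agreement case with some made-up data (the values below are invented purely for illustration): eight raters, perfect agreement on every one of four subjects, every aggregated row concentrating all 8 counts in one category, and a kappa of 1.0:

    perfect = [[2] * 8,     # made-up data: 4 subjects (rows), 8 raters (columns)
               [0] * 8,
               [1] * 8,
               [3] * 8]
    irr.aggregate_raters(perfect)[0]
    # array([[0, 0, 8, 0],
    #        [8, 0, 0, 0],
    #        [0, 8, 0, 0],
    #        [0, 0, 0, 8]])
    irr.fleiss_kappa(irr.aggregate_raters(perfect)[0], method='fleiss')   # 1.0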

    The function expects the rows to be subjects. See how the number of subjects has not changed (3 rows). For each subject it counted how many times each category was assigned by 'looking at' how often that category (number) appears in the row. For the first row, category 0 was assigned twice, category 1 five times, category 2 not at all, and category 3 once:

    [1, 1, 1, 1, 3, 0, 0, 1] -> [2, 5, 0, 1]
    
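    You can reproduce that count for a single row with plain numpy as a sanity check (np.bincount is not part of the statsmodels workflow, just a quick way to tally the categories 0..3):

    np.bincount([1, 1, 1, 1, 3, 0, 0, 1], minlength=4)   # -> array([2, 5, 0, 1])
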

    The second array returns the category values. If we replace both 3s in the input array with 9s, the distribution looks the same but the last category changes:

    ori9 = [[1, 1, 1, 1, 9, 0, 0, 1],
            [1, 1, 1, 1, 9, 0, 0, 1],
            [1, 1, 1, 1, 2, 0, 0, 1]]

    irr.aggregate_raters(ori9)

    (array([[2, 5, 0, 1],
            [2, 5, 0, 1],
            [2, 5, 1, 0]]),
    array([0, 1, 2, 9]))      <- categories
    

    aggregate_raters() returns a tuple of ([data], [categories])
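
    Purely as an illustration, you can zip the two parts of that tuple together to label the counts (dats9/cats9 are just throwaway names for this example):

    dats9, cats9 = irr.aggregate_raters(ori9)
    list(zip(cats9, dats9[0]))   # pairs category -> count for the first row: 0->2, 1->5, 2->0, 9->1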

    In the [data] the rows stay subjects; aggregate_raters() turns the columns from raters into categories. Fleiss' kappa expects the 'table' data to be in this (subject, category) format: https://en.wikipedia.org/wiki/Fleiss'_kappa#Data

    Now to the solution of the problem:

    What happens if we plug the original data into Fleiss' kappa? (We just use the data 'dats', not the category list 'cats'.)

    dats, cats = irr.aggregate_raters(orig)
    irr.fleiss_kappa(dats, method='fleiss')
    

    -0.12811059907834096

    But... why? Well, look at the orig data: aggregate_raters() assumes the columns are raters! This means we have perfect disagreement, e.g. between the first column and the second-to-last column. Fleiss thinks: "the first rater always rated 1 and the second-to-last always rated 0", i.e. perfect disagreement on all three subjects.
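
    To make that concrete, just inspect those two columns of orig with numpy:

    np.array(orig)[:, 0]    # array([1, 1, 1]) -> this 'rater' always assigned category 1
    np.array(orig)[:, -2]   # array([0, 0, 0]) -> this 'rater' always assigned category 0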

    So what we need to do is this (sorry, I'm a noob; this might not be the most elegant way):

    giro = np.array(orig).transpose()
    giro
    
    array([[1, 1, 1],
           [1, 1, 1],
           [1, 1, 1],
           [1, 1, 1],
           [3, 3, 2],
           [0, 0, 0],
           [0, 0, 0],
           [1, 1, 1]]) 
    

    Now we have subjects as rows and raters as columns (three raters assigning 4 categories). What happens if we plug this into the aggregate_raters() function and feed the resulting data into fleiss_kappa()? (Using index 0 to grab the first part of the returned tuple.)

    irr.fleiss_kappa(irr.aggregate_raters(giro)[0], method='fleiss')
    

    0.8451612903225807

    Finally, this makes more sense: all three raters agreed perfectly except on subject 5 [3, 3, 2].
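
    If it helps to see what fleiss_kappa() is doing with that (subject, category) table, here is a hand-rolled sketch following the definition linked above (a simplified re-implementation for illustration, not the statsmodels source):

    def fleiss_kappa_by_hand(table):
        # table: (n_subjects, n_categories) counts; every subject rated by the
        # same number of raters (here: 8 subjects x 4 categories, 3 raters each)
        table = np.asarray(table, dtype=float)
        n = table.sum(axis=1)[0]                    # raters per subject
        p_j = table.sum(axis=0) / table.sum()       # overall proportion of each category
        P_i = ((table ** 2).sum(axis=1) - n) / (n * (n - 1))   # per-subject agreement
        P_bar, P_e = P_i.mean(), (p_j ** 2).sum()   # observed vs. chance agreement
        return (P_bar - P_e) / (1 - P_e)

    fleiss_kappa_by_hand(irr.aggregate_raters(giro)[0])   # ~0.845, same as above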

    Krippendorff's alpha

    The current krippendorff implementation expects the data in the orig format, with raters as rows and subjects as columns; no aggregation function is needed to prepare the data. So I can see how this was the simpler solution. Fleiss' kappa is still very prevalent in medical research, so let's see how it compares:

    import krippendorff as kd
    kd.alpha(orig)
    

    0.9359

    Wow, that's a lot higher than Fleiss' kappa... Well, we need to tell Krippendorff the level of measurement. From the package documentation: "Steven's level of measurement of the variable. It must be one of 'nominal', 'ordinal', 'interval', 'ratio' or a callable." This controls the 'difference function' of Krippendorff's alpha: https://repository.upenn.edu/cgi/viewcontent.cgi?article=1043&context=asc_papers

    kd.alpha(orig, level_of_measurement='nominal')
    

    0.8516

    Hope this helps, I learned a lot writing this.