I am trying to compute inter-annotator agreement on a toy example using NLTK's nltk.metrics.agreement module. Specifically, I am computing Krippendorff's alpha with two different distance metrics (binary_distance and interval_distance).
For toy example 1 below, which has near-total agreement (only one pair disagrees), I expect a value close to 1. However, the result is 0.0 in both cases. Why?
I understand that Krippendorff's alpha with the interval metric is designed for interval data rather than binary-like two-category labels. I would, however, not expect a zero agreement value back from the module. For background: the toy example is simply a specific subset of a larger dataset containing annotation scores in the range [1, 4]; the subset belongs to a particular population within that dataset.
In toy example 2, things start to look better for the interval alpha. The binary alpha should probably raise an exception, given that there are now three labels in the data.
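(For reference, binary_distance itself accepts any pair of labels and simply returns 0 for identical labels and 1 otherwise, i.e. it behaves as a nominal distance, which is presumably why no exception is raised. A quick check:)

from nltk.metrics import binary_distance

# binary_distance is an all-or-nothing (nominal) comparison:
# 0.0 for identical labels, 1.0 for any differing pair,
# regardless of how many distinct labels occur in the data.
print(binary_distance(4, 4))   # 0.0
print(binary_distance(4, 3))   # 1.0
print(binary_distance(1, 4))   # 1.0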
Toy Example 1
from nltk.metrics.agreement import AnnotationTask
from nltk.metrics import interval_distance, binary_distance
annotation_triples = [('coder_1', '1', 4),
                      ('coder_2', '1', 4),
                      ('coder_1', '2', 4),
                      ('coder_2', '2', 4),
                      ('coder_1', '3', 4),
                      ('coder_2', '3', 4),
                      ('coder_1', '4', 4),
                      ('coder_2', '4', 3)]

t = AnnotationTask(annotation_triples, distance=binary_distance)
result_binary = t.alpha()

t = AnnotationTask(annotation_triples, distance=interval_distance)
result_interval = t.alpha()
result binary: 0.0
result interval: 0.0
Toy Example 2 (first pair changed to use 1 instead of 4)
annotation_triples = [('coder_1', '1', 1),
                      ('coder_2', '1', 1),
                      ('coder_1', '2', 4),
                      ('coder_2', '2', 4),
                      ('coder_1', '3', 4),
                      ('coder_2', '3', 4),
                      ('coder_1', '4', 4),
                      ('coder_2', '4', 3)]
result binary: 0.59
result interval: 0.93
Answer provided by Klaus Krippendorff
I do not know the NLTK implementation of alpha, but from what you reproduced it does not seem to be wrong.
To clarify, α is not based on the interval metric difference; the interval difference function is only one of many versions. It responds to meaningful algebraic differences, which are absent in nominal categories.
Incidentally, when you have binary data, all metric differences should produce the same results, since two values are either the same or different.
Let me focus on the two numerical examples you gave of 2 coders coding 4 units. The coincidence matrix (which tabulates the sum of all possible pairs of values within units) sums to n = 8, not the 10 in your calculations. They look like:
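(Reconstructed from the annotation triples above, with row and column marginals:)

Toy example 1 (labels 3 and 4):

         3    4  | n_c
    3    0    1  |  1
    4    1    6  |  7
    -----------------
         1    7  |  8

Toy example 2 (labels 1, 3 and 4):

         1    3    4  | n_c
    1    2    0    0  |  2
    3    0    0    1  |  1
    4    0    1    4  |  5
    ----------------------
         2    1    5  |  8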
Yes, as the variance converges to zero, so does alpha. In your 1st example there is virtually no variance, and the only deviation from uniformity is a disagreement. Such data cannot possibly be relied upon for computing correlations, testing statistical hypotheses, providing information about phenomena of interest to answer research questions, etc. If the annotations were without variation whatsoever, the reliability data could not tell you whether the coders were asleep, decided to code everything alike so as to achieve 100% agreement, or used a broken instrument. Data need variation.
In the 2nd example you do have a larger variance. Whether you calculate alpha with the nominal or the interval metric, the reliabilities have to be higher.
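To make this concrete, here is a small, self-contained sketch (my own addition, not part of the answer above) that recomputes alpha = 1 - D_o / D_e directly from the coincidence matrix. It reproduces the NLTK values and shows that the observed disagreement D_o is 0.25 in all four cases; what changes is the expected disagreement D_e, which is also 0.25 in example 1 (hence alpha = 0) but grows to roughly 0.61 (nominal) and 3.68 (interval) in example 2 once the data have some variance.

# Hand computation of Krippendorff's alpha from the coincidence matrix.
# NLTK's interval_distance already returns the squared difference, so the
# distance functions are used directly as the delta term.
from collections import Counter
from itertools import permutations

from nltk.metrics import binary_distance, interval_distance

def alpha_by_hand(triples, distance):
    # Group the values assigned to each unit (item).
    by_item = {}
    for coder, item, value in triples:
        by_item.setdefault(item, []).append(value)

    # Coincidence matrix: every ordered pair of values within a unit,
    # weighted by 1 / (m_u - 1). With two coders per unit the weight is 1.
    coincidences = Counter()
    for values in by_item.values():
        m = len(values)
        for v1, v2 in permutations(values, 2):
            coincidences[(v1, v2)] += 1.0 / (m - 1)

    n = sum(coincidences.values())        # number of pairable values (8 here)
    marginals = Counter()                 # n_c: how often each value occurs
    for (v1, _v2), count in coincidences.items():
        marginals[v1] += count

    # Observed and expected disagreement.
    D_o = sum(count * distance(v1, v2)
              for (v1, v2), count in coincidences.items()) / n
    D_e = sum(marginals[c] * marginals[k] * distance(c, k)
              for c in marginals for k in marginals) / (n * (n - 1))
    return 1 - D_o / D_e, D_o, D_e

example_1 = [('coder_1', '1', 4), ('coder_2', '1', 4),
             ('coder_1', '2', 4), ('coder_2', '2', 4),
             ('coder_1', '3', 4), ('coder_2', '3', 4),
             ('coder_1', '4', 4), ('coder_2', '4', 3)]
example_2 = [('coder_1', '1', 1), ('coder_2', '1', 1)] + example_1[2:]

for name, data in [('example 1', example_1), ('example 2', example_2)]:
    for metric in (binary_distance, interval_distance):
        a, D_o, D_e = alpha_by_hand(data, metric)
        print(name, metric.__name__,
              'alpha =', round(a, 2), 'D_o =', D_o, 'D_e =', round(D_e, 2))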