machine-learning nlp data-science evaluation summarization

In the ROUGE metrics, what do the low, mid and high values mean?

The ROUGE metrics were introduced to "automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans" [1].

When calculating any ROUGE metric you get an AggregateScore object with 3 parameters: low, mid, high. How are these aggregate values calculated?

For example, from the huggingface implementation [2]:

>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
...                         references=references)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
>>> print(results["rouge1"].mid.fmeasure)
1.0

Edit: On July 7th, the huggingface implementation was simplified to return a cleaner and easier to understand dict: https://github.com/huggingface/evaluate/issues/148

Solution

Given a list of (summary, gold_summary) pairs, any ROUGE metric is calculated per each item in the list. In huggingface, you can opt-out of the aggregation part by adding use_aggregator=False and get these values returned.

For the aggregation, a bootstrap resampling is used [1, 2]. Bootstrap resampling is a technique used to extract confidence intervals [3, 4]. The idea is that for n samples, you draw x times a sample with replacement of size n, and then calculate some statistic for each resample. Now you get a new distribution called the empirical bootstrap distribution, which can be used to extract confidence intervals.

In the ROUGE implementation by google [4], they used:

n for the number of resamples to run
mean for the resample statistic
2.5th, 50th and 97.5th percentiles to calculate the values for low, mid and high, respectively (can be controlled with the confidence_interval param)

Randomness in ROUGE

Note that due to the bootstrapping technique used in ROUGE, it is non-deterministic, and can return different results for each run (see [5]). If you don't want to opt out from using the bootstrapping technique, you can set the seed in the load function, as such: evaluate.load('rouge', seed=42).