Tags: machine-learning, nlp, data-science, evaluation, summarization

# In the ROUGE metrics, what do the low, mid and high values mean?

The ROUGE metrics were introduced to "automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans" [1].

When calculating any ROUGE metric you get an `AggregateScore` object with 3 parameters: `low`, `mid`, `high`. How are these aggregate values calculated?

For example, from the huggingface implementation [2]:

```
>>> import evaluate
>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
...                         references=references)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
>>> print(results["rouge1"].mid.fmeasure)
1.0
```

Edit: On July 7th, the huggingface implementation was simplified to return a cleaner, easier-to-understand dict: https://github.com/huggingface/evaluate/issues/148

## Solution

• Given a list of (summary, gold_summary) pairs, each ROUGE metric is first calculated per item in the list. In huggingface, you can opt out of the aggregation step by passing `use_aggregator=False` and get these per-item values returned.
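To make the per-item step concrete, here is a simplified sketch of ROUGE-1 for a single pair, using plain whitespace tokenization (the real implementations use proper tokenizers and optional stemming, so exact scores can differ):

```python
from collections import Counter

def rouge1_per_item(prediction: str, reference: str):
    """Simplified ROUGE-1 for one (summary, gold_summary) pair:
    unigram overlap turned into precision, recall and f-measure."""
    pred_tokens = prediction.split()
    ref_tokens = reference.split()
    # Clipped unigram overlap: each reference token can only be matched once.
    overlap = sum((Counter(pred_tokens) & Counter(ref_tokens)).values())
    precision = overlap / len(pred_tokens) if pred_tokens else 0.0
    recall = overlap / len(ref_tokens) if ref_tokens else 0.0
    fmeasure = (2 * precision * recall / (precision + recall)
                if precision + recall else 0.0)
    return precision, recall, fmeasure

# One score per pair, mirroring what use_aggregator=False returns:
pairs = [("hello there", "hello there"), ("general kenobi", "general kenobi")]
scores = [rouge1_per_item(p, r) for p, r in pairs]
# Each pair matches exactly, so every score is (1.0, 1.0, 1.0)
```

It is this list of per-item scores that the aggregation step below operates on.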

For the aggregation, bootstrap resampling is used [1, 2]. Bootstrap resampling is a technique for estimating confidence intervals [3, 4]. The idea is that, given `n` samples, you draw `x` resamples of size `n` with replacement and calculate some statistic for each resample. The resulting distribution of statistics, called the `empirical bootstrap distribution`, can then be used to extract confidence intervals.

In the ROUGE implementation by Google [4], they use:

• `n` for the number of resamples to run
• the `mean` as the resample statistic
• the `2.5th, 50th and 97.5th percentiles` of the bootstrap distribution as the values for low, mid and high, respectively (the interval width can be controlled with the `confidence_interval` param)
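The scheme above can be sketched in a few lines of pure Python. This is an illustrative reimplementation, not Google's actual code: the function name, the nearest-rank percentile method, and the default parameters are assumptions for the sake of the example.

```python
import random
import statistics

def bootstrap_aggregate(per_item_scores, n_resamples=1000,
                        confidence_interval=0.95, seed=42):
    """Aggregate per-item scores into (low, mid, high) via bootstrap resampling."""
    rng = random.Random(seed)
    n = len(per_item_scores)
    # Draw n_resamples resamples of size n with replacement,
    # and record the mean of each one.
    means = sorted(
        statistics.mean(rng.choice(per_item_scores) for _ in range(n))
        for _ in range(n_resamples)
    )
    def percentile(p):
        # Nearest-rank percentile of the sorted bootstrap means.
        idx = min(int(p * n_resamples), n_resamples - 1)
        return means[idx]
    tail = (1 - confidence_interval) / 2  # 0.025 for a 95% interval
    return percentile(tail), percentile(0.5), percentile(1 - tail)

low, mid, high = bootstrap_aggregate([0.6, 0.7, 0.8, 0.9, 1.0])
```

When every per-item score is identical (as in the `"hello there"` example above), every resample mean is the same, which is why `low`, `mid` and `high` all come out as 1.0 there.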

## Randomness in ROUGE

Note that due to the bootstrapping technique used in ROUGE, it is non-deterministic and can return different results on each run (see [5]). If you still want to use bootstrapping but need reproducible results, you can set the seed in the load function, as such: `evaluate.load('rouge', seed=42)`.
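The effect of seeding is easy to demonstrate with a toy bootstrap (the helper below is hypothetical, written only to illustrate why a fixed seed makes the resampling, and therefore the low/mid/high values, reproducible):

```python
import random

def bootstrap_means(scores, n_resamples=100, seed=None):
    """Mean of each bootstrap resample; fixing `seed` makes the output reproducible."""
    rng = random.Random(seed)
    n = len(scores)
    return [sum(rng.choice(scores) for _ in range(n)) / n
            for _ in range(n_resamples)]

scores = [0.61, 0.75, 0.88, 0.93]
run_a = bootstrap_means(scores, seed=42)
run_b = bootstrap_means(scores, seed=42)  # same seed -> identical resamples
run_c = bootstrap_means(scores, seed=7)   # different seed -> different resamples
```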