machine-learning, nlp, data-science, evaluation, summarization

In the ROUGE metrics, what do the low, mid and high values mean?


The ROUGE metrics were introduced to "automatically determine the quality of a summary by comparing it to other (ideal) summaries created by humans" [1].

When calculating any ROUGE metric you get an AggregateScore object with three fields: low, mid, and high. How are these aggregate values calculated?

For example, from the huggingface implementation [2]:

>>> rouge = evaluate.load('rouge')
>>> predictions = ["hello there", "general kenobi"]
>>> references = ["hello there", "general kenobi"]
>>> results = rouge.compute(predictions=predictions,
...                         references=references)
>>> print(list(results.keys()))
['rouge1', 'rouge2', 'rougeL', 'rougeLsum']
>>> print(results["rouge1"])
AggregateScore(low=Score(precision=1.0, recall=1.0, fmeasure=1.0), mid=Score(precision=1.0, recall=1.0, fmeasure=1.0), high=Score(precision=1.0, recall=1.0, fmeasure=1.0))
>>> print(results["rouge1"].mid.fmeasure)
1.0

Edit: On July 7th, the huggingface implementation was simplified to return a cleaner and easier to understand dict: https://github.com/huggingface/evaluate/issues/148


Solution

  • Given a list of (summary, gold_summary) pairs, each ROUGE metric is first calculated separately for every pair in the list. In huggingface, you can opt out of the aggregation step by passing use_aggregator=False, which returns these per-pair values instead.

    For the aggregation, bootstrap resampling is used [1, 2]. Bootstrap resampling is a technique for estimating confidence intervals [3, 4]. The idea is that, given n samples, you draw a resample of size n with replacement, repeat this x times, and calculate some statistic for each resample. The x resulting values form a new distribution, called the empirical bootstrap distribution, from which confidence intervals can be read off as percentiles.

    The ROUGE implementation by Google [4] uses:

    • 1000 resamples by default (the n_samples parameter)
    • the mean as the statistic computed on each resample
    • the 2.5th, 50th and 97.5th percentiles of the resampled means as the values for low, mid and high, respectively (the outer two can be controlled with the confidence_interval param, which defaults to 0.95)
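The aggregation described above can be sketched in a few lines of standard-library Python. This is only an illustration of the technique, not the code from the Google or huggingface packages: the names bootstrap_aggregate, percentile and n_resamples are hypothetical, the per-item scores are made up, and the percentile interpolation may differ slightly from what the real implementation does.

```python
import random
import statistics


def percentile(sorted_values, q):
    """Linearly interpolated percentile (q in [0, 100]) of a sorted list."""
    if not sorted_values:
        raise ValueError("cannot take a percentile of an empty list")
    pos = (len(sorted_values) - 1) * q / 100.0
    lower = int(pos)
    upper = min(lower + 1, len(sorted_values) - 1)
    frac = pos - lower
    return sorted_values[lower] * (1 - frac) + sorted_values[upper] * frac


def bootstrap_aggregate(scores, n_resamples=1000, confidence_interval=0.95, seed=None):
    """Return low/mid/high percentiles of the bootstrap distribution of the mean."""
    rng = random.Random(seed)
    n = len(scores)
    # Each resample draws n scores with replacement and is reduced to its mean.
    means = sorted(
        statistics.fmean(rng.choices(scores, k=n)) for _ in range(n_resamples)
    )
    # For a 95% interval the tails are the 2.5th and 97.5th percentiles.
    tail = (1 - confidence_interval) / 2 * 100
    return {
        "low": percentile(means, tail),
        "mid": percentile(means, 50),
        "high": percentile(means, 100 - tail),
    }


# Made-up per-item F-measures, standing in for per-pair ROUGE scores.
per_item_f1 = [0.42, 0.55, 0.31, 0.78, 0.60, 0.49]
result = bootstrap_aggregate(per_item_f1, seed=0)
print(result)
```

With only a handful of items the gap between low and high is wide; it narrows as the number of (summary, gold_summary) pairs grows, which is why mid is usually the value that gets reported.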

    Randomness in ROUGE

    Note that because of the bootstrapping, the aggregated ROUGE scores are non-deterministic and can differ from run to run (see [5]). If you want to keep the bootstrapping but still get reproducible results, you can set the seed in the load function, like so: evaluate.load('rouge', seed=42).
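To see why a seed makes the result reproducible, here is a small standard-library sketch that does not call the ROUGE packages at all. The helper bootstrap_mid and the scores are hypothetical; it mimics only the "mid" value, the median of the resampled means.

```python
import random
import statistics


def bootstrap_mid(scores, n_resamples=1000, seed=None):
    """Median of the bootstrap means; seed=None uses fresh OS-provided randomness."""
    rng = random.Random(seed)
    means = [
        statistics.fmean(rng.choices(scores, k=len(scores)))
        for _ in range(n_resamples)
    ]
    return statistics.median(means)


scores = [0.42, 0.55, 0.31, 0.78, 0.60, 0.49]

# Without a seed, repeated runs usually give slightly different values.
unseeded = {bootstrap_mid(scores) for _ in range(5)}
# With a fixed seed, every run draws the same resamples.
seeded = {bootstrap_mid(scores, seed=42) for _ in range(5)}

print("distinct unseeded results:", len(unseeded))
print("distinct seeded results:", len(seeded))
```

The seeded set always collapses to a single value, while the unseeded one typically does not; the differences are small but enough to change trailing digits in a reported score.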