An AI team is choosing metrics to automatically evaluate the quality of summarization and translation. Which mappings of automatic text-generation evaluation metrics to their representative target tasks are correct? (Choose TWO.)
Select the TWO options that pair a metric with the task it is typically used to evaluate.
ROUGE = summarization quality evaluation
Correct. ROUGE measures how much a generated summary overlaps with a reference summary and is used mainly to evaluate summarization quality.
BLEU = machine translation quality evaluation
Correct. BLEU measures how closely a generated translation matches a reference translation and is used mainly to evaluate machine translation quality.
ROUGE = machine translation quality evaluation
Incorrect. The mapping is reversed: ROUGE is the representative metric for evaluating summarization, while the representative metric for machine translation is BLEU.
Swapping the target tasks of ROUGE and BLEU is a classic trap.
BLEU = summarization quality evaluation
Incorrect. The mapping is reversed: BLEU is the representative metric for evaluating machine translation, while the representative metric for summarization is ROUGE.
BERTScore = inference speed evaluation
Incorrect. BERTScore evaluates how close a generated text is in meaning to a reference text by comparing embeddings.
It does not measure speed, so the mapping is wrong.
Automatic text-generation evaluation metrics differ by what they compare.
・ROUGE: measures the n-gram overlap between a generated summary and a reference summary. Mainly for summarization evaluation.
・BLEU: measures the n-gram match between a generated translation and a reference translation. Mainly for machine translation evaluation.
・BERTScore: measures semantic closeness using embeddings. It scores high when the meaning matches even if the wording differs.
Swapping the target tasks (summarization / translation) of ROUGE and BLEU is a classic trap.
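The difference between the two n-gram metrics can be made concrete with a minimal sketch. It assumes whitespace tokenization and a single reference, and the function names (`ngrams`, `overlap`, `rouge_n_recall`, `bleu_n_precision`) are hypothetical helpers written for this illustration, not any library's API. It shows only the core idea: ROUGE-N is usually reported as recall against the reference, while BLEU is built on clipped n-gram precision of the candidate. Full BLEU additionally combines several n-gram orders and applies a brevity penalty, which this sketch omits.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of the n-grams in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def overlap(candidate, reference, n):
    """Return (shared, candidate_total, reference_total) n-gram counts.

    Shared counts are clipped: an n-gram is credited at most as many
    times as it appears in the other text.
    """
    cand, ref = ngrams(candidate, n), ngrams(reference, n)
    return sum((cand & ref).values()), sum(cand.values()), sum(ref.values())


def rouge_n_recall(candidate, reference, n=1):
    """ROUGE-N style recall: shared n-grams / n-grams in the reference."""
    shared, _, ref_total = overlap(candidate, reference, n)
    return shared / ref_total if ref_total else 0.0


def bleu_n_precision(candidate, reference, n=1):
    """BLEU-style clipped precision: shared n-grams / n-grams in the candidate."""
    shared, cand_total, _ = overlap(candidate, reference, n)
    return shared / cand_total if cand_total else 0.0


# Illustrative sentences (made up for this example).
reference = "the cat sat on the mat".split()
candidate = "the cat sat".split()

print(rouge_n_recall(candidate, reference))    # 0.5 (covers 3 of 6 reference unigrams)
print(bleu_n_precision(candidate, reference))  # 1.0 (all 3 candidate unigrams match)
```

The short candidate scores perfectly on precision but only 0.5 on recall. This is one reason a recall-oriented score suits summarization, where covering the reference content matters, while BLEU pairs precision with a brevity penalty for translation.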