An ML team is trying to ensure the reliability of the evaluation figure used in the model's final report. Which dataset is held back until the end, used neither for training nor for hyperparameter tuning, and used to evaluate the model's final generalization performance?

1 / 1
Select an answer
CorrectB

Explanation

A question about choosing the dataset for evaluating final generalization performance.

  • 1used neither for training nor for hyperparameter tuningNot used for either training or tuning
  • 2to evaluate the model's final generalization performanceFor the final evaluation = the test set
AIncorrect

The training (train) set

The training set is data for training the model's weights.

It is not the data held back for final evaluation, so this is incorrect.

BCorrect

The test (test) set

Correct. The test set is the data held back to the very end, used neither for training nor for hyperparameter tuning, and used once to evaluate the final generalization performance.

CIncorrect

The validation (validation) set

The validation set is data used during training for hyperparameter tuning and model selection.

It is not the data held back for final evaluation, so this is incorrect.

DIncorrect

A sampled subset of the training data

A subset of the training data is data the model saw during training, even if it is sampled.

Evaluation on data it has already seen becomes optimistic and cannot be used for final evaluation, so this is incorrect.

Key Takeaway

Note the correct answer, the test (test) set.
- Data held back to the very end, used neither for training nor for hyperparameter tuning, and used once to evaluate the final generalization performance.
- It reveals unbiased performance close to production.
The training set (training the weights), the validation set (comparing settings during training), and the feature store (storing features) are all not the dataset for final evaluation.