Select the purpose of the train/validation/test …

An ML team is defining quality control rules for model development. In machine learning, which of the following BEST describes the main purpose of splitting data into training (train), validation, and test sets?

Correct: D

Explanation

Question Overview

Select the purpose of the train/validation/test split.

Requirements to satisfy

1. "main purpose of splitting data into training (train), validation, and test sets": to fairly evaluate generalization performance on data not used in training

Per-option explanation

A: Incorrect

To correct imbalance in the number of records per class

Correcting class imbalance is a separate task, handled with resampling or class weighting.

The purpose of the three-way split is to fairly measure generalization performance on data not used in training, so this option is incorrect.

B: Incorrect

To make the training data and test data identical to raise accuracy

If the training and test data are identical, the evaluation only measures rote memorization, and the reported accuracy overstates real performance.

This is the exact opposite of fair evaluation, so this option is incorrect.
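The effect of evaluating on the training data can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: `Memorizer` is a hypothetical model that simply stores its training examples, and the dataset is synthetic noise whose labels are coin flips, so there is nothing real to learn.

```python
import random


def make_noise_dataset(n, seed):
    """Return n (feature, label) pairs whose labels are pure coin flips."""
    rng = random.Random(seed)
    return [(rng.random(), rng.randint(0, 1)) for _ in range(n)]


class Memorizer:
    """Hypothetical model that stores every training example verbatim."""

    def fit(self, rows):
        self.table = {x: y for x, y in rows}
        return self

    def predict(self, x):
        # Inputs never seen in training fall back to a fixed guess of 0.
        return self.table.get(x, 0)


def accuracy(model, rows):
    """Fraction of rows whose label the model predicts correctly."""
    return sum(model.predict(x) == y for x, y in rows) / len(rows)


train = make_noise_dataset(1000, seed=0)
test = make_noise_dataset(1000, seed=1)  # independent draw, unseen by the model

model = Memorizer().fit(train)
train_acc = accuracy(model, train)  # scored on the data it memorized
test_acc = accuracy(model, test)    # scored on unseen data

print(f"accuracy on training data: {train_acc:.2f}")
print(f"accuracy on unseen data:   {test_acc:.2f}")
```

Because the labels are random, the accuracy on unseen data should land near chance, while the score on the memorized training data looks perfect. That gap is the overstatement described above.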

C: Incorrect

To repeat training three times on the same data to raise accuracy

The data is split three ways to assign distinct roles, not to repeat the same training.

Only the training set is used for training; the validation and test sets are used exclusively for evaluation, so this option is incorrect.

D: Correct

To fairly estimate generalization performance on unseen data without bias

This is correct. The data is split three ways so that each subset has a distinct role and generalization performance can be measured fairly.

- Training data (train): data shown to the model repeatedly to learn its parameters (weights).

- Validation data (validation): data used to test the in-progress model for hyperparameter tuning, model selection, and checking overfitting (not used for training itself).

- Test data (test): data used only once after everything is decided, to measure final generalization performance on unseen data without bias.

Evaluating on data not used in training avoids overstated results due to rote memorization.
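The three roles above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: `three_way_split`, the 70/15/15 proportions, the synthetic one-dimensional dataset, and the small k-nearest-neighbor classifier are all hypothetical choices, not something the question prescribes.

```python
import random


def three_way_split(rows, val_frac=0.15, test_frac=0.15, seed=42):
    """Shuffle once, then cut into disjoint train/validation/test lists."""
    rows = list(rows)
    random.Random(seed).shuffle(rows)
    n_test = int(len(rows) * test_frac)
    n_val = int(len(rows) * val_frac)
    test = rows[:n_test]
    val = rows[n_test:n_test + n_val]
    train = rows[n_test + n_val:]
    return train, val, test


def knn_predict(train, x, k):
    """Majority vote among the k training points nearest to x."""
    nearest = sorted(train, key=lambda row: abs(row[0] - x))[:k]
    votes = sum(label for _, label in nearest)
    return 1 if votes * 2 > k else 0


def accuracy(train, rows, k):
    """Accuracy of a k-NN model built on `train`, scored on `rows`."""
    return sum(knn_predict(train, x, k) == y for x, y in rows) / len(rows)


# Synthetic data (an assumption): label is 1 when x > 0.5, with 10% label noise.
rng = random.Random(0)
data = []
for _ in range(400):
    x = rng.random()
    label = 1 if x > 0.5 else 0
    if rng.random() < 0.1:
        label = 1 - label
    data.append((x, label))

train, val, test = three_way_split(data)

# Validation: choose the hyperparameter k without touching the test set.
best_k = max([1, 5, 15], key=lambda k: accuracy(train, val, k))

# Test: used once, after k is fixed, for the final estimate.
test_acc = accuracy(train, test, best_k)
print(f"chosen k = {best_k}, test accuracy = {test_acc:.2f}")
```

Only `train` supplies the stored neighbors, `val` decides which `k` to keep, and `test` is consulted a single time at the end, so the final number is not influenced by any tuning decision.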

Key Takeaway

The main purpose of splitting data into train/validation/test is to 'fairly estimate generalization performance on unseen data without bias.' Evaluating on data not used in training prevents overstated results due to rote memorization. 'Correcting class imbalance,' 'making training and test identical,' and 'repeating training three times' are not the purpose; in particular, train = test is the exact opposite of fair evaluation.
