38 lines
1.9 KiB
Markdown
38 lines
1.9 KiB
Markdown
---
|
|
title: Dataset Splitting
|
|
---
|
|
## Dataset Splitting
|
|
|
|
Splitting up into Training, Cross Validation, and Test sets are common best practices.
|
|
This allows you to tune various parameters of the algorithm without making judgements that specifically conform to training data.
|
|
|
|
### Motivation
|
|
|
|
Dataset Splitting emerges as a necessity to eliminate bias to training data in ML algorithms.
|
|
Modifying parameters of a ML algorithm to best fit the training data commonly results in an overfit algorithm that performs poorly on actual test data.
|
|
For this reason, we split the dataset into multiple, discrete subsets on which we train different parameters.
|
|
|
|
#### The Training Set
|
|
|
|
The Training set is used to compute the actual model your algorithm will use when exposed to new data.
|
|
This dataset is typically 60%-80% of your entire available data (depending on whether or not you use a Cross Validation set).
|
|
|
|
#### The Cross Validation Set
|
|
|
|
Cross Validation sets are for model selection (typically ~20% of your data).
|
|
Use this dataset to try different parameters for the algorithm as trained on the Training set.
|
|
For example, you can evaluate differnt model parameters (polynomial degree or lambda, the regularization parameter) on the Cross Validation set to see which may be most accurate.
|
|
|
|
#### The Test Set
|
|
|
|
The Test set is the final dataset you touch (typically ~20% of your data).
|
|
It is the source of truth.
|
|
Your accuracy in predicting the test set is the accuracy of your ML algorithm.
|
|
|
|
#### More Information:
|
|
- [AWS ML Doc](http://docs.aws.amazon.com/machine-learning/latest/dg/splitting-the-data-into-training-and-evaluation-data.html)
|
|
- [A good stackoverflow post](https://stackoverflow.com/questions/13610074/is-there-a-rule-of-thumb-for-how-to-divide-a-dataset-into-training-and-validatio)
|
|
- [Academic Paper](https://www.mff.cuni.cz/veda/konference/wds/proc/pdf10/WDS10_105_i1_Reitermanova.pdf)
|
|
|
|
|