Model Selection
Utilities for splitting data into train/test sets and evaluating models via cross-validation.
train_test_split
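Splits the data arrays X and y into random train and test subsets. Given n samples and a test fraction test_size, roughly test_size × n samples are assigned to the test set and the remainder to the training set.
When shuffle = true (the default), the samples are randomly permuted before splitting, using a Fisher–Yates shuffle seeded with random_state. Passing the same seed reproduces the same split across runs.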
#include <Skigen/ModelSelection>
auto [X_train, X_test, y_train, y_test] = Skigen::train_test_split(
    X, y,
    /*test_size=*/0.25,
    /*random_state=*/42,
    /*shuffle=*/true
);
| Parameter | Default | Description |
|---|---|---|
| test_size | 0.25 | Fraction of data reserved for testing |
| random_state | 42 | Random seed for reproducibility |
| shuffle | true | Whether to shuffle before splitting |
cross_val_score
Evaluates a model using k-fold cross-validation. The data is split into k non-overlapping folds. For each fold i:
- Train the model on all folds except fold i.
- Score the model on fold i.
The function returns a vector of k scores, one per fold. The mean score estimates the model's generalization performance.
#include <cmath>
#include <iostream>
#include <Skigen/ModelSelection>
Skigen::LinearRegression model;
// One score per fold, here with 5 folds
auto scores = Skigen::cross_val_score(model, X, y, /*cv=*/5);
std::cout << "Mean R²: " << scores.mean() << "\n";
// Population standard deviation of the fold scores
std::cout << "Std: " << std::sqrt((scores.array() - scores.mean()).square().mean()) << "\n";
| Parameter | Default | Description |
|---|---|---|
| cv | 5 | Number of folds |
| shuffle | true | Whether to shuffle before folding |
| random_state | 42 | Random seed for reproducibility |
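The k-fold procedure described in this section can be sketched with the standard library alone. This is an illustration, not Skigen's implementation: fold_bounds and cross_val_sketch are hypothetical names, the folds are contiguous and unshuffled, k is assumed to be at least 1, and giving the first n mod k folds one extra sample is an assumption borrowed from scikit-learn's KFold.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Hypothetical helper (not part of Skigen): half-open [begin, end) index
// ranges of k contiguous, non-overlapping folds over n samples.
std::vector<std::pair<std::size_t, std::size_t>>
fold_bounds(std::size_t n, std::size_t k) {
    std::vector<std::pair<std::size_t, std::size_t>> bounds;
    bounds.reserve(k);
    const std::size_t base = n / k;
    const std::size_t extra = n % k;  // the first `extra` folds get one more sample
    std::size_t begin = 0;
    for (std::size_t i = 0; i < k; ++i) {
        const std::size_t size = base + (i < extra ? 1 : 0);
        bounds.emplace_back(begin, begin + size);
        begin += size;
    }
    return bounds;
}

// Hypothetical driver: for each fold, hold that fold out as the test set,
// pass the remaining indices as the training set, and collect one score.
// `fit_and_score(train, test)` is supplied by the caller and returns a double.
template <typename FitScore>
std::vector<double> cross_val_sketch(std::size_t n, std::size_t k,
                                     FitScore fit_and_score) {
    std::vector<double> scores;
    scores.reserve(k);
    for (auto [lo, hi] : fold_bounds(n, k)) {
        std::vector<std::size_t> train, test;
        for (std::size_t j = 0; j < n; ++j) {
            (j >= lo && j < hi ? test : train).push_back(j);
        }
        scores.push_back(fit_and_score(train, test));
    }
    return scores;
}
```

Every sample lands in exactly one test fold, so each observation is scored once by a model that never saw it during training. To imitate shuffle = true, the indices could be permuted before being assigned to folds.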
These functions mirror sklearn.model_selection.train_test_split and sklearn.model_selection.cross_val_score.