
Model Selection

Utilities for splitting data into train/test sets and evaluating models via cross-validation.

train_test_split

Splits the data arrays $X$ and $y$ into random train and test subsets. Given $n$ samples and test fraction $f$:

$$
n_{\text{test}} = \lfloor n \cdot f \rfloor, \qquad n_{\text{train}} = n - n_{\text{test}}
$$
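As a quick check of the formula, take $n = 10$ and $f = 0.25$ (values chosen only for illustration):

$$
n_{\text{test}} = \lfloor 10 \cdot 0.25 \rfloor = \lfloor 2.5 \rfloor = 2, \qquad n_{\text{train}} = 10 - 2 = 8
$$

Because of the floor, any fractional remainder goes to the training set.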

When shuffle = true (the default), the samples are randomly permuted with a Fisher–Yates shuffle seeded by random_state before the split is made. Using the same seed on the same data therefore reproduces the same split.

```cpp
#include <Skigen/ModelSelection>

// X (features) and y (targets) are assumed to be defined elsewhere.
auto [X_train, X_test, y_train, y_test] = Skigen::train_test_split(
    X, y,
    /*test_size=*/0.25,
    /*random_state=*/42,
    /*shuffle=*/true
);
```

| Parameter | Default | Description |
| --- | --- | --- |
| test_size | 0.25 | Fraction of data reserved for testing |
| random_state | 42 | Random seed for reproducibility |
| shuffle | true | Whether to shuffle before splitting |

cross_val_score

Evaluates a model using $K$-fold cross-validation. The data is split into $K$ non-overlapping folds. For each fold $k$:

  1. Train the model on all folds except fold $k$.
  2. Score the model on fold $k$.

The function returns a vector of $K$ scores. The mean of these scores estimates the model's generalization performance.
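The two steps above can be sketched on index vectors. This is a minimal illustration rather than Skigen's implementation: `make_folds`, `cross_val_sketch`, and the `FitAndScore` callback are hypothetical names, shuffling is left out, $K \ge 1$ is assumed, and handing the first $n \bmod K$ folds one extra sample is an assumption about how uneven sizes might be handled.

```cpp
#include <cstddef>
#include <functional>
#include <numeric>
#include <vector>

// Hypothetical helper: partitions 0..n-1 into k contiguous, non-overlapping folds.
// Assumption: the first n % k folds receive one extra sample. Requires k >= 1.
std::vector<std::vector<std::size_t>> make_folds(std::size_t n, std::size_t k) {
    std::vector<std::vector<std::size_t>> folds(k);
    std::size_t start = 0;
    for (std::size_t f = 0; f < k; ++f) {
        const std::size_t size = n / k + (f < n % k ? 1 : 0);
        folds[f].resize(size);
        std::iota(folds[f].begin(), folds[f].end(), start);
        start += size;
    }
    return folds;
}

// Callback that trains on train_idx and returns the score on test_idx.
using FitAndScore = std::function<double(const std::vector<std::size_t>&,
                                         const std::vector<std::size_t>&)>;

// Returns one score per fold, in fold order.
std::vector<double> cross_val_sketch(std::size_t n, std::size_t k,
                                     const FitAndScore& fit_and_score) {
    const auto folds = make_folds(n, k);
    std::vector<double> scores;
    scores.reserve(k);
    for (std::size_t held_out = 0; held_out < k; ++held_out) {
        // Step 1: the training set is every fold except the held-out one.
        std::vector<std::size_t> train;
        for (std::size_t f = 0; f < k; ++f) {
            if (f != held_out) {
                train.insert(train.end(), folds[f].begin(), folds[f].end());
            }
        }
        // Step 2: score on the held-out fold.
        scores.push_back(fit_and_score(train, folds[held_out]));
    }
    return scores;
}
```

With `n = 10` and `k = 3`, `make_folds` produces folds of sizes 4, 3, and 3. To shuffle before folding, the indices could first be permuted with a seeded generator and the folds cut from the permuted sequence.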

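```cpp
#include <cmath>     // std::sqrt
#include <iostream>  // std::cout

#include <Skigen/ModelSelection>

// X (features) and y (targets) are assumed to be defined elsewhere.
Skigen::LinearRegression model;
auto scores = Skigen::cross_val_score(model, X, y, /*cv=*/5);

// Mean and population standard deviation of the per-fold scores.
std::cout << "Mean R²: " << scores.mean() << "\n";
std::cout << "Std: " << std::sqrt((scores.array() - scores.mean()).square().mean()) << "\n";
```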

| Parameter | Default | Description |
| --- | --- | --- |
| cv | 5 | Number of folds $K$ |
| shuffle | true | Whether to shuffle before folding |
| random_state | 42 | Random seed for reproducibility |
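The two summary values printed in the usage example above can be written out explicitly. Writing $s_k$ for the score on fold $k$ (a symbol introduced here only for illustration), the example computes:

$$
\bar{s} = \frac{1}{K} \sum_{k=1}^{K} s_k, \qquad \sigma = \sqrt{\frac{1}{K} \sum_{k=1}^{K} \left( s_k - \bar{s} \right)^2}
$$

The spread is the population standard deviation, which divides by $K$ rather than $K - 1$.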

Mirrors sklearn.model_selection.train_test_split and cross_val_score.