HistGradientBoostingClassifier

Histogram-based gradient boosting: features are binned up-front so split finding scans bin histograms rather than raw values, making training near-linear in the sample count.

Algorithm

Each feature is quantile-binned into at most max_bins buckets. Split finding then operates on per-bin gradient/hessian histograms with a second-order (Newton) split-gain criterion, grown leaf-wise and bounded by max_leaf_nodes. The grower supports L2 regularisation, per-feature monotonic constraints, and holdout-based early stopping.
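The binning step can be sketched in isolation. This is a minimal illustration under stated assumptions, not Skigen's implementation: the helpers quantile_bin_edges and bin_index are hypothetical names, the quantiles use a nearest-rank rule without interpolation, and missing values are not handled.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical helper (not part of Skigen): compute at most max_bins - 1
// interior thresholds for one feature from its empirical quantiles.
std::vector<double> quantile_bin_edges(std::vector<double> values, int max_bins) {
    assert(max_bins >= 2 && max_bins <= 255 && !values.empty());
    std::sort(values.begin(), values.end());
    std::vector<double> edges;
    for (int b = 1; b < max_bins; ++b) {
        // Nearest-rank position of the b/max_bins quantile.
        std::size_t idx = (values.size() - 1) * static_cast<std::size_t>(b) /
                          static_cast<std::size_t>(max_bins);
        double edge = values[idx];
        // Skip repeated thresholds so low-cardinality features get fewer bins.
        if (edges.empty() || edge > edges.back()) edges.push_back(edge);
    }
    return edges;
}

// Hypothetical helper: map a raw value to its bin, defined here as the
// number of edges strictly below the value. With max_bins <= 255 the
// result always fits in one byte.
std::uint8_t bin_index(const std::vector<double>& edges, double value) {
    auto it = std::lower_bound(edges.begin(), edges.end(), value);
    return static_cast<std::uint8_t>(it - edges.begin());
}
```

Once every feature is reduced to one byte per sample, split finding only needs to visit at most max_bins candidate thresholds per feature, regardless of how many distinct raw values there were.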

Binary problems boost a single log-odds score $F$ with gradient $g_i = \sigma(F_i) - y_i$ and hessian $h_i = \sigma(F_i)(1 - \sigma(F_i))$. Multiclass problems boost one tree per class per iteration against the softmax cross-entropy gradient $g_{i,k} = p_{i,k} - \mathbb{1}[y_i = k]$, with predictions normalised by softmax.
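The binary derivatives and a histogram scan that uses them can be sketched as follows. Treat this as an illustration only: all names are hypothetical, and the gain is the common second-order form built from G^2 / (H + lambda) as used in XGBoost-style boosters, which is assumed here rather than confirmed to be Skigen's exact criterion.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

struct GradHess { double g; double h; };

double sigmoid(double f) { return 1.0 / (1.0 + std::exp(-f)); }

// Binary log-loss derivatives with respect to the raw score, y in {0, 1}.
GradHess binary_logloss_grad_hess(double raw_score, double y) {
    double p = sigmoid(raw_score);
    return {p - y, p * (1.0 - p)};
}

// Newton step for a leaf with summed gradient G and hessian H (assumed form).
double leaf_value(double G, double H, double l2) { return -G / (H + l2); }

// Second-order gain of splitting a node into left and right children.
double split_gain(double GL, double HL, double GR, double HR, double l2) {
    auto score = [l2](double G, double H) {
        double denom = H + l2;
        return denom > 0.0 ? G * G / denom : 0.0;  // an empty side contributes nothing
    };
    return 0.5 * (score(GL, HL) + score(GR, HR) - score(GL + GR, HL + HR));
}

struct BestSplit { int bin; double gain; };

// Scan one feature's per-bin histogram; samples with bin <= result.bin go left.
// Returns bin = -1 when no split has positive gain.
BestSplit best_split(const std::vector<GradHess>& hist, double l2) {
    assert(!hist.empty());
    double G = 0.0, H = 0.0;
    for (const GradHess& b : hist) { G += b.g; H += b.h; }
    BestSplit best{-1, 0.0};
    double GL = 0.0, HL = 0.0;
    for (std::size_t b = 0; b + 1 < hist.size(); ++b) {
        GL += hist[b].g;
        HL += hist[b].h;
        double gain = split_gain(GL, HL, G - GL, H - HL, l2);
        if (gain > best.gain) best = {static_cast<int>(b), gain};
    }
    return best;
}
```

The scan costs one pass over the bins because the right-hand sums are obtained by subtracting the running left-hand sums from the node totals.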

Constructor

Skigen::HistGradientBoostingClassifier<Scalar> model(
    Loss loss = Loss::LogLoss,
    Scalar learning_rate = 0.1,
    int max_iter = 100,
    std::optional<int> max_leaf_nodes = 31,
    std::optional<int> max_depth = std::nullopt,
    int min_samples_leaf = 20,
    Scalar l2_regularization = 0.0,
    int max_bins = 255,
    std::optional<std::vector<int>> monotonic_cst = std::nullopt,
    bool early_stopping = false,
    Scalar validation_fraction = 0.1,
    int n_iter_no_change = 10,
    Scalar tol = 1e-7,
    std::optional<uint64_t> random_state = std::nullopt);

Parameters

| Parameter | Default | Description |
| --- | --- | --- |
| learning_rate | 0.1 | Shrinkage per iteration. |
| max_iter | 100 | Number of boosting iterations. |
| max_leaf_nodes | 31 | Leaf-wise growth bound (nullopt = unbounded). |
| min_samples_leaf | 20 | Minimum samples per leaf. |
| l2_regularization | 0.0 | L2 penalty on the Newton step. |
| max_bins | 255 | Feature quantisation resolution (2–255). |
| monotonic_cst | nullopt | Per-feature +1 / -1 / 0 constraint. |
| early_stopping | false | Enable holdout-based stopping. |
| validation_fraction | 0.1 | Holdout size for early stopping. |
| n_iter_no_change | 10 | Patience before stopping. |
| random_state | nullopt | Seed for the holdout split. |
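How n_iter_no_change and tol interact can be illustrated with a simple patience counter. This is a sketch of one common rule, assuming the holdout loss must improve on the best value so far by more than tol; the function name is hypothetical and Skigen's exact comparison may differ.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper: given the holdout loss after each boosting iteration,
// return how many iterations run before the patience rule stops training.
std::size_t iterations_before_stop(const std::vector<double>& val_loss,
                                   int n_iter_no_change, double tol) {
    assert(n_iter_no_change >= 1);
    if (val_loss.empty()) return 0;
    double best = val_loss[0];
    int stale = 0;  // consecutive iterations without a sufficient improvement
    for (std::size_t i = 1; i < val_loss.size(); ++i) {
        if (val_loss[i] < best - tol) {
            best = val_loss[i];
            stale = 0;
        } else if (++stale >= n_iter_no_change) {
            return i + 1;  // stop after this iteration
        }
    }
    return val_loss.size();  // patience never ran out
}
```

A larger n_iter_no_change tolerates longer plateaus at the cost of extra iterations, while a larger tol treats small improvements as noise.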

Both binary and multiclass log-loss are supported.

Methods

| Method | Description |
| --- | --- |
| fit(X, y) | Bin features, then boost. |
| predict(X) | Class labels. |
| predict_proba(X) | Class probabilities (sigmoid for binary, softmax for multiclass). |
| decision_function(X) | Raw scores: (n,) log-odds for binary, (n, K) for multiclass. |
| score(X, y) | Mean accuracy. |
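The relationship between raw scores and probabilities can be sketched for a single sample. The helpers below are hypothetical and standalone, not Skigen functions, and taking the label as the arg-max of the probabilities is an assumption about how predict derives class labels.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

// Binary case: probability of the positive class from a log-odds score.
double sigmoid_proba(double log_odds) {
    return 1.0 / (1.0 + std::exp(-log_odds));
}

// Multiclass case: softmax over the K raw scores of one sample.
// Subtracting the maximum first keeps std::exp from overflowing.
std::vector<double> softmax_proba(const std::vector<double>& scores) {
    assert(!scores.empty());
    double max_score = *std::max_element(scores.begin(), scores.end());
    std::vector<double> proba(scores.size());
    double sum = 0.0;
    for (std::size_t k = 0; k < scores.size(); ++k) {
        proba[k] = std::exp(scores[k] - max_score);
        sum += proba[k];
    }
    for (double& p : proba) p /= sum;
    return proba;
}

// Index of the most probable class (assumed labelling rule).
std::size_t argmax_class(const std::vector<double>& proba) {
    assert(!proba.empty());
    return static_cast<std::size_t>(
        std::max_element(proba.begin(), proba.end()) - proba.begin());
}
```

Because both transforms are monotonic in the raw scores, the arg-max over probabilities matches the arg-max over raw scores in the multiclass case, and a log-odds threshold of 0 corresponds to a probability of 0.5 in the binary case.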

Fitted Attributes

| Accessor | Description |
| --- | --- |
| bin_edges() | Per-feature quantile bin edges. |
| train_score() | Per-iteration training log-loss. |

Example

Skigen::HistGradientBoostingClassifier<double> gb;
gb.fit(X, y);
auto preds = gb.predict(X_test);

Verified against scikit-learn

This estimator is checked by the parity suite. See the generator tests/parity/generate_ensemble_reference.py and the reference fixtures in tests/parity/data/hist_gradient_boosting_classifier/, exercised by tests/parity/parity_ensemble.cpp.

API Reference