PCA

Principal Component Analysis finds orthogonal directions of maximum variance in the data, enabling dimensionality reduction while retaining as much information as possible.

The examples/pca_clustering_workflow.cpp program projects 10-D Gaussian clusters down to 2-D with PCA and recovers them with KMeans, rendered via SkigenPlot:

Figure: 10-D Gaussian clusters projected to 2-D by Skigen::PCA and grouped by Skigen::KMeans.

Algorithm

Given an $n \times p$ data matrix $X$:

1. Center the data: $X_c = X - \mathbf{1}\bar{x}^\top$, where $\bar{x}$ is the column-wise mean.
2. Compute the SVD of $X_c$: $X_c = U \Sigma V^\top$.
3. Truncate to $k$ components by keeping only the first $k$ columns of $V$ (the right singular vectors), denoted $V_k$.

The projection onto the $k$-dimensional subspace is:

$$Z = X_c \, V_k \in \mathbb{R}^{n \times k}$$
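
The three steps can be sketched without any library. This is a minimal, dependency-free illustration and not Skigen's implementation: it substitutes power iteration on $X_c^\top X_c$ for the SVD and recovers only the first component ($k = 1$). The names `center_columns`, `top_component`, and `project` are hypothetical helpers introduced for this sketch.

```cpp
#include <cassert>
#include <cmath>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;  // row-major: n rows, p columns

// Step 1: subtract the column-wise mean from every row; the mean is returned via `mean`.
Matrix center_columns(const Matrix& X, std::vector<double>& mean) {
    assert(!X.empty());
    const std::size_t n = X.size(), p = X[0].size();
    mean.assign(p, 0.0);
    for (const auto& row : X)
        for (std::size_t j = 0; j < p; ++j) mean[j] += row[j] / n;
    Matrix Xc = X;
    for (auto& row : Xc)
        for (std::size_t j = 0; j < p; ++j) row[j] -= mean[j];
    return Xc;
}

// Step 2 (stand-in for the SVD): power iteration on Xc^T Xc. It converges to the
// first right singular vector when sigma_1 > sigma_2 and the start vector is
// not orthogonal to it.
std::vector<double> top_component(const Matrix& Xc, int iters = 100) {
    assert(!Xc.empty());
    const std::size_t p = Xc[0].size();
    std::vector<double> v(p, 0.0);
    v[0] = 1.0;  // deterministic start vector (an assumption of this sketch)
    for (int it = 0; it < iters; ++it) {
        std::vector<double> w(p, 0.0);  // w = Xc^T (Xc v), accumulated row by row
        for (const auto& row : Xc) {
            double dot = 0.0;
            for (std::size_t j = 0; j < p; ++j) dot += row[j] * v[j];
            for (std::size_t j = 0; j < p; ++j) w[j] += dot * row[j];
        }
        double norm = 0.0;
        for (double x : w) norm += x * x;
        norm = std::sqrt(norm);
        if (norm == 0.0) break;  // start vector orthogonal to the data; give up
        for (std::size_t j = 0; j < p; ++j) v[j] = w[j] / norm;
    }
    return v;
}

// Step 3: project each centered row onto the component, z_i = x_i . v.
std::vector<double> project(const Matrix& Xc, const std::vector<double>& v) {
    std::vector<double> z;
    z.reserve(Xc.size());
    for (const auto& row : Xc) {
        double dot = 0.0;
        for (std::size_t j = 0; j < v.size(); ++j) dot += row[j] * v[j];
        z.push_back(dot);
    }
    return z;
}
```

For the four collinear points (1, 2), (2, 3), (3, 4), (4, 5), the sketch gives the mean (2.5, 3.5), the axis $(1/\sqrt{2}, 1/\sqrt{2})$, and scores $-1.5\sqrt{2}, -0.5\sqrt{2}, 0.5\sqrt{2}, 1.5\sqrt{2}$.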

Explained Variance

The explained variance of the $j$-th component is derived from the singular values:

$$\text{explained\_variance}_j = \frac{\sigma_j^2}{n - 1}$$

using $n-1$ degrees of freedom (Bessel's correction), consistent with scikit-learn. The explained variance ratio measures the proportion of total variance captured by each component:

$$\text{explained\_variance\_ratio}_j = \frac{\sigma_j^2}{\sum_{i=1}^{p} \sigma_i^2}$$
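
Both formulas are plain arithmetic on the singular values. The sketch below is illustrative only: `VarianceSummary` and `summarize_variance` are hypothetical names, not part of Skigen. The ratio's denominator sums over all singular values, so the full spectrum must be passed in, not only the $k$ retained values.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct VarianceSummary {
    std::vector<double> explained_variance;        // sigma_j^2 / (n - 1)
    std::vector<double> explained_variance_ratio;  // sigma_j^2 / sum_i sigma_i^2
};

// `singular_values` must hold the full spectrum of the centered data matrix.
VarianceSummary summarize_variance(const std::vector<double>& singular_values,
                                   std::size_t n_samples) {
    assert(n_samples > 1);  // n - 1 degrees of freedom requires at least two samples
    double total = 0.0;
    for (double s : singular_values) total += s * s;

    VarianceSummary out;
    for (double s : singular_values) {
        out.explained_variance.push_back(s * s / static_cast<double>(n_samples - 1));
        out.explained_variance_ratio.push_back(total > 0.0 ? s * s / total : 0.0);
    }
    return out;
}
```

With singular values 4 and 2 and $n = 5$ samples, the explained variances are $16/4 = 4$ and $4/4 = 1$, and the ratios are $16/20 = 0.8$ and $4/20 = 0.2$.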

SVD Solvers

The svd_solver parameter selects how the SVD is computed:

"full" (default) — exact decomposition via Eigen::JacobiSVD. Best for small to medium dense data where exactness matters.
"randomized" — the Halko-Martinsson-Tropp randomized range finder. Draws a Gaussian projection of width n_components + n_oversamples, runs n_iter QR-stabilised power iterations, then performs a small dense SVD. Much faster than the full SVD when only the top components are needed.
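
For orientation, the randomized scheme can be written out as follows. This is the textbook form of the range finder with QR-stabilised power iterations, given as a sketch; the exact order of operations inside Skigen may differ. The symbols $\ell$, $\Omega$, $Y$, $Q$, $\tilde{Q}$, $B$, and $\tilde{U}$ are introduced here and do not appear elsewhere on this page.

```latex
\begin{aligned}
\ell &= \texttt{n\_components} + \texttt{n\_oversamples},
  \qquad \Omega \in \mathbb{R}^{p \times \ell},\ \Omega_{ij} \sim \mathcal{N}(0, 1) \\
Y &= X_c \, \Omega \in \mathbb{R}^{n \times \ell},
  \qquad Q = \operatorname{qr}(Y) \\
\text{repeat } \texttt{n\_iter} \text{ times:}\quad
  \tilde{Q} &= \operatorname{qr}(X_c^\top Q),
  \qquad Q = \operatorname{qr}(X_c \, \tilde{Q}) \\
B &= Q^\top X_c \in \mathbb{R}^{\ell \times p},
  \qquad B = \tilde{U} \, \Sigma \, V^\top \\
U &= Q \, \tilde{U}
\end{aligned}
```

Here $\operatorname{qr}(\cdot)$ denotes the orthonormal factor of a thin QR decomposition. Only the small $\ell \times p$ matrix $B$ goes through a dense SVD, and the first $k$ columns of $V$ are then kept as in the full solver.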

Sparse Input

PCA supports sparse matrices natively via implicit centering. The data is mean-centered through a linear operator

$$(X - \mathbf{1}\mu)\,M = X M - \mathbf{1}(\mu M),$$

so the sparse matrix is never materialised dense. Sparse fitting always uses the randomized solver — this mirrors scikit-learn, where explicitly centering a sparse matrix would destroy its sparsity. The per-feature mean is computed directly from the sparse column sums.
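
The identity can be checked on a small example. The sketch below is a dependency-free illustration, not Skigen's code: `Triplet`, `SparseMatrix`, `column_means`, and `centered_product` are hypothetical names, and the coordinate-list storage is an assumption made for brevity. It applies the centered operator to a dense matrix $M$ while touching only the stored nonzeros.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct Triplet { std::size_t row, col; double value; };  // one stored nonzero

struct SparseMatrix {  // coordinate-list storage (assumption of this sketch)
    std::size_t rows, cols;
    std::vector<Triplet> entries;
};

using Dense = std::vector<std::vector<double>>;

// Per-feature mean from the sparse column sums; zeros contribute nothing.
std::vector<double> column_means(const SparseMatrix& X) {
    std::vector<double> mu(X.cols, 0.0);
    for (const auto& t : X.entries) mu[t.col] += t.value;
    for (auto& m : mu) m /= static_cast<double>(X.rows);
    return mu;
}

// Computes (X - 1 mu) M as X M - 1 (mu M) without forming X - 1 mu.
Dense centered_product(const SparseMatrix& X, const Dense& M) {
    assert(M.size() == X.cols && !M.empty());
    const std::size_t k = M[0].size();
    const std::vector<double> mu = column_means(X);

    // mu M is a single 1 x k row shared by every output row.
    std::vector<double> muM(k, 0.0);
    for (std::size_t j = 0; j < X.cols; ++j)
        for (std::size_t c = 0; c < k; ++c) muM[c] += mu[j] * M[j][c];

    // Start every output row at -(mu M), then add the sparse product X M.
    Dense out(X.rows, std::vector<double>(k));
    for (auto& row : out)
        for (std::size_t c = 0; c < k; ++c) row[c] = -muM[c];
    for (const auto& t : X.entries)
        for (std::size_t c = 0; c < k; ++c) out[t.row][c] += t.value * M[t.col][c];
    return out;
}
```

The cost is proportional to the number of nonzeros times $k$, plus $O((n + p)k)$ for the mean correction, whereas explicit centering would fill all $n \times p$ entries.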

Key Properties

PCA always centers the data before decomposition. For data that should not be centered (e.g., TF-IDF), use TruncatedSVD instead.
The components are ordered by decreasing explained variance.
inverse_transform reconstructs an approximation: $\hat{X} = Z V_k^\top + \mathbf{1}\bar{x}^\top$.
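
The reconstruction formula amounts to one matrix product plus the mean. The following sketch is illustrative and independent of Skigen; `reconstruct` is a hypothetical helper, and the components are stored one per row, matching the "rows = components" layout described for `components()`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

using Matrix = std::vector<std::vector<double>>;

// X_hat = Z V_k^T + 1 mean^T, with `components` holding k rows of length p.
Matrix reconstruct(const Matrix& Z, const Matrix& components,
                   const std::vector<double>& mean) {
    const std::size_t k = components.size(), p = mean.size();
    Matrix X_hat(Z.size(), mean);  // every row starts at the mean
    for (std::size_t i = 0; i < Z.size(); ++i) {
        assert(Z[i].size() == k);
        for (std::size_t c = 0; c < k; ++c) {
            assert(components[c].size() == p);
            for (std::size_t j = 0; j < p; ++j)
                X_hat[i][j] += Z[i][c] * components[c][j];
        }
    }
    return X_hat;
}
```

With a single component (0.6, 0.8), mean (2.5, 3.5), and scores 5 and -5, the reconstructed rows are (5.5, 7.5) and (-0.5, -0.5). The result equals the original data only when the discarded components carry no variance.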

Mirrors sklearn.decomposition.PCA.

Constructor

Skigen::PCA<Scalar> pca(
    Eigen::Index n_components = 0,
    std::string svd_solver = "full",
    int n_oversamples = 10,
    int n_iter = 5,
    std::optional<uint64_t> random_state = std::nullopt);

Parameter	Default	Description
`n_components`	`0`	Number of components to keep ($0$ = all)
`svd_solver`	`"full"`	`"full"` (exact) or `"randomized"`
`n_oversamples`	`10`	Extra random dimensions for the randomized solver
`n_iter`	`5`	Power iterations for the randomized solver
`random_state`	`nullopt`	Seed for the randomized solver

Methods

Method	Description
`fit(X)`	Compute the SVD of the centered dense data (`full` or `randomized`)
`fit(X_sparse)`	Native sparse fit via implicit centering (randomized)
`transform(X)`	Project $X$ onto the principal components
`fit_transform(X)`	Fit and project in one call
`inverse_transform(Z)`	Reconstruct from the reduced representation

Fitted Attributes

Accessor	Type	Description
`components()`	`MatrixType`	Principal axes (rows = components)
`explained_variance()`	`VectorType`	Variance explained by each component ($\sigma_j^2 / (n-1)$)
`explained_variance_ratio()`	`VectorType`	Fraction of total variance per component
`singular_values()`	`VectorType`	Singular values $\sigma_1 \ge \sigma_2 \ge \cdots$
`mean()`	`RowVectorType`	Per-feature mean $\bar{x}$

Example

#include <Skigen/Decomposition>
#include <Eigen/Dense>
#include <iostream>

int main() {
    Eigen::MatrixXd X = Eigen::MatrixXd::Random(100, 10);

    Skigen::PCA<double> pca(3);  // keep 3 components; Scalar = double matches MatrixXd
    pca.fit(X);

    Eigen::MatrixXd X_reduced = pca.transform(X);  // 100 x 3
    std::cout << "Explained variance ratio: "
              << pca.explained_variance_ratio().transpose() << "\n";

    // Approximate reconstruction
    Eigen::MatrixXd X_approx = pca.inverse_transform(X_reduced);
}
