主成分分析 - バイオインフォの森

1. 確率・統計と行列
2. 分散共分散行列の性質
3. 主成分分析の導出
4. Pythonによる実装

1. 確率・統計と行列

(1) 残差ベクトル

観測ベクトル $\mathbf{y}\in\mathbb{R}^n$、モデルの予測 $\hat{\mathbf{y}}\in\mathbb{R}^n$ に対して、残差ベクトルは
$$
\mathbf{e}=\mathbf{y}-\hat{\mathbf{y}}
$$
で定義する。$|\mathbf{e}|$ が小さいほど当てはまりが良い。

(2) 分散共分散行列

データ行列 $X\in\mathbb{R}^{n\times p}$（行＝サンプル、列＝変数）。各列を平均0に中心化した後の標本分散共分散行列を
$$
S=\frac{1}{\,n-1\,}X^{\mathsf T}X
$$
と定める。$S_{ii}$ は変数 $i$ の分散、$S_{ij}$ は変数 $i$ と $j$ の共分散。

(3) 相関係数行列

各変数の標準偏差を $d_i=\sqrt{S_{ii}}$ とすると、相関係数行列 $R$ は
$$
R_{ij}=\frac{S_{ij}}{d_i d_j}
$$

(4) マハラノビス距離

平均 $\boldsymbol{\mu}$、共分散 $S$ に対し、点 $\mathbf{x}$ の距離は
$$
d_M(\mathbf{x})=\sqrt{(\mathbf{x}-\boldsymbol{\mu})^{\mathsf T}S^{-1}(\mathbf{x}-\boldsymbol{\mu})}
$$

(5) 一般化分散

全体のばらつきの尺度は
$$
\det(S)
$$

(6) フルランクとランク落ち

$\mathrm{rank}(S)=p$ ならフルランク。小さい場合は線形従属（多重共線性）。

(7) 多重共線性

説明変数がほぼ線形結合になっている現象。$S$ が特異（または縮退）に近い。

2. 分散共分散行列の性質

(1) 分散共分散行列のサイズ

$S\in\mathbb{R}^{p\times p}$ は正方行列。

(2) 分散共分散行列は対称行列

定義から $S_{ij}=S_{ji}$。よって $S=S^{\mathsf T}$。

(3) 分散共分散行列は非負値（半正定値）

任意の $\mathbf{a}\in\mathbb{R}^p$ で
$$
\mathbf{a}^{\mathsf T}S\mathbf{a}=\frac{1}{\,n-1\,}|X\mathbf{a}|^2\ge 0
$$
よって $S$ は半正定値。

3. 主成分分析の導出

(1) 特定の方向の分散（$w$ 方向の分散が $w^{\mathsf T}Sw$ になることの証明）

状況設定：中心化済みデータ $X\in\mathbb{R}^{n\times p}$、方向ベクトル $\mathbf{w}\in\mathbb{R}^p$。各サンプル $\mathbf{x}_i^{\mathsf T}$ の射影（スカラー）を $z_i=\mathbf{x}_i^{\mathsf T}\mathbf{w}$ とする。射影ベクトルを $\mathbf{z}=X\mathbf{w}\in\mathbb{R}^n$ と書ける。

標本分散の定義（平均は 0）：
$$
\mathrm{Var}(\mathbf{z})=\frac{1}{\,n-1\,}\sum_{i=1}^{n}z_i^2=\frac{1}{\,n-1\,}\mathbf{z}^{\mathsf T}\mathbf{z}
$$

$\mathbf{z}=X\mathbf{w}$ を代入して
$$
\mathrm{Var}(\mathbf{z})=\frac{1}{\,n-1\,}\mathbf{w}^{\mathsf T}X^{\mathsf T}X\mathbf{w}=\mathbf{w}^{\mathsf T}\Big(\frac{1}{\,n-1\,}X^{\mathsf T}X\Big)\mathbf{w}=\mathbf{w}^{\mathsf T}S\mathbf{w}
$$

以上で、方向 $\mathbf{w}$ の分散は $\,\mathbf{w}^{\mathsf T}S\mathbf{w}\,$ と等しいことが示せた。これはレイリー商の形であり、後で最大化する。

(2) ラグランジュの未定乗数法（最大分散方向が固有問題に帰着することの証明）

目的：分散 $\mathbf{w}^{\mathsf T}S\mathbf{w}$ を最大化したい。ただし尺度の任意性（$\alpha\mathbf{w}$ にすると分散は $\alpha^2$ 倍）を除くため、制約 $|\mathbf{w}|=1$ を課す。

最適化問題：
$$
\max_{\mathbf{w}}\ \mathbf{w}^{\mathsf T}S\mathbf{w}\quad\text{s.t.}\quad \mathbf{w}^{\mathsf T}\mathbf{w}=1
$$

ラグランジュ関数：
$$
\mathcal{L}(\mathbf{w},\lambda)=\mathbf{w}^{\mathsf T}S\mathbf{w}-\lambda(\mathbf{w}^{\mathsf T}\mathbf{w}-1)
$$

$\mathbf{w}$ について勾配を 0 にする（$\nabla_{\mathbf{w}}\mathcal{L}=\mathbf{0}$）。後述(3)のベクトル解析から
$$
\nabla_{\mathbf{w}}(\mathbf{w}^{\mathsf T}S\mathbf{w})=2S\mathbf{w},\quad \nabla_{\mathbf{w}}(\mathbf{w}^{\mathsf T}\mathbf{w})=2\mathbf{w}
$$
よって
$$
2S\mathbf{w}-2\lambda\mathbf{w}=\mathbf{0}\ \Rightarrow\ S\mathbf{w}=\lambda\mathbf{w}
$$

すなわち、最適方向 $\mathbf{w}$ は $S$ の固有ベクトル、対応する分散は固有値 $\lambda$ に等しい。レイリー商の性質（またはクーラント・フィッシャーの極値原理）から、最大値は最大固有値 $\lambda_1$、そのときの最適解は対応固有ベクトル $\mathbf{w}_1$。同様に、$\mathbf{w}_1$ に直交という制約を加えると 2 番目の固有ベクトルが得られ、以後も同様。

(3) ベクトル解析（勾配の厳密証明）

主張1：
$$
\nabla_{\mathbf{x}}(\mathbf{x}^{\mathsf T}\mathbf{x})=2\mathbf{x}
$$

証明：$\mathbf{x}^{\mathsf T}\mathbf{x}=\sum_{i=1}^{p}x_i^2$。$k$ 成分で偏微分すると $\partial/\partial x_k$ は $2x_k$、したがって勾配ベクトル全体は $[2x_1,\dots,2x_p]^{\mathsf T}=2\mathbf{x}$。

主張2（一般形）：
$$
\nabla_{\mathbf{x}}(\mathbf{x}^{\mathsf T}A\mathbf{x})=(A+A^{\mathsf T})\mathbf{x}
$$

証明：成分で書くと $\mathbf{x}^{\mathsf T}A\mathbf{x}=\sum_{i,j}a_{ij}x_i x_j$。$k$ で偏微分すると
$$
\frac{\partial}{\partial x_k}\sum_{i,j}a_{ij}x_i x_j=\sum_{j}a_{kj}x_j+\sum_{i}a_{ik}x_i
$$
これは第 $k$ 成分が $\big((A+A^{\mathsf T})\mathbf{x}\big)_k$ であることに一致する。したがって主張成立。特に $A$ が対称（$S=S^{\mathsf T}$）なら $$ \nabla_{\mathbf{x}}(\mathbf{x}^{\mathsf T}A\mathbf{x})=2A\mathbf{x}
$$
となる。

(4) 主成分分析（固有値問題としての導出とデータ変換の手順）

導出の核心：

方向 $\mathbf{w}$ に沿う分散は $\mathbf{w}^{\mathsf T}S\mathbf{w}$。
制約 $|\mathbf{w}|=1$ の下で最大化すると $S\mathbf{w}=\lambda\mathbf{w}$ に帰着。
最大分散の主成分軸は、$S$ の最大固有値に対応する固有ベクトル $\mathbf{w}_1$。次の主成分は $\mathbf{w}_1$ に直交という制約の下で最大化し、固有値順に $\mathbf{w}_2,\mathbf{w}_3,\dots$ が得られる（互いに直交）。

固有値分解と変換：
$S$ の固有値分解を
$$
S=V\Lambda V^{\mathsf T}
$$
とする。$V=[\mathbf{v}_1,\dots,\mathbf{v}_p]$（直交行列）、$\Lambda=\mathrm{diag}(\lambda_1,\dots,\lambda_p)$（$\lambda_1\ge\cdots\ge\lambda_p\ge 0$）。主成分軸は $\mathbf{v}_i$、各主成分の分散は $\lambda_i$。中心化データ $X$ を $k$ 次元に写像するスコアは $$ Z=X V_k $$ ここで $V_k=[\mathbf{v}_1,\dots,\mathbf{v}_k]$。第 $i$ 主成分の列 $Z_{(\cdot,i)}$ は、$X$ を $\mathbf{v}_i$ に射影した値。再構成（低次元近似）は
$$
\hat{X}=Z V_k^{\mathsf T}=X V_k V_k^{\mathsf T}
$$
で与えられ、これが最小二乗誤差の意味で最良のランク $k$ 近似になる（エッカート・ヤングの定理；SVD と等価）。

二つの視点の同値：

固有値分解視点：$S$ を対角化して主方向と分散を得る。
SVD 視点：$X=U\Sigma V^{\mathsf T}$ として、$V$ が主成分軸、$\Sigma^2/(n-1)$ が固有値（分散）に一致。

(5) 寄与率（固有値の和＝全分散、寄与率の定義と性質の証明）

主張A（全分散＝固有値和）：
$$
\sum_{i=1}^{p}\lambda_i=\mathrm{tr}(S)=\sum_{j=1}^{p}S_{jj}
$$

証明：トレース不変性より
$$
\mathrm{tr}(S)=\mathrm{tr}(V\Lambda V^{\mathsf T})=\mathrm{tr}(\Lambda)=\sum_{i=1}^{p}\lambda_i
$$
一方、$\mathrm{tr}(S)=\sum_j S_{jj}$ は各変数の分散の総和であり、これが全分散。よって等しい。

主張B（寄与率の定義）：
第 $i$ 主成分の寄与率（全分散に占める割合）は
$$
\mathrm{CR}_i=\frac{\lambda_i}{\sum_{j=1}^{p}\lambda_j}
$$
累積寄与率は
$$
\mathrm{CCR}_k=\frac{\sum_{i=1}^{k}\lambda_i}{\sum_{j=1}^{p}\lambda_j}
$$

解釈：$\lambda_i$ は第 $i$ 主成分の分散、分子は「保持した情報量」、分母は「全情報量」。したがって比率が「どれだけの分散（情報）を説明できたか」を表す。SVD では $\lambda_i=\sigma_i^2/(n-1)$（$\sigma_i$ は特異値）なので、$\sum \sigma_i^2/(n-1)=\mathrm{tr}(S)$ と一致する。

4. Pythonによる実装

(1) 固有値分解による実装

import numpy as np

def pca_eig(X, k=None, center=True, scale=False):
    """
    PCA via eigen-decomposition of the covariance matrix (no scikit-learn).

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data. Rows are samples, columns are features.
    k : int or None
        Number of principal components to keep. If None, keep all.
    center : bool
        If True, subtract column means.
    scale : bool
        If True, standardize columns to unit variance after centering.

    Returns
    -------
    scores : ndarray, shape (n_samples, k)
        Projected data (principal component scores): X_centered @ components.
    components : ndarray, shape (n_features, k)
        Principal axes (eigenvectors / loadings). Columns are components.
    explained_variance : ndarray, shape (k,)
        Eigenvalues of the covariance matrix (variances of each PC).
    explained_variance_ratio : ndarray, shape (k,)
        Fraction of total variance explained by each PC.
    mean_ : ndarray, shape (n_features,)
        Per-feature means used for centering.
    scale_ : ndarray, shape (n_features,)
        Per-feature scales used for standardization (1.0 if not scaled).
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape

    # 1) Center (and optionally scale) the data
    mean_ = X.mean(axis=0) if center else np.zeros(n_features)
    Xc = X - mean_
    if scale:
        # ddof=1 to match sample std used with covariance (n-1 in denominator)
        scale_ = Xc.std(axis=0, ddof=1)
        # Protect against zero-variance columns
        scale_[scale_ == 0.0] = 1.0
        Xc = Xc / scale_
    else:
        scale_ = np.ones(n_features)

    # 2) Covariance matrix (sample covariance, divisor n-1)
    # shape: (n_features, n_features)
    S = (Xc.T @ Xc) / (n_samples - 1)

    # 3) Symmetric eigen-decomposition
    # eigh returns eigenvalues in ascending order
    evals, evecs = np.linalg.eigh(S)

    # 4) Sort by descending eigenvalue (largest variance first)
    order = np.argsort(evals)[::-1]
    evals = evals[order]
    evecs = evecs[:, order]

    # 5) Select number of components
    if k is None:
        k = n_features
    k = int(k)
    components = evecs[:, :k]              # (n_features, k)
    explained_variance = evals[:k]         # (k,)
    total_var = evals.sum()
    explained_variance_ratio = explained_variance / total_var if total_var > 0 else np.zeros_like(explained_variance)

    # 6) Scores (projected data)
    scores = Xc @ components               # (n_samples, k)

    return scores, components, explained_variance, explained_variance_ratio, mean_, scale_


# ---------- Example usage ----------
if **name** == "**main**":
    # Toy data: 6 samples, 3 features
    X = np.array([
        [2.5, 2.4, 0.5],
        [0.5, 0.7, 0.1],
        [2.2, 2.9, 0.6],
        [1.9, 2.2, 0.4],
        [3.1, 3.0, 0.9],
        [2.3, 2.7, 0.7]
    ])

    scores, comps, ev, evr, mean_, scale_ = pca_eig(X, k=2, center=True, scale=False)

    print("Mean:", mean_)
    print("Components (columns are PCs):\n", comps)
    print("Explained variance:", ev)
    print("Explained variance ratio:", evr)
    print("Scores (projected data):\n", scores)

    # To transform NEW data X_new to the PC space:
    # X_new_scores = ((X_new - mean_) / scale_) @ comps

(2) 特異値分解による実装

import numpy as np

def pca_svd(X, k=None, center=True, scale=False):
    """
    PCA via Singular Value Decomposition (no scikit-learn).

    Parameters
    ----------
    X : ndarray of shape (n_samples, n_features)
        Input data (rows = samples, cols = features).
    k : int or None
        Number of principal components to keep. If None, keep all.
    center : bool
        If True, subtract column means.
    scale : bool
        If True, standardize columns to unit variance after centering.

    Returns
    -------
    scores : ndarray, shape (n_samples, k)
        Projected data in PC space (principal component scores).
    components : ndarray, shape (n_features, k)
        Principal axes / loadings (columns are PCs).
    explained_variance : ndarray, shape (k,)
        Variance of each selected PC = s_i^2 / (n_samples - 1).
    explained_variance_ratio : ndarray, shape (k,)
        Fraction of total variance explained by each selected PC.
    mean_ : ndarray, shape (n_features,)
        Column means used for centering.
    scale_ : ndarray, shape (n_features,)
        Column scales used for standardization (1.0 if not scaled).
    """
    X = np.asarray(X, dtype=float)
    n_samples, n_features = X.shape

    # 1) Center (and optionally scale) the data
    mean_ = X.mean(axis=0) if center else np.zeros(n_features)
    Xc = X - mean_
    if scale:
        scale_ = Xc.std(axis=0, ddof=1)
        scale_[scale_ == 0.0] = 1.0  # protect zero-variance cols
        Xc = Xc / scale_
    else:
        scale_ = np.ones(n_features)

    # 2) Thin SVD
    # Xc = U @ diag(s) @ Vt, with U (n_samples x r), Vt (r x n_features)
    U, s, Vt = np.linalg.svd(Xc, full_matrices=False)

    # 3) Choose number of components
    r = min(n_samples, n_features)
    if k is None:
        k = r
    k = int(k)

    # 4) Components (loadings): columns are PCs
    components = Vt[:k, :].T  # shape (n_features, k)

    # 5) Scores (projected data)
    # scores = U @ diag(s) using only top-k singular values
    scores = U[:, :k] \* s[:k]  # broadcasting, shape (n_samples, k)

    # 6) Explained variance and ratio
    # Total variance of centered data = sum(s^2)/(n_samples - 1)
    explained_variance = (s[:k] ** 2) / (n_samples - 1)
    total_variance = (s ** 2).sum() / (n_samples - 1)
    explained_variance_ratio = explained_variance / total_variance if total_variance > 0 else np.zeros_like(explained_variance)

    return scores, components, explained_variance, explained_variance_ratio, mean_, scale_


# ---------- Example usage ----------
if **name** == "**main**":
    # Toy data: 6 samples, 3 features
    X = np.array([
        [2.5, 2.4, 0.5],
        [0.5, 0.7, 0.1],
        [2.2, 2.9, 0.6],
        [1.9, 2.2, 0.4],
        [3.1, 3.0, 0.9],
        [2.3, 2.7, 0.7]
    ])

    scores, comps, ev, evr, mean_, scale_ = pca_svd(X, k=2, center=True, scale=False)

    print("Mean:", mean_)
    print("Components (columns are PCs):\n", comps)
    print("Explained variance:", ev)
    print("Explained variance ratio:", evr)
    print("Scores (projected data):\n", scores)

    # To transform NEW data X_new using the fitted model:
    # X_new_scores = ((X_new - mean_) / scale_) @ comps

(3) Pythonのscikit-learnによる実装

# scikit-learn PCA examples
import numpy as np
from sklearn.decomposition import PCA
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline

# --- Toy data: rows = samples, cols = features ---
X = np.array([
    [2.5, 2.4, 0.5],
    [0.5, 0.7, 0.1],
    [2.2, 2.9, 0.6],
    [1.9, 2.2, 0.4],
    [3.1, 3.0, 0.9],
    [2.3, 2.7, 0.7]
], dtype=float)

# ============================================================
# 1) BASIC PCA (no standardization). PCA uses SVD internally.
# ============================================================
pca = PCA(n_components=2, svd_solver="auto", random_state=0)
scores = pca.fit_transform(X)   # shape: (n_samples, n_components)

print("=== BASIC PCA ===")
print("Mean (per feature):", pca.mean_)
print("Components (rows are PCs):\n", pca.components_)
print("Explained variance:", pca.explained_variance_)
print("Explained variance ratio:", pca.explained_variance_ratio_)
print("Singular values:", pca.singular_values_)
print("Scores (projected data):\n", scores)

# Reconstruct back (rank-2 approximation here equals original rank if 2 >= rank)
X_recon = pca.inverse_transform(scores)
print("Reconstruction:\n", X_recon)

# ============================================================
# 2) STANDARDIZED PCA (common in practice)
#    Standardize features to zero-mean, unit-variance before PCA.
# ============================================================
pipe = Pipeline(steps=[
    ("scaler", StandardScaler(with_mean=True, with_std=True)),
    ("pca", PCA(n_components=0.95, svd_solver="full", random_state=0))  # keep 95% variance
])
scores_std = pipe.fit_transform(X)

print("\n=== STANDARDIZED PCA (keep 95% variance) ===")
pca_std = pipe.named_steps["pca"]
print("n_components_ selected:", pca_std.n_components_)
print("Explained variance ratio (cumulative):", np.cumsum(pca_std.explained_variance_ratio_))
print("Scores:\n", scores_std)

# Transform NEW data with the same pipeline (fit parameters reused)
X_new = np.array([[2.8, 2.6, 0.8],
                  [1.0, 0.9, 0.2]], dtype=float)
X_new_scores = pipe.transform(X_new)
print("New data scores:\n", X_new_scores)

# Inverse transform to original feature space (after PCA+scaler)
X_new_recon = pipe.inverse_transform(X_new_scores)
print("New data reconstruction:\n", X_new_recon)

# ============================================================
# 3) PCA with WHITENING (optional)
#    Whitening scales PC scores by 1/sqrt(eigenvalue) -> unit variance per PC.
#    Useful for some downstream models, but it changes interpretability.
# ============================================================
pipe_whiten = Pipeline(steps=[
    ("scaler", StandardScaler()),
    ("pca", PCA(n_components=2, whiten=True, random_state=0))
])
scores_white = pipe_whiten.fit_transform(X)

print("\n=== PCA with WHITENING ===")
pca_w = pipe_whiten.named_steps["pca"]
print("Components (rows are PCs):\n", pca_w.components_)
print("Explained variance:", pca_w.explained_variance_)
print("Explained variance ratio:", pca_w.explained_variance_ratio_)
print("Whitened scores (unit variance per PC):\n", scores_white)