Spectral Clustering¶

논문 정보¶

항목	내용
제목	Normalized Cuts and Image Segmentation
저자	Jianbo Shi, Jitendra Malik
연도	2000
학회	IEEE TPAMI
링크	IEEE

항목	내용
제목	On Spectral Clustering: Analysis and an Algorithm
저자	Andrew Y. Ng, Michael I. Jordan, Yair Weiss
연도	2001
학회	NeurIPS
링크	NeurIPS

핵심 아이디어¶

K-Means의 한계: - 볼록(convex) 클러스터만 잘 찾음 - 복잡한 형태(초승달, 동심원)에 실패

Spectral Clustering의 접근: - 데이터를 그래프로 표현 (유사도 = 엣지 가중치) - 그래프 분할 문제로 클러스터링 정의 - Laplacian 행렬의 고유벡터로 저차원 임베딩 - 임베딩된 공간에서 K-Means 적용

핵심 통찰: 그래프의 연결 구조를 분석하면 비볼록 클러스터도 찾을 수 있음.

알고리즘¶

그래프 표현¶

유사도 행렬 (Affinity Matrix) $W$
Gaussian (RBF) kernel: $W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)$
k-NN: 가장 가까운 k개 이웃만 연결
차수 행렬 (Degree Matrix) $D$
대각 행렬: $D_{ii} = \sum_j W_{ij}$
Laplacian 행렬 $L$
Unnormalized: $L = D - W$
Normalized (symmetric): $L_{sym} = D^{-1/2} L D^{-1/2} = I - D^{-1/2} W D^{-1/2}$
Normalized (random walk): $L_{rw} = D^{-1} L = I - D^{-1} W$

알고리즘 절차 (Ng-Jordan-Weiss)¶

입력: 데이터 X = {x₁, ..., xₙ}, 클러스터 수 K
출력: 클러스터 라벨

1. 유사도 행렬 구축:
   W_ij = exp(-‖xᵢ - xⱼ‖² / 2σ²)  (i ≠ j)
   W_ii = 0

2. 차수 행렬 계산:
   D_ii = Σⱼ W_ij

3. Normalized Laplacian 계산:
   L_sym = I - D^(-1/2) W D^(-1/2)

4. 고유값 분해:
   L_sym의 가장 작은 K개 고유값에 해당하는 고유벡터 v₁, ..., vₖ 계산

5. 임베딩 행렬 구성:
   U = [v₁ | v₂ | ... | vₖ] ∈ R^(n×K)

6. 행 정규화:
   각 행 uᵢ를 단위 벡터로 정규화: ûᵢ = uᵢ / ‖uᵢ‖

7. K-Means 적용:
   정규화된 행 {û₁, ..., ûₙ}에 K-Means 클러스터링

8. 라벨 반환:
   각 점 xᵢ에 K-Means 결과 라벨 할당

그래프 컷 관점¶

목표: 그래프를 K개의 부분으로 분할, 부분 내 연결은 강하고 부분 간 연결은 약하게

Normalized Cut (NCut): $$\text{NCut}(A, B) = \frac{\text{cut}(A, B)}{\text{vol}(A)} + \frac{\text{cut}(A, B)}{\text{vol}(B)}$$

cut(A, B): A와 B 사이의 엣지 가중치 합
vol(A): A 내 노드들의 차수 합

이 최적화는 NP-hard이지만, 완화(relaxation)하면 고유벡터 문제로 변환됨.

시간 복잡도¶

단계	복잡도
유사도 행렬	O(n²d)
고유값 분해	O(n³) 또는 O(Kn²) (부분)
K-Means	O(tnK)
전체	O(n³) 또는 희소 행렬로 개선

장단점¶

장점¶

장점	설명
비볼록 클러스터	임의 형태 클러스터 발견
이론적 배경	그래프 이론, 선형대수
간단한 구현	고유벡터 + K-Means
차원 불변	유사도만 정의하면 됨

단점¶

단점	설명
K 지정 필요	클러스터 수 사전 결정
확장성	O(n³)로 대규모 데이터 어려움
파라미터 민감	σ (RBF 폭) 선택 중요
메모리	O(n²) 유사도 행렬

언제 쓰는가¶

적합한 경우¶

비볼록 클러스터 (초승달, 동심원, 복잡한 형태)
데이터 크기가 중소규모 (< 10K)
이미지 세그멘테이션
클러스터가 연결성으로 정의될 때
K-Means가 실패하는 경우

부적합한 경우¶

대규모 데이터 (Nyström 근사 필요)
클러스터 수를 모를 때
실시간 처리
구형 클러스터 (K-Means가 더 효율적)

Python 코드¶

import numpy as np
from sklearn.cluster import SpectralClustering, KMeans
from sklearn.datasets import make_moons, make_circles
from sklearn.metrics.pairwise import rbf_kernel
import matplotlib.pyplot as plt

# 복잡한 형태의 데이터
X_moons, y_moons = make_moons(n_samples=300, noise=0.05, random_state=42)
X_circles, y_circles = make_circles(n_samples=300, noise=0.05, factor=0.5, random_state=42)

# Spectral Clustering
def plot_comparison(X, y_true, title):
    fig, axes = plt.subplots(1, 3, figsize=(15, 4))

    # Ground truth
    axes[0].scatter(X[:, 0], X[:, 1], c=y_true, cmap='viridis', s=30)
    axes[0].set_title('Ground Truth')

    # K-Means (실패 예상)
    kmeans = KMeans(n_clusters=2, random_state=42)
    labels_km = kmeans.fit_predict(X)
    axes[1].scatter(X[:, 0], X[:, 1], c=labels_km, cmap='viridis', s=30)
    axes[1].set_title('K-Means (Fails!)')

    # Spectral Clustering
    spectral = SpectralClustering(
        n_clusters=2,
        affinity='rbf',          # 'rbf', 'nearest_neighbors', 'precomputed'
        gamma=10,                # RBF 파라미터 (1 / 2σ²)
        n_neighbors=10,          # k-NN affinity용
        assign_labels='kmeans',  # 'kmeans' 또는 'discretize'
        random_state=42
    )
    labels_sp = spectral.fit_predict(X)
    axes[2].scatter(X[:, 0], X[:, 1], c=labels_sp, cmap='viridis', s=30)
    axes[2].set_title('Spectral Clustering (Success!)')

    fig.suptitle(title, fontsize=14)
    plt.tight_layout()
    plt.show()

plot_comparison(X_moons, y_moons, 'Moon Dataset')
plot_comparison(X_circles, y_circles, 'Circles Dataset')

Affinity 행렬 직접 구축¶

from sklearn.metrics.pairwise import rbf_kernel
from scipy.sparse.linalg import eigsh
from scipy.sparse import csr_matrix

def spectral_clustering_manual(X, n_clusters, gamma=1.0):
    """Spectral Clustering 수동 구현 (Ng-Jordan-Weiss)"""
    n = len(X)

    # 1. 유사도 행렬 (RBF kernel)
    W = rbf_kernel(X, gamma=gamma)
    np.fill_diagonal(W, 0)  # 자기 자신과의 유사도 제거

    # 2. 차수 행렬
    D = np.diag(W.sum(axis=1))
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1) + 1e-10))

    # 3. Normalized Laplacian
    L_sym = np.eye(n) - D_inv_sqrt @ W @ D_inv_sqrt

    # 4. 고유값 분해 (가장 작은 K개)
    eigenvalues, eigenvectors = np.linalg.eigh(L_sym)
    U = eigenvectors[:, :n_clusters]

    # 5. 행 정규화
    U_normalized = U / np.linalg.norm(U, axis=1, keepdims=True)

    # 6. K-Means
    kmeans = KMeans(n_clusters=n_clusters, random_state=42)
    labels = kmeans.fit_predict(U_normalized)

    return labels, U_normalized, eigenvalues[:n_clusters]

labels, embedding, eigenvalues = spectral_clustering_manual(X_moons, n_clusters=2, gamma=10)

# 임베딩 공간 시각화
fig, axes = plt.subplots(1, 2, figsize=(12, 4))

axes[0].scatter(X_moons[:, 0], X_moons[:, 1], c=labels, cmap='viridis', s=30)
axes[0].set_title('Original Space')

axes[1].scatter(embedding[:, 0], embedding[:, 1], c=labels, cmap='viridis', s=30)
axes[1].set_title('Spectral Embedding')

plt.tight_layout()
plt.show()

print(f"Smallest eigenvalues: {eigenvalues}")

Gamma 선택¶

gammas = [0.1, 1, 10, 100]
fig, axes = plt.subplots(1, 4, figsize=(16, 3))

for ax, gamma in zip(axes, gammas):
    spectral = SpectralClustering(n_clusters=2, affinity='rbf', gamma=gamma, random_state=42)
    labels = spectral.fit_predict(X_moons)
    ax.scatter(X_moons[:, 0], X_moons[:, 1], c=labels, cmap='viridis', s=20)
    ax.set_title(f'gamma={gamma}')

plt.tight_layout()
plt.show()

k-NN Affinity¶

# k-NN 기반 affinity (희소 행렬)
spectral_knn = SpectralClustering(
    n_clusters=2,
    affinity='nearest_neighbors',
    n_neighbors=10,
    random_state=42
)
labels_knn = spectral_knn.fit_predict(X_moons)

plt.figure(figsize=(8, 6))
plt.scatter(X_moons[:, 0], X_moons[:, 1], c=labels_knn, cmap='viridis', s=30)
plt.title('Spectral Clustering with k-NN Affinity')
plt.show()

Eigengap으로 클러스터 수 추정¶

from scipy.sparse.linalg import eigsh

def estimate_n_clusters(X, gamma=1.0, max_k=10):
    """고유값 갭으로 클러스터 수 추정"""
    W = rbf_kernel(X, gamma=gamma)
    np.fill_diagonal(W, 0)
    D_inv_sqrt = np.diag(1.0 / np.sqrt(W.sum(axis=1) + 1e-10))
    L_sym = np.eye(len(X)) - D_inv_sqrt @ W @ D_inv_sqrt

    eigenvalues = np.linalg.eigvalsh(L_sym)[:max_k]

    # Eigengap
    gaps = np.diff(eigenvalues)

    fig, axes = plt.subplots(1, 2, figsize=(12, 4))

    axes[0].plot(range(1, max_k+1), eigenvalues, 'bo-')
    axes[0].set_xlabel('Index')
    axes[0].set_ylabel('Eigenvalue')
    axes[0].set_title('Eigenvalues of Laplacian')

    axes[1].bar(range(1, max_k), gaps)
    axes[1].set_xlabel('Gap Index')
    axes[1].set_ylabel('Eigengap')
    axes[1].set_title('Eigengap (largest = optimal K)')

    plt.tight_layout()
    plt.show()

    suggested_k = np.argmax(gaps) + 1
    print(f"Suggested number of clusters: {suggested_k}")
    return suggested_k

estimate_n_clusters(X_moons, gamma=10)

참고 문헌¶

Shi, J., & Malik, J. (2000). Normalized Cuts and Image Segmentation. IEEE TPAMI, 22(8), 888-905.
Ng, A. Y., Jordan, M. I., & Weiss, Y. (2001). On Spectral Clustering: Analysis and an Algorithm. NeurIPS.
Von Luxburg, U. (2007). A Tutorial on Spectral Clustering. Statistics and Computing, 17(4), 395-416.