비지도학습 개요¶

비지도학습은 레이블이 없는 데이터 $\{x_i\}_{i=1}^n$에서 숨겨진 구조, 패턴, 관계를 발견하는 기계학습 패러다임. 데이터의 분포 $P(X)$를 학습하거나 유용한 표현을 추출함.

이론적 기초¶

학습 목표¶

비지도학습의 주요 목표는 다음과 같다:

밀도 추정 (Density Estimation): $P(X)$ 추정
클러스터링 (Clustering): 유사한 데이터 그룹화
차원 축소 (Dimensionality Reduction): 저차원 표현 학습
이상치 탐지 (Anomaly Detection): 정상 패턴에서 벗어난 데이터 식별
표현 학습 (Representation Learning): 유용한 특성 자동 추출

클러스터링 목적 함수¶

Within-cluster variance 최소화:

\[\min_C \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2\]

Graph-based objective (Normalized Cut):

\[NCut(A, B) = \frac{cut(A, B)}{vol(A)} + \frac{cut(A, B)}{vol(B)}\]

클러스터링 (Clustering)¶

알고리즘 분류 체계¶

Clustering Algorithms
├── Centroid-based (Prototype-based)
│   ├── K-Means
│   ├── K-Means++
│   ├── K-Medoids (PAM)
│   ├── Mini-batch K-Means
│   └── Fuzzy C-Means
├── Density-based
│   ├── DBSCAN
│   ├── HDBSCAN
│   ├── OPTICS
│   └── DENCLUE
├── Hierarchical
│   ├── Agglomerative (Bottom-up)
│   ├── Divisive (Top-down)
│   └── BIRCH
├── Distribution-based
│   ├── Gaussian Mixture Model (GMM)
│   └── Bayesian GMM
├── Spectral
│   ├── Spectral Clustering
│   └── Normalized Cuts
└── Deep Learning-based
    ├── Deep Embedded Clustering (DEC)
    ├── Variational Deep Embedding (VaDE)
    └── Contrastive Clustering

Centroid-based Clustering¶

K-Means¶

Lloyd's algorithm (1982)로 구현되는 가장 대표적인 클러스터링 알고리즘:

목적 함수 (Inertia):

\[J = \sum_{k=1}^{K} \sum_{x_i \in C_k} \|x_i - \mu_k\|^2\]

알고리즘:

1. Initialize: 무작위로 K개의 중심점 선택
2. Assignment: 각 점을 가장 가까운 중심에 할당
   C_k = {x_i : ||x_i - μ_k|| ≤ ||x_i - μ_j|| for all j}
3. Update: 각 클러스터의 중심 재계산
   μ_k = (1/|C_k|) Σ_{x_i ∈ C_k} x_i
4. Repeat 2-3 until convergence

수렴 보장: 목적 함수가 단조 감소하며, 유한 시간 내 수렴

시간 복잡도: $O(nKdI)$ (n: 샘플 수, K: 클러스터 수, d: 차원, I: 반복 횟수)

한계: - K를 사전에 지정해야 함 - 구형(spherical) 클러스터 가정 - 초기값에 민감 - 이상치에 민감

참고 논문: - Lloyd, S.P. (1982). "Least Squares Quantization in PCM". IEEE Transactions on Information Theory. - MacQueen, J.B. (1967). "Some Methods for Classification and Analysis of Multivariate Observations". 5th Berkeley Symposium.

K-Means++¶

더 나은 초기화로 K-Means의 성능과 수렴 속도 개선:

초기화 알고리즘:

1. 첫 번째 중심점을 무작위로 선택
2. 각 점 x에 대해 가장 가까운 중심까지의 거리 D(x)^2 계산
3. D(x)^2에 비례하는 확률로 다음 중심점 선택
4. K개가 될 때까지 2-3 반복

이론적 보장:

\[\mathbb{E}[J_{k-means++}] \leq 8(\ln K + 2) \cdot J_{opt}\]

참고 논문: - Arthur, D. & Vassilvitskii, S. (2007). "K-means++: The Advantages of Careful Seeding". SODA.

K-Medoids (PAM)¶

중심점 대신 실제 데이터 포인트(medoid)를 사용:

\[\text{medoid}_k = \arg\min_{x_i \in C_k} \sum_{x_j \in C_k} d(x_i, x_j)\]

장점: - 이상치에 더 강건 - 임의의 거리 함수 사용 가능 - 해석 가능한 대표점

단점: - $O(K(n-K)^2)$ 복잡도로 느림

참고 논문: - Kaufman, L. & Rousseeuw, P.J. (1987). "Clustering by Means of Medoids". Statistical Data Analysis.

Density-based Clustering¶

DBSCAN¶

밀도 기반으로 임의 형태의 클러스터 발견:

핵심 개념: - $\epsilon$-neighborhood: $N_\epsilon(x) = \{y : d(x, y) \leq \epsilon\}$ - Core point: $|N_\epsilon(x)| \geq MinPts$ - Border point: Core point의 $\epsilon$-neighborhood 내에 있지만 자신은 Core가 아님 - Noise point: Core도 Border도 아닌 점

알고리즘:

1. 모든 점을 unvisited로 표시
2. 각 unvisited 점 p에 대해:
   a. visited로 표시
   b. N_ε(p) 계산
   c. |N_ε(p)| < MinPts이면 noise로 표시 (나중에 border가 될 수 있음)
   d. |N_ε(p)| ≥ MinPts이면 새 클러스터 C 시작
      - p와 N_ε(p)를 C에 추가
      - N_ε(p)의 각 점 q에 대해 확장

장점: - K를 미리 지정할 필요 없음 - 임의 형태의 클러스터 발견 - 이상치 자동 식별

단점: - $\epsilon$, MinPts 파라미터 선택 어려움 - 밀도가 다양한 클러스터 처리 어려움 - 고차원에서 거리 의미 퇴색

시간 복잡도: $O(n \log n)$ (공간 인덱스 사용 시), $O(n^2)$ (그렇지 않을 때)

참고 논문: - Ester, M. et al. (1996). "A Density-Based Algorithm for Discovering Clusters in Large Spatial Databases with Noise". KDD.

HDBSCAN¶

Hierarchical DBSCAN - 다양한 밀도의 클러스터 자동 탐지:

핵심 아이디어: 1. Mutual reachability distance로 그래프 구성 2. 최소 신장 트리 (MST) 생성 3. 클러스터 계층 구조 추출 4. 안정적인 클러스터 선택 (persistence)

Mutual reachability distance:

\[d_{mreach}(a, b) = \max\{core_k(a), core_k(b), d(a, b)\}\]

장점: - $\epsilon$ 파라미터 불필요 (자동 추정) - 다양한 밀도의 클러스터 처리 - 클러스터 안정성 점수 제공 - 노이즈 자동 탐지

참고 논문: - Campello, R.J.G.B. et al. (2013). "Density-Based Clustering Based on Hierarchical Density Estimates". PAKDD. - McInnes, L. et al. (2017). "HDBSCAN: Hierarchical Density Based Clustering". JOSS.

OPTICS¶

클러스터 구조를 시각화하는 순서 생성:

핵심 출력: - Reachability distance plot: 클러스터 구조 시각화 - 다양한 $\epsilon$에 대한 클러스터 추출 가능

참고 논문: - Ankerst, M. et al. (1999). "OPTICS: Ordering Points To Identify the Clustering Structure". SIGMOD.

Hierarchical Clustering¶

Agglomerative (병합)¶

Bottom-up 방식으로 클러스터 병합:

연결 기준 (Linkage):

방법	수식	특성
Single	$\min_{a \in A, b \in B} d(a, b)$	체인 효과, 긴 클러스터
Complete	$\max_{a \in A, b \in B} d(a, b)$	구형 클러스터, 이상치 민감
Average	$\frac{1}{	A
Ward	$\Delta(A, B) = n_A n_B/(n_A + n_B) \\|\mu_A - \mu_B\\|^2$	분산 최소화, 구형

시간 복잡도: $O(n^2 \log n)$ ~ $O(n^3)$

BIRCH¶

대용량 데이터를 위한 메모리 효율적 클러스터링:

CF (Clustering Feature) 트리: - $CF = (N, LS, SS)$ - N: 포인트 수 - LS: 선형 합 $\sum x_i$ - SS: 제곱 합 $\sum x_i^2$

장점: - 단일 스캔으로 처리 가능 - 메모리 제한 내에서 동작 - 증분 학습 가능

참고 논문: - Zhang, T. et al. (1996). "BIRCH: An Efficient Data Clustering Method for Very Large Databases". SIGMOD.

Distribution-based Clustering¶

Gaussian Mixture Model (GMM)¶

데이터가 K개의 가우시안 분포 혼합에서 생성됨고 가정:

\[P(x) = \sum_{k=1}^{K} \pi_k \mathcal{N}(x | \mu_k, \Sigma_k)\]

EM 알고리즘:

E-step (Expectation): $$\gamma_{ik} = \frac{\pi_k \mathcal{N}(x_i | \mu_k, \Sigma_k)}{\sum_j \pi_j \mathcal{N}(x_i | \mu_j, \Sigma_j)}$$

M-step (Maximization): $$\mu_k^{new} = \frac{\sum_i \gamma_{ik} x_i}{\sum_i \gamma_{ik}}$$ $$\Sigma_k^{new} = \frac{\sum_i \gamma_{ik} (x_i - \mu_k)(x_i - \mu_k)^T}{\sum_i \gamma_{ik}}$$ $$\pi_k^{new} = \frac{1}{n}\sum_i \gamma_{ik}$$

장점: - Soft assignment (확률적 소속) - 타원형 클러스터 모델링 - 통계적 기반

모델 선택: BIC, AIC로 K 결정

\[BIC = -2 \ln L + k \ln n\]

참고 논문: - Dempster, A.P. et al. (1977). "Maximum Likelihood from Incomplete Data via the EM Algorithm". JRSS-B.

Spectral Clustering¶

그래프 기반 접근법으로 비볼록 클러스터 발견:

알고리즘:

1. Similarity matrix W 구성 (예: RBF 커널)
   W_ij = exp(-||x_i - x_j||^2 / (2σ^2))
2. Degree matrix D 계산: D_ii = Σ_j W_ij
3. Laplacian 계산
   - Unnormalized: L = D - W
   - Normalized: L_sym = I - D^(-1/2)WD^(-1/2)
4. L의 첫 k개 eigenvector 계산
5. Eigenvector로 구성된 행렬에서 K-Means 적용

Normalized Cut 최소화:

\[NCut = \text{Tr}(H^T L_{sym} H)\]

장점: - 비볼록, 비선형 경계 - 매니폴드 구조 포착

단점: - $O(n^3)$ eigen-decomposition - 대용량 데이터에 부적합 (근사 필요)

참고 논문: - Ng, A.Y. et al. (2001). "On Spectral Clustering: Analysis and an Algorithm". NeurIPS. - Von Luxburg, U. (2007). "A Tutorial on Spectral Clustering". Statistics and Computing.

클러스터 수 결정¶

방법	원리	수식/접근
Elbow Method	Inertia 감소율 급변점	Plot K vs Inertia
Silhouette Score	응집도와 분리도 균형	$s(i) = \frac{b(i) - a(i)}{\max(a(i), b(i))}$
Gap Statistic	실제 vs 랜덤 분산 비교	$Gap_n(k) = E[\log W_k^*] - \log W_k$
BIC/AIC	정보량 기준	GMM에 적용
DBCV	밀도 기반 클러스터 검증	HDBSCAN과 함께

클러스터링 평가 지표¶

내부 지표 (레이블 없이):

지표	수식	범위	해석
Silhouette	$\frac{b - a}{\max(a, b)}$	[-1, 1]	높을수록 좋음
Calinski-Harabasz	$\frac{SS_B / (k-1)}{SS_W / (n-k)}$	[0, $\infty$)	높을수록 좋음
Davies-Bouldin	$\frac{1}{k}\sum_i \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}$	[0, $\infty$)	낮을수록 좋음

외부 지표 (정답 레이블과 비교):

지표	설명	범위
Adjusted Rand Index	무작위 조정된 Rand Index	[-1, 1]
Normalized Mutual Information	상호정보량 정규화	[0, 1]
V-measure	Homogeneity와 Completeness의 조화평균	[0, 1]

차원 축소 (Dimensionality Reduction)¶

알고리즘 분류 체계¶

Dimensionality Reduction
├── Linear Methods
│   ├── PCA (Principal Component Analysis)
│   ├── Factor Analysis
│   ├── ICA (Independent Component Analysis)
│   └── LDA (Linear Discriminant Analysis) *
├── Non-linear Methods (Manifold Learning)
│   ├── t-SNE
│   ├── UMAP
│   ├── Isomap
│   ├── LLE (Locally Linear Embedding)
│   ├── MDS (Multidimensional Scaling)
│   └── Laplacian Eigenmaps
├── Sparse Methods
│   ├── Sparse PCA
│   └── Dictionary Learning
└── Deep Learning
    ├── Autoencoder
    ├── Variational Autoencoder (VAE)
    └── Contrastive Learning

*LDA는 지도학습 방법이지만 차원 축소에도 사용

Principal Component Analysis (PCA)¶

데이터의 분산을 최대화하는 직교 방향(주성분) 탐색:

최적화 문제:

\[\max_w w^T S w \quad \text{s.t. } w^T w = 1\]

여기서 $S = \frac{1}{n}\sum_i (x_i - \bar{x})(x_i - \bar{x})^T$는 공분산 행렬.

해: $S$의 고유벡터가 주성분, 고유값이 설명 분산

알고리즘:

1. 데이터 중심화: X_centered = X - mean(X)
2. 공분산 행렬 계산: S = X_centered^T X_centered / (n-1)
3. 고유값 분해: S = VΛV^T
4. 상위 k개 고유벡터 선택
5. 투영: Z = X_centered @ V[:, :k]

설명 분산 비율:

\[\text{Explained Variance Ratio}_k = \frac{\lambda_k}{\sum_j \lambda_j}\]

Kernel PCA:

비선형 데이터를 위해 커널 트릭 적용:

\[K_{ij} = \phi(x_i)^T \phi(x_j) = k(x_i, x_j)\]

참고 논문: - Pearson, K. (1901). "On Lines and Planes of Closest Fit to Systems of Points in Space". Philosophical Magazine. - Hotelling, H. (1933). "Analysis of a Complex of Statistical Variables into Principal Components". Journal of Educational Psychology.

t-SNE (t-distributed Stochastic Neighbor Embedding)¶

고차원에서의 유사도를 저차원에서 보존하는 시각화 기법:

고차원 유사도 (Gaussian):

\[p_{j|i} = \frac{\exp(-\|x_i - x_j\|^2 / 2\sigma_i^2)}{\sum_{k \neq i} \exp(-\|x_i - x_k\|^2 / 2\sigma_i^2)}\]

\[p_{ij} = \frac{p_{j|i} + p_{i|j}}{2n}\]

저차원 유사도 (t-distribution with df=1):

\[q_{ij} = \frac{(1 + \|y_i - y_j\|^2)^{-1}}{\sum_{k \neq l}(1 + \|y_k - y_l\|^2)^{-1}}\]

목적 함수 (KL Divergence):

\[C = KL(P \| Q) = \sum_{i \neq j} p_{ij} \log \frac{p_{ij}}{q_{ij}}\]

특징: - 지역 구조 보존에 탁월 - 전역 구조는 왜곡될 수 있음 - Perplexity 파라미터 중요 (5-50) - 확률적이므로 매번 결과 다름

한계: - $O(n^2)$ 복잡도 (Barnes-Hut: $O(n \log n)$) - 새로운 데이터 투영 불가 (out-of-sample) - 클러스터 크기/거리 해석 주의

참고 논문: - Van der Maaten, L. & Hinton, G. (2008). "Visualizing Data using t-SNE". JMLR.

UMAP (Uniform Manifold Approximation and Projection)¶

t-SNE보다 빠르고 전역 구조를 더 잘 보존:

이론적 기반: - Riemannian 기하학 - 대수적 위상수학 (fuzzy simplicial sets)

고차원 그래프 (Fuzzy set membership):

\[\mu_{ij} = \exp\left(-\frac{d(x_i, x_j) - \rho_i}{\sigma_i}\right)\]

저차원 그래프:

\[\nu_{ij} = (1 + a \|y_i - y_j\|^{2b})^{-1}\]

목적 함수 (Cross-entropy):

\[CE = \sum_{ij} \left[ \mu_{ij} \log \frac{\mu_{ij}}{\nu_{ij}} + (1-\mu_{ij}) \log \frac{1-\mu_{ij}}{1-\nu_{ij}} \right]\]

장점: - t-SNE보다 빠름 ($O(n)$ 시간) - 전역 구조 더 잘 보존 - 새로운 데이터 투영 가능 (transform) - 결정적 결과 (random_state 고정 시) - 2D 이상 차원으로도 축소 가능

참고 논문: - McInnes, L. et al. (2018). "UMAP: Uniform Manifold Approximation and Projection for Dimension Reduction". arXiv.

방법 비교¶

방법	선형성	복잡도	전역 구조	지역 구조	새 데이터	용도
PCA	선형	$O(np^2)$	보존	일부	가능	전처리, 특성 추출
t-SNE	비선형	$O(n^2)$	왜곡	우수	불가	시각화
UMAP	비선형	$O(n)$	양호	우수	가능	시각화, 전처리
Isomap	비선형	$O(n^3)$	양호	양호	가능	매니폴드
LLE	비선형	$O(n^2)$	약함	우수	제한	매니폴드

이상치 탐지 (Anomaly Detection)¶

알고리즘 분류¶

Anomaly Detection
├── Statistical Methods
│   ├── Z-score
│   ├── IQR Method
│   └── Grubbs Test
├── Density-based
│   ├── LOF (Local Outlier Factor)
│   ├── DBSCAN (noise points)
│   └── KNN distance
├── Clustering-based
│   ├── Distance from centroid
│   └── Cluster membership
├── Model-based
│   ├── Isolation Forest
│   ├── One-Class SVM
│   └── Elliptic Envelope
└── Deep Learning
    ├── Autoencoder reconstruction
    └── Variational Autoencoder

Isolation Forest¶

랜덤 분할로 이상치 고립에 필요한 경로 길이 측정:

핵심 아이디어: - 이상치는 정상 데이터보다 적은 분할로 고립됨 - 경로 길이가 짧을수록 이상치 가능성 높음

이상 점수:

\[s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}\]

여기서 $h(x)$는 경로 길이, $c(n)$은 정규화 상수.

참고 논문: - Liu, F.T. et al. (2008). "Isolation Forest". ICDM.

Local Outlier Factor (LOF)¶

지역 밀도 대비 이웃의 밀도 비교:

\[LOF_k(x) = \frac{\sum_{y \in N_k(x)} \frac{lrd_k(y)}{lrd_k(x)}}{|N_k(x)|}\]

LOF $\approx$ 1: 정상
LOF $>$ 1: 이상치 (주변보다 밀도 낮음)
LOF $<$ 1: 밀집 지역 내

참고 논문: - Breunig, M.M. et al. (2000). "LOF: Identifying Density-Based Local Outliers". SIGMOD.

실무 적용 가이드¶

클러스터링 파이프라인¶

import numpy as np
from sklearn.preprocessing import StandardScaler
from sklearn.cluster import KMeans, DBSCAN
from sklearn.mixture import GaussianMixture
from sklearn.metrics import silhouette_score, calinski_harabasz_score
import hdbscan

# 데이터 전처리
scaler = StandardScaler()
X_scaled = scaler.fit_transform(X)

# 여러 알고리즘 비교
results = {}

# K-Means (여러 K에 대해)
for k in range(2, 11):
    kmeans = KMeans(n_clusters=k, random_state=42, n_init=10)
    labels = kmeans.fit_predict(X_scaled)

    results[f'kmeans_k{k}'] = {
        'labels': labels,
        'silhouette': silhouette_score(X_scaled, labels),
        'calinski': calinski_harabasz_score(X_scaled, labels),
        'inertia': kmeans.inertia_
    }

# HDBSCAN
clusterer = hdbscan.HDBSCAN(min_cluster_size=15, min_samples=5)
labels = clusterer.fit_predict(X_scaled)
mask = labels >= 0  # 노이즈 제외
if mask.sum() > 0:
    results['hdbscan'] = {
        'labels': labels,
        'silhouette': silhouette_score(X_scaled[mask], labels[mask]),
        'n_clusters': len(set(labels)) - (1 if -1 in labels else 0),
        'noise_ratio': (labels == -1).mean()
    }

# GMM (여러 컴포넌트에 대해)
for n in range(2, 11):
    gmm = GaussianMixture(n_components=n, random_state=42)
    labels = gmm.fit_predict(X_scaled)

    results[f'gmm_n{n}'] = {
        'labels': labels,
        'silhouette': silhouette_score(X_scaled, labels),
        'bic': gmm.bic(X_scaled),
        'aic': gmm.aic(X_scaled)
    }

# 최적 모델 선택
best_silhouette = max(results.items(), key=lambda x: x[1].get('silhouette', -1))
print(f"Best by Silhouette: {best_silhouette[0]} = {best_silhouette[1]['silhouette']:.4f}")

차원 축소 및 시각화¶

from sklearn.decomposition import PCA
from sklearn.manifold import TSNE
import umap
import matplotlib.pyplot as plt

# PCA
pca = PCA(n_components=2)
X_pca = pca.fit_transform(X_scaled)
print(f"PCA explained variance: {pca.explained_variance_ratio_.sum():.2%}")

# t-SNE
tsne = TSNE(n_components=2, perplexity=30, random_state=42)
X_tsne = tsne.fit_transform(X_scaled)

# UMAP
reducer = umap.UMAP(n_components=2, n_neighbors=15, min_dist=0.1, random_state=42)
X_umap = reducer.fit_transform(X_scaled)

# 시각화
fig, axes = plt.subplots(1, 3, figsize=(15, 5))

for ax, (X_2d, title) in zip(axes, [(X_pca, 'PCA'), (X_tsne, 't-SNE'), (X_umap, 'UMAP')]):
    scatter = ax.scatter(X_2d[:, 0], X_2d[:, 1], c=labels, cmap='tab10', s=10, alpha=0.7)
    ax.set_title(title)
    ax.set_xlabel('Component 1')
    ax.set_ylabel('Component 2')

plt.tight_layout()
plt.savefig('dimensionality_reduction_comparison.png', dpi=150)

클러스터 프로파일링¶

import pandas as pd

def profile_clusters(df, cluster_labels, numeric_cols, categorical_cols=None):
    """클러스터별 특성 프로파일링"""
    df = df.copy()
    df['cluster'] = cluster_labels

    # 수치형 변수 통계
    numeric_profile = df.groupby('cluster')[numeric_cols].agg(['mean', 'median', 'std'])

    # 전체 평균 대비 비율
    overall_mean = df[numeric_cols].mean()
    cluster_means = df.groupby('cluster')[numeric_cols].mean()
    relative = (cluster_means / overall_mean - 1) * 100  # 퍼센트 차이

    # 범주형 변수 분포
    if categorical_cols:
        for col in categorical_cols:
            cat_dist = df.groupby('cluster')[col].value_counts(normalize=True)
            print(f"\n{col} distribution by cluster:")
            print(cat_dist.unstack(fill_value=0))

    # 클러스터 크기
    cluster_sizes = df['cluster'].value_counts().sort_index()
    print(f"\nCluster sizes:\n{cluster_sizes}")
    print(f"\nCluster proportions:\n{cluster_sizes / len(df) * 100:.1f}%")

    return numeric_profile, relative

# 사용 예시
numeric_cols = ['age', 'income', 'purchase_frequency', 'avg_order_value']
profile, relative = profile_clusters(df, labels, numeric_cols)
print("\nRelative to overall mean (%):")
print(relative.round(1))

하위 문서¶

클러스터링¶

유형	알고리즘	링크
Centroid-based	K-Means, K-Means++, K-Medoids	clustering/centroid-based/
Density-based	DBSCAN, HDBSCAN, OPTICS	clustering/density-based/
Hierarchical	Agglomerative, BIRCH	clustering/hierarchical/
Distribution-based	GMM	clustering/distribution-based/
Spectral	Spectral Clustering	clustering/spectral/
Evaluation	Silhouette, Davies-Bouldin	clustering/evaluation/

참고 문헌¶

교과서¶

Hastie, T., Tibshirani, R., & Friedman, J. (2009). "The Elements of Statistical Learning". Springer. (Ch. 14)
Bishop, C.M. (2006). "Pattern Recognition and Machine Learning". Springer. (Ch. 9)
Aggarwal, C.C. & Reddy, C.K. (2013). "Data Clustering: Algorithms and Applications". CRC Press.

핵심 논문¶

Lloyd, S.P. (1982). "Least Squares Quantization in PCM". IEEE TIT.
Ester, M. et al. (1996). "DBSCAN". KDD.
Arthur, D. & Vassilvitskii, S. (2007). "K-means++". SODA.
Campello, R.J.G.B. et al. (2013). "HDBSCAN". PAKDD.
McInnes, L. et al. (2018). "UMAP". arXiv.
Van der Maaten, L. & Hinton, G. (2008). "t-SNE". JMLR.

실무 가이드¶

scikit-learn Clustering: https://scikit-learn.org/stable/modules/clustering.html
hdbscan Documentation: https://hdbscan.readthedocs.io/
UMAP Documentation: https://umap-learn.readthedocs.io/

방법	수식	특성
Single	\(\min_{a \in A, b \in B} d(a, b)\)	체인 효과, 긴 클러스터
Complete	\(\max_{a \in A, b \in B} d(a, b)\)	구형 클러스터, 이상치 민감
Average	$\frac{1}{	A
Ward	\(\Delta(A, B) = n_A n_B/(n_A + n_B) \\|\mu_A - \mu_B\\|^2\)	분산 최소화, 구형

지표	수식	범위	해석
Silhouette	\(\frac{b - a}{\max(a, b)}\)	[-1, 1]	높을수록 좋음
Calinski-Harabasz	\(\frac{SS_B / (k-1)}{SS_W / (n-k)}\)	[0, \(\infty\))	높을수록 좋음
Davies-Bouldin	\(\frac{1}{k}\sum_i \max_{j \neq i} \frac{\sigma_i + \sigma_j}{d(c_i, c_j)}\)	[0, \(\infty\))	낮을수록 좋음

방법	선형성	복잡도	전역 구조	지역 구조	새 데이터	용도
PCA	선형	\(O(np^2)\)	보존	일부	가능	전처리, 특성 추출
t-SNE	비선형	\(O(n^2)\)	왜곡	우수	불가	시각화
UMAP	비선형	\(O(n)\)	양호	우수	가능	시각화, 전처리
Isomap	비선형	\(O(n^3)\)	양호	양호	가능	매니폴드
LLE	비선형	\(O(n^2)\)	약함	우수	제한	매니폴드