Anomaly Detection Overview

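Anomaly detection is a technique for identifying data points that deviate substantially from normal patterns. It is used in fraud detection, network intrusion detection, manufacturing defect detection, and medical anomaly diagnosis.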

Core Concepts

Anomaly Types

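Type	Description	Example
Point Anomaly	A single data point is anomalous on its own	An unusually large transaction amount
Contextual Anomaly	Anomalous only in a specific context	Heating use in summer
Collective Anomaly	A group of points is anomalous together	A run of consecutive abnormal log entries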

Learning Settings

Setting	Labels	Characteristics
Supervised	Both normal and anomalous	Imbalanced classification problem
Semi-supervised	Normal only	One-class learning
Unsupervised	None	Assumes most data are normal

Algorithm Taxonomy

Anomaly Detection
├── Statistical Methods
│   ├── Z-score / Modified Z-score
│   ├── IQR (Interquartile Range)
│   ├── Grubbs Test
│   └── Mahalanobis Distance
├── Distance-based
│   ├── k-NN Distance
│   └── k-NN with Local Outlier
├── Density-based
│   ├── LOF (Local Outlier Factor)
│   ├── LOCI
│   └── COF
├── Clustering-based
│   ├── DBSCAN (noise points)
│   ├── k-Means distance
│   └── Cluster membership
├── Tree-based
│   ├── Isolation Forest
│   └── Extended Isolation Forest
├── One-class Methods
│   ├── One-class SVM (OCSVM)
│   └── SVDD
├── Deep Learning
│   ├── Autoencoder
│   ├── Variational Autoencoder (VAE)
│   ├── GAN-based (AnoGAN)
│   └── Deep SVDD
└── Time Series
    ├── ARIMA residuals
    ├── Prophet anomaly
    └── LSTM Autoencoder

Statistical Methods

Z-score

\[z = \frac{x - \mu}{\sigma}\]

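A point is flagged as an anomaly when $|z| > 3$, that is, when it falls outside the range that contains about 99.7% of values under a normal distribution.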

Modified Z-score (MAD-based):

\[M_i = \frac{0.6745(x_i - \tilde{x})}{MAD}\]

where $MAD = \text{median}(|x_i - \tilde{x}|)$ and $\tilde{x}$ is the sample median.

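Because it uses the median and MAD instead of the mean and standard deviation, this score is more robust to outliers.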
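The two scores can be compared on a small sample. The following is a minimal NumPy sketch; the function names `z_scores` and `modified_z_scores`, the sample values, and the 3.5 cutoff for the modified score (a commonly used rule of thumb, not taken from this document) are assumptions for illustration.

```python
import numpy as np

def z_scores(x):
    """Classical z-score: (x - mean) / standard deviation."""
    x = np.asarray(x, dtype=float)
    return (x - x.mean()) / x.std()

def modified_z_scores(x):
    """Modified z-score: 0.6745 * (x - median) / MAD."""
    x = np.asarray(x, dtype=float)
    med = np.median(x)
    mad = np.median(np.abs(x - med))
    return 0.6745 * (x - med) / mad  # assumes MAD > 0

# Seven ordinary readings and one extreme value (illustrative data)
data = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 50.0])
z = z_scores(data)
m = modified_z_scores(data)

print(np.abs(z) > 3)    # classical rule
print(np.abs(m) > 3.5)  # assumed cutoff for the modified score
```

In this sample the extreme value inflates the mean and standard deviation enough that its classical z-score stays below 3, while the modified score flags it. This masking effect is the practical reason to prefer the MAD-based variant on small samples.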

IQR Method

\[\text{Lower} = Q_1 - 1.5 \times IQR\]

\[\text{Upper} = Q_3 + 1.5 \times IQR\]

Here $IQR = Q_3 - Q_1$. Data outside the range from Lower to Upper are treated as outliers.
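As a quick illustration of the fences, here is a minimal NumPy sketch. The helper name `iqr_bounds` and the sample values are assumptions; note that `np.percentile` uses linear interpolation by default, and other quartile conventions can give slightly different bounds.

```python
import numpy as np

def iqr_bounds(x, k=1.5):
    """Return (lower, upper) fences Q1 - k*IQR and Q3 + k*IQR."""
    q1, q3 = np.percentile(x, [25, 75])
    iqr = q3 - q1
    return q1 - k * iqr, q3 + k * iqr

# Illustrative data: seven ordinary readings and one extreme value
data = np.array([10.0, 11.0, 9.0, 10.5, 9.5, 10.2, 9.8, 50.0])
lower, upper = iqr_bounds(data)
outliers = data[(data < lower) | (data > upper)]

print(lower, upper)
print(outliers)
```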

Isolation Forest

Measures the number of random splits needed to isolate a point:

Key idea:

- Anomalies are isolated with fewer splits than normal points
- The shorter the average path length, the more likely the point is an anomaly

Anomaly score:

\[s(x, n) = 2^{-\frac{E[h(x)]}{c(n)}}\]

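where:

- $h(x)$: path length of sample $x$
- $c(n)$: normalization constant (the average path length of an unsuccessful search in a binary search tree built on $n$ points)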
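To make the score concrete, the sketch below evaluates $c(n)$ and $s(x, n)$ directly. It assumes the closed form from the Isolation Forest paper, $c(n) = 2H(n-1) - 2(n-1)/n$ with $H(i) \approx \ln(i) + 0.5772156649$; the function names and the example path lengths are illustrative only.

```python
import math

EULER_GAMMA = 0.5772156649

def c(n):
    """Average path length of an unsuccessful BST search over n points."""
    if n <= 1:
        return 0.0
    if n == 2:
        return 1.0
    harmonic = math.log(n - 1) + EULER_GAMMA  # approximation of H(n-1)
    return 2.0 * harmonic - 2.0 * (n - 1) / n

def anomaly_score(mean_path_length, n):
    """s(x, n) = 2 ** (-E[h(x)] / c(n))."""
    return 2.0 ** (-mean_path_length / c(n))

n = 256  # default subsample size in the original paper
for h in (2.0, c(n), 20.0):
    print(f"E[h(x)] = {h:6.2f} -> s = {anomaly_score(h, n):.3f}")
```

A mean path length equal to $c(n)$ gives a score of exactly 0.5; shorter paths push the score toward 1 (anomalous) and longer paths toward 0 (normal). scikit-learn reports the negated version of this score, so there lower values mean more anomalous.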

from sklearn.ensemble import IsolationForest
import numpy as np

# Fit the model; X is an array of shape (n_samples, n_features)
clf = IsolationForest(
    n_estimators=100,
    contamination=0.05,  # expected fraction of anomalies
    random_state=42
)
clf.fit(X)

# Predict (-1: anomaly, 1: normal)
predictions = clf.predict(X)
anomaly_scores = clf.decision_function(X)  # lower means more anomalous

# Extract the anomalies
anomalies = X[predictions == -1]

References:

- Liu, F.T. et al. (2008). "Isolation Forest". ICDM.
- Liu, F.T. et al. (2012). "Isolation-Based Anomaly Detection". TKDD.

Local Outlier Factor (LOF)

Compares a point's local density with the local densities of its neighbors:

Local Reachability Density:

\[lrd_k(x) = \frac{1}{\frac{\sum_{o \in N_k(x)} reach\text{-}dist_k(x, o)}{|N_k(x)|}}\]

LOF:

\[LOF_k(x) = \frac{\sum_{o \in N_k(x)} \frac{lrd_k(o)}{lrd_k(x)}}{|N_k(x)|}\]

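LOF $\approx$ 1: normal (density similar to that of its neighbors)
LOF $>$ 1: lower density than its neighbors; the further above 1, the stronger the evidence of an anomaly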
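The two formulas can be traced on a toy dataset. Below is a brute-force NumPy sketch, not a replacement for a library implementation: the function name `lof_scores` and the grid data are assumptions, it takes exactly $k$ neighbors (ties in distance are broken arbitrarily), and it assumes no duplicate points.

```python
import numpy as np

def lof_scores(X, k):
    """Brute-force LOF for a small dataset X of shape (n_samples, n_features)."""
    X = np.asarray(X, dtype=float)
    n = len(X)
    # Pairwise Euclidean distances; a point is not its own neighbor.
    dist = np.linalg.norm(X[:, None, :] - X[None, :, :], axis=-1)
    np.fill_diagonal(dist, np.inf)

    # Indices of the k nearest neighbors and each point's k-distance.
    neighbors = np.argsort(dist, axis=1)[:, :k]
    k_dist = dist[np.arange(n), neighbors[:, -1]]

    # reach-dist_k(x, o) = max(k-distance(o), d(x, o))
    d_to_neighbors = dist[np.arange(n)[:, None], neighbors]
    reach = np.maximum(k_dist[neighbors], d_to_neighbors)

    # lrd_k(x): inverse of the mean reachability distance to the neighbors.
    lrd = 1.0 / reach.mean(axis=1)

    # LOF_k(x): mean ratio of the neighbors' lrd to the point's own lrd.
    return lrd[neighbors].mean(axis=1) / lrd

# A 5x5 grid of regular points plus one isolated point at (10, 10)
grid = np.array([[i, j] for i in range(5) for j in range(5)], dtype=float)
X = np.vstack([grid, [[10.0, 10.0]]])
scores = lof_scores(X, k=3)

print(scores.round(2))
```

The grid points score close to 1, while the isolated point scores several times higher because its reachability distances are much larger than those of its neighbors.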

from sklearn.neighbors import LocalOutlierFactor

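lof = LocalOutlierFactor(
    n_neighbors=20,
    contamination=0.05,
    novelty=True  # required to call predict/score_samples on unseen data after fit
)
lof.fit(X_train)

# Predict on new data (-1: anomaly, 1: normal)
predictions = lof.predict(X_test)
# negative_outlier_factor_ only holds scores for X_train; use score_samples for X_test
scores = lof.score_samples(X_test)  # lower means more anomalous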

References:

- Breunig, M.M. et al. (2000). "LOF: Identifying Density-Based Local Outliers". SIGMOD.

One-class SVM

Learns a decision boundary from normal data only:

\[\min_{w, \rho, \xi} \frac{1}{2}\|w\|^2 + \frac{1}{\nu n}\sum_i \xi_i - \rho\]

\[\text{s.t. } w^T \phi(x_i) \geq \rho - \xi_i, \quad \xi_i \geq 0\]
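Once the problem is solved, a new point is classified by the side of the hyperplane it falls on. As a sketch of the standard result (the dual coefficients $\alpha_i$ and the kernel $k(x_i, x) = \phi(x_i)^T \phi(x)$ are notation introduced here for illustration):

\[f(x) = \operatorname{sgn}\left(w^T \phi(x) - \rho\right) = \operatorname{sgn}\left(\sum_i \alpha_i k(x_i, x) - \rho\right)\]

$f(x) = +1$ marks a normal point and $f(x) = -1$ an anomaly. The parameter $\nu$ is an upper bound on the fraction of training points with $\xi_i > 0$ and a lower bound on the fraction of support vectors.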

from sklearn.svm import OneClassSVM

ocsvm = OneClassSVM(
    kernel='rbf',
    gamma='scale',
    nu=0.05  # upper bound on the fraction of training points treated as outliers
)
ocsvm.fit(X_train)

predictions = ocsvm.predict(X_test)

Autoencoder-Based Detection

Key idea:

- Train an autoencoder on normal data
- Anomalies have a large reconstruction error

\[\text{Anomaly Score} = \|x - \hat{x}\|^2\]

import torch
import torch.nn as nn

class Autoencoder(nn.Module):
    def __init__(self, input_dim, latent_dim):
        super().__init__()
        self.encoder = nn.Sequential(
            nn.Linear(input_dim, 64),
            nn.ReLU(),
            nn.Linear(64, latent_dim)
        )
        self.decoder = nn.Sequential(
            nn.Linear(latent_dim, 64),
            nn.ReLU(),
            nn.Linear(64, input_dim)
        )

    def forward(self, x):
        z = self.encoder(x)
        x_recon = self.decoder(z)
        return x_recon

# Training (normal data only); X_normal is a float tensor of shape (n_samples, input_dim)
model = Autoencoder(input_dim, latent_dim=16)
criterion = nn.MSELoss()
optimizer = torch.optim.Adam(model.parameters())

for epoch in range(100):
    x_recon = model(X_normal)
    loss = criterion(x_recon, X_normal)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

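# Anomaly detection
model.eval()
with torch.no_grad():
    # Set the threshold from errors on normal data, so it does not depend on
    # how many anomalies the test set happens to contain
    train_error = ((X_normal - model(X_normal)) ** 2).mean(dim=1)
    threshold = train_error.quantile(0.95)

    x_recon = model(X_test)
    reconstruction_error = ((X_test - x_recon) ** 2).mean(dim=1)
    anomalies = reconstruction_error > threshold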

Variational Autoencoder (VAE)

Combines a probabilistic latent space with the reconstruction error:

\[\text{Anomaly Score} = -\text{ELBO} = \mathcal{L}_{recon} + D_{KL}(q(z|x) \| p(z))\]
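The score can be computed in closed form under common modeling choices. The sketch below assumes a diagonal Gaussian encoder $q(z|x) = \mathcal{N}(\mu, \text{diag}(e^{\text{logvar}}))$, a standard normal prior, and a squared-error reconstruction term; the function name and the input arrays (which would normally come from a trained VAE) are hypothetical.

```python
import numpy as np

def vae_anomaly_score(x, x_recon, mu, logvar):
    """Per-sample negative ELBO under the assumptions stated above."""
    # Reconstruction term: squared error summed over features.
    recon = np.sum((x - x_recon) ** 2, axis=1)
    # Closed-form KL( N(mu, diag(exp(logvar))) || N(0, I) ).
    kl = -0.5 * np.sum(1.0 + logvar - mu ** 2 - np.exp(logvar), axis=1)
    return recon + kl

# Hypothetical encoder/decoder outputs for two samples
x = np.array([[1.0, 2.0], [1.0, 2.0]])
x_recon = np.array([[1.0, 2.0], [3.0, 0.0]])  # second sample is poorly reconstructed
mu = np.zeros((2, 2))
logvar = np.zeros((2, 2))

scores = vae_anomaly_score(x, x_recon, mu, logvar)
print(scores)
```

With the posterior equal to the prior the KL term vanishes, so the score reduces to the reconstruction error; a posterior that drifts away from the prior raises the score even when reconstruction is good.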

Time Series Anomaly Detection

Seasonal-Trend Decomposition

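import numpy as np
from statsmodels.tsa.seasonal import STL

# series: pandas Series (or 1-D array) of regularly spaced observations
stl = STL(series, period=24)  # period=24 assumes hourly data with a daily cycle
result = stl.fit()

# Flag points whose residual exceeds three standard deviations
residual = result.resid
threshold = 3 * residual.std()
anomalies = np.abs(residual) > threshold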

LSTM Autoencoder

class LSTMAutoencoder(nn.Module):
    def __init__(self, input_dim, hidden_dim, seq_len):
        super().__init__()
        self.encoder = nn.LSTM(input_dim, hidden_dim, batch_first=True)
        self.decoder = nn.LSTM(hidden_dim, hidden_dim, batch_first=True)
        self.output = nn.Linear(hidden_dim, input_dim)
        self.seq_len = seq_len

    def forward(self, x):
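        # Encode: keep the final hidden state h, shape (1, batch, hidden_dim)
        _, (h, c) = self.encoder(x)

        # Decode: repeat h along the time axis -> (batch, seq_len, hidden_dim)
        decoder_input = h.repeat(self.seq_len, 1, 1).transpose(0, 1)
        output, _ = self.decoder(decoder_input)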
        return self.output(output)

Evaluation Metrics

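Metric	Description
Precision	Fraction of detected anomalies that are true anomalies
Recall	Fraction of true anomalies that are detected
F1-score	Harmonic mean of Precision and Recall
AUC-ROC	Performance aggregated over all thresholds
AUC-PR	Well suited to imbalanced data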
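The first three metrics follow directly from the counts of true positives, false positives, and false negatives. A minimal NumPy sketch, assuming labels are encoded as 1 = anomaly and 0 = normal (the function name and the example labels are illustrative):

```python
import numpy as np

def precision_recall_f1(y_true, y_pred):
    """y_true, y_pred: arrays with 1 = anomaly, 0 = normal."""
    y_true = np.asarray(y_true).astype(bool)
    y_pred = np.asarray(y_pred).astype(bool)
    tp = np.sum(y_true & y_pred)    # anomalies correctly flagged
    fp = np.sum(~y_true & y_pred)   # normal points wrongly flagged
    fn = np.sum(y_true & ~y_pred)   # anomalies missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return precision, recall, f1

# Two true anomalies among ten points; the detector flags three points
y_true = [0, 0, 0, 0, 0, 0, 0, 0, 1, 1]
y_pred = [0, 0, 0, 0, 0, 0, 1, 1, 1, 0]
precision, recall, f1 = precision_recall_f1(y_true, y_pred)

print(precision, recall, f1)
```

scikit-learn detectors return -1 for anomalies and 1 for normal points, so their output would need converting first, for example with `(predictions == -1).astype(int)`.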

Practical Application Guide

Method Selection

Choosing an anomaly detection method
│
├── Labels available?
│   └── Yes → Imbalanced classification (SMOTE, Class Weight)
│
├── Normal data only?
│   └── Yes → One-class SVM, Autoencoder
│
├── High-dimensional?
│   └── Yes → Isolation Forest, Autoencoder
│
├── Interpretability required?
│   └── Yes → Isolation Forest, LOF
│
└── Time series?
    └── Yes → LSTM-AE, STL, Prophet

References

Key Papers

Liu, F.T. et al. (2008). "Isolation Forest". ICDM.
Breunig, M.M. et al. (2000). "LOF: Identifying Density-Based Local Outliers". SIGMOD.
Schölkopf, B. et al. (2001). "Estimating the Support of a High-Dimensional Distribution". Neural Computation.

Libraries

PyOD: https://pyod.readthedocs.io/
Alibi Detect: https://github.com/SeldonIO/alibi-detect
scikit-learn: IsolationForest, LocalOutlierFactor, OneClassSVM