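Supervised Learning Overview¶

Supervised learning is the machine learning paradigm that learns a mapping function $f: X \rightarrow Y$ between input variables $X$ and a target variable $Y$. It extracts generalizable patterns from labeled training data $\{(x_i, y_i)\}_{i=1}^n$.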

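Theoretical Foundations¶

Statistical Learning Theory¶

The goal of supervised learning is to minimize the expected risk, the expected value of a loss function $L$ under the data distribution $P$: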

\[R(f) = \mathbb{E}_{(X,Y) \sim P}[L(Y, f(X))]\]

In practice, however, $P$ is unknown, so the empirical risk over the training sample is minimized instead:

\[\hat{R}(f) = \frac{1}{n} \sum_{i=1}^{n} L(y_i, f(x_i))\]
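The empirical risk is simply an average of per-sample losses, which a few lines of NumPy can make concrete. The sketch below assumes a squared loss and a hypothetical toy sample; `squared_loss`, `empirical_risk`, `f_good`, and `f_bad` are illustrative names, not part of any library.

```python
import numpy as np

def squared_loss(y_true, y_pred):
    """Pointwise squared loss L(y, f(x)) = (y - f(x))^2."""
    return (y_true - y_pred) ** 2

def empirical_risk(f, X, y, loss=squared_loss):
    """Average loss of predictor f over the sample {(x_i, y_i)}."""
    return float(np.mean(loss(y, f(X))))

# Hypothetical toy sample in which y is roughly 2x.
X = np.array([0.0, 1.0, 2.0, 3.0])
y = np.array([0.1, 1.9, 4.2, 5.8])

f_good = lambda x: 2.0 * x   # candidate hypothesis close to the data
f_bad = lambda x: 0.5 * x    # candidate hypothesis far from the data

risk_good = empirical_risk(f_good, X, y)
risk_bad = empirical_risk(f_bad, X, y)
print(risk_good, risk_bad)
```

Empirical risk minimization would pick `f_good` here because it has the smaller average loss on the sample.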

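According to VC theory (Vapnik & Chervonenkis, 1971), with high probability and up to logarithmic factors, the generalization error is bounded above as follows:

\[R(f) \leq \hat{R}(f) + O\left(\sqrt{\frac{d_{VC}}{n}}\right)\]

where $d_{VC}$ is the VC dimension, a measure of the complexity of the hypothesis space.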

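Concept	Definition	Practical meaning
Bias	$\mathbb{E}[\hat{f}(x)] - f(x)$	How far the model's assumptions are from reality
Variance	$\mathbb{E}[(\hat{f}(x) - \mathbb{E}[\hat{f}(x)])^2]$	Sensitivity to changes in the training data
Irreducible Error	Noise inherent in the data	Error that cannot be reduced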

References:
- Vapnik, V.N. & Chervonenkis, A.Y. (1971). "On the Uniform Convergence of Relative Frequencies of Events to Their Probabilities". Theory of Probability and Its Applications.
- Vapnik, V.N. (1999). "The Nature of Statistical Learning Theory". Springer.

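Classification¶

Overview¶

Classification is the problem of predicting a discrete class label $Y \in \{1, 2, \ldots, K\}$. The predicted label is the class with the highest posterior probability: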

\[\hat{y} = \arg\max_{k} P(Y=k|X=x)\]

Algorithm Taxonomy¶

Classification Algorithms
├── Linear Models
│   ├── Logistic Regression
│   ├── Linear Discriminant Analysis (LDA)
│   ├── Naive Bayes
│   └── Support Vector Machine (linear kernel)
├── Non-linear Models
│   ├── Kernel SVM (RBF, Polynomial)
│   ├── k-Nearest Neighbors
│   └── Gaussian Processes
├── Tree-based Models
│   ├── Decision Tree (CART, ID3, C4.5)
│   ├── Random Forest
│   ├── Gradient Boosting (XGBoost, LightGBM, CatBoost)
│   └── Extra Trees
└── Neural Networks
    ├── Multi-layer Perceptron
    ├── Convolutional Neural Networks
    └── Transformer-based Models

Linear Classification Models¶

Logistic Regression¶

A linear model for binary classification that outputs a probability through the sigmoid function:

\[P(Y=1|X=x) = \sigma(w^T x + b) = \frac{1}{1 + e^{-(w^T x + b)}}\]

Loss function (binary cross-entropy with an L2 penalty, where $\hat{p}_i = \sigma(w^T x_i + b)$):

\[\mathcal{L}(w) = -\frac{1}{n}\sum_{i=1}^{n}[y_i \log(\hat{p}_i) + (1-y_i)\log(1-\hat{p}_i)] + \lambda \|w\|_2^2\]

Multiclass extension (softmax regression):

\[P(Y=k|X=x) = \frac{e^{w_k^T x}}{\sum_{j=1}^{K} e^{w_j^T x}}\]
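The three formulas above map almost one-to-one onto NumPy. The following is a minimal sketch rather than a production implementation: the function names, the clipping constant `eps`, and the toy arrays are assumptions made for illustration.

```python
import numpy as np

def sigmoid(z):
    """Logistic function sigma(z) = 1 / (1 + exp(-z))."""
    return 1.0 / (1.0 + np.exp(-z))

def bce_loss(w, b, X, y, lam=0.0, eps=1e-12):
    """Binary cross-entropy plus the L2 penalty lam * ||w||_2^2."""
    p = np.clip(sigmoid(X @ w + b), eps, 1.0 - eps)  # avoid log(0)
    data_term = -np.mean(y * np.log(p) + (1.0 - y) * np.log(1.0 - p))
    return float(data_term + lam * np.sum(w ** 2))

def softmax(scores):
    """Row-wise softmax; subtracting the max improves numerical stability."""
    shifted = scores - np.max(scores, axis=-1, keepdims=True)
    e = np.exp(shifted)
    return e / np.sum(e, axis=-1, keepdims=True)

# Hypothetical toy data: three samples, two features.
X = np.array([[0.0, 1.0], [1.0, 0.0], [1.0, 1.0]])
y = np.array([0.0, 1.0, 1.0])

loss_at_zero = bce_loss(np.zeros(2), 0.0, X, y)      # every p_i is 0.5
probs = softmax(np.array([[1.0, 2.0, 3.0]]))         # one row of K=3 scores
print(loss_at_zero, probs)
```

With all-zero weights every predicted probability is 0.5, so the loss equals $\log 2$. With $K = 2$ and one score fixed at zero, softmax reduces to the sigmoid.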

Linear Discriminant Analysis (LDA)¶

Fisher's LDA (1936) finds the projection direction that maximizes the between-class variance relative to the within-class variance:

\[J(w) = \frac{w^T S_B w}{w^T S_W w}\]

where:
- $S_B$: between-class scatter matrix
- $S_W$: within-class scatter matrix

Assumptions:
- The data in each class follow a normal distribution
- All classes share the same covariance matrix (homoscedasticity)
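For two classes, the maximizer of $J(w)$ has the well-known closed form $w \propto S_W^{-1}(\mu_1 - \mu_0)$. The sketch below checks this numerically on synthetic data. The helper names, the mixing matrix `mix`, and the class offsets are hypothetical choices, and $S_B$ is taken as the outer product of the mean difference, which equals the two-class between-class scatter up to a constant factor.

```python
import numpy as np

def scatter_matrices(X0, X1):
    """Return (S_B, S_W) for a two-class problem."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    diff = (mu1 - mu0).reshape(-1, 1)
    S_B = diff @ diff.T
    S_W = (X0 - mu0).T @ (X0 - mu0) + (X1 - mu1).T @ (X1 - mu1)
    return S_B, S_W

def fisher_criterion(w, S_B, S_W):
    """Rayleigh quotient J(w) = (w^T S_B w) / (w^T S_W w)."""
    return float(w @ S_B @ w) / float(w @ S_W @ w)

def fisher_direction(X0, X1):
    """Closed-form two-class solution, normalized to unit length."""
    mu0, mu1 = X0.mean(axis=0), X1.mean(axis=0)
    _, S_W = scatter_matrices(X0, X1)
    w = np.linalg.solve(S_W, mu1 - mu0)
    return w / np.linalg.norm(w)

# Synthetic correlated Gaussian classes (hypothetical parameters).
rng = np.random.default_rng(0)
mix = np.array([[2.0, 0.0], [1.5, 0.5]])
X0 = rng.normal(size=(100, 2)) @ mix
X1 = rng.normal(size=(100, 2)) @ mix + np.array([1.0, 2.0])

S_B, S_W = scatter_matrices(X0, X1)
w_star = fisher_direction(X0, X1)
J_star = fisher_criterion(w_star, S_B, S_W)
# Baseline: project onto the raw difference of class means.
J_means = fisher_criterion(X1.mean(axis=0) - X0.mean(axis=0), S_B, S_W)
print(J_star, J_means)
```

The Fisher direction scores at least as high as the plain mean-difference direction because it also accounts for the within-class covariance.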

Naive Bayes¶

A classifier that combines Bayes' theorem with the assumption that features are conditionally independent given the class:

\[P(Y=k|X) \propto P(Y=k) \prod_{j=1}^{d} P(X_j|Y=k)\]

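Variant	Distributional assumption	Suitable data
Gaussian NB	$P(X_j \mid Y=k) \sim \mathcal{N}(\mu_{jk}, \sigma_{jk}^2)$	Continuous features
Multinomial NB	Multinomial distribution	Text (bag-of-words counts)
Bernoulli NB	Bernoulli distribution	Binary features
Complement NB	Complement-class statistics	Imbalanced text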

References:
- McCallum, A. & Nigam, K. (1998). "A Comparison of Event Models for Naive Bayes Text Classification". AAAI Workshop.

Support Vector Machine (SVM)¶

Finds the decision boundary that maximizes the margin:

Primal problem (soft margin):

\[\min_{w, b} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}\xi_i\]

\[\text{s.t. } y_i(w^T x_i + b) \geq 1 - \xi_i, \quad \xi_i \geq 0\]

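Dual problem (the form in which the kernel trick can be applied):

\[\max_{\alpha} \sum_{i=1}^{n}\alpha_i - \frac{1}{2}\sum_{i,j}\alpha_i \alpha_j y_i y_j K(x_i, x_j)\]

\[\text{s.t. } 0 \leq \alpha_i \leq C, \quad \sum_{i=1}^{n}\alpha_i y_i = 0\]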

Kernel	Formula	Characteristics
Linear	$K(x, x') = x^T x'$	Linearly separable data
Polynomial	$K(x, x') = (\gamma x^T x' + r)^d$	Polynomial decision boundaries
RBF (Gaussian)	$K(x, x') = \exp(-\gamma\|x-x'\|^2)$	General purpose, complex boundaries
Sigmoid	$K(x, x') = \tanh(\gamma x^T x' + r)$	Resembles a neural network
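The kernels in the table can be evaluated for whole datasets at once as Gram matrices. The sketch below implements the RBF and polynomial rows in NumPy; the function names, default parameter values, and the three toy points are assumptions for illustration.

```python
import numpy as np

def rbf_kernel(A, B, gamma=1.0):
    """Gram matrix with entries exp(-gamma * ||a - b||^2)."""
    sq_dists = (
        np.sum(A ** 2, axis=1)[:, None]
        + np.sum(B ** 2, axis=1)[None, :]
        - 2.0 * A @ B.T
    )
    # Clamp tiny negative values caused by floating-point round-off.
    return np.exp(-gamma * np.maximum(sq_dists, 0.0))

def polynomial_kernel(A, B, gamma=1.0, r=1.0, d=3):
    """Gram matrix with entries (gamma * a^T b + r)^d."""
    return (gamma * (A @ B.T) + r) ** d

# Three hypothetical 2-D points.
X = np.array([[0.0, 0.0], [1.0, 0.0], [0.0, 2.0]])

K_rbf = rbf_kernel(X, X, gamma=0.5)
K_poly = polynomial_kernel(X, X, gamma=1.0, r=1.0, d=2)
print(K_rbf)
print(K_poly)
```

A valid kernel yields a symmetric positive semidefinite Gram matrix, which is what allows the dual SVM problem to be solved as a convex program.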

References:
- Cortes, C. & Vapnik, V. (1995). "Support-Vector Networks". Machine Learning.
- Schölkopf, B. & Smola, A.J. (2002). "Learning with Kernels". MIT Press.

Tree-Based Models¶

Decision Tree¶

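Learns decision rules through recursive partitioning:

Splitting criteria:

Algorithm	Splitting criterion	Formula
ID3	Information gain	$IG = H(D) - \sum_{v}\frac{\lvert D_v \rvert}{\lvert D \rvert}H(D_v)$
C4.5	Gain ratio	$GR = \frac{IG}{SplitInfo}$
CART	Gini impurity	$Gini = 1 - \sum_{k=1}^{K}p_k^2$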

Entropy:

\[H(D) = -\sum_{k=1}^{K} p_k \log_2(p_k)\]

Gini impurity:

\[Gini(D) = 1 - \sum_{k=1}^{K} p_k^2 = \sum_{k=1}^{K} p_k(1-p_k)\]
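To make the impurity measures concrete, the sketch below computes entropy, Gini impurity, and the information gain of a single binary split. The function names and the four-sample label array are hypothetical, and the split is represented as a boolean mask rather than a feature threshold.

```python
import numpy as np

def class_probs(labels):
    """Empirical class proportions p_k of a label array."""
    _, counts = np.unique(labels, return_counts=True)
    return counts / counts.sum()

def entropy(labels):
    """H(D) = -sum_k p_k log2 p_k."""
    p = class_probs(labels)
    return float(-np.sum(p * np.log2(p)))

def gini(labels):
    """Gini(D) = 1 - sum_k p_k^2."""
    p = class_probs(labels)
    return float(1.0 - np.sum(p ** 2))

def information_gain(labels, mask):
    """Entropy reduction from splitting labels into mask / ~mask."""
    left, right = labels[mask], labels[~mask]
    n = len(labels)
    weighted = len(left) / n * entropy(left) + len(right) / n * entropy(right)
    return entropy(labels) - weighted

labels = np.array([0, 0, 1, 1])
perfect_split = np.array([True, True, False, False])   # separates the classes
useless_split = np.array([True, False, True, False])   # keeps a 50/50 mix

ig_perfect = information_gain(labels, perfect_split)
ig_useless = information_gain(labels, useless_split)
print(entropy(labels), gini(labels), ig_perfect, ig_useless)
```

A balanced two-class node has entropy 1 bit and Gini impurity 0.5. The perfect split recovers the full bit, while the useless split gains nothing.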

References:
- Quinlan, J.R. (1986). "Induction of Decision Trees". Machine Learning.
- Breiman, L. et al. (1984). "Classification and Regression Trees". Wadsworth.

Random Forest¶

An ensemble that combines bagging with random feature selection:

\[\hat{f}(x) = \frac{1}{B}\sum_{b=1}^{B} T_b(x)\]

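Key ideas:
1. Generate $B$ training sets by bootstrap sampling
2. At each split, consider only $m = \sqrt{p}$ randomly selected features (a common default for classification)
3. Grow deep trees without pruning
4. At prediction time, use majority voting (classification) or averaging (regression)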

Out-of-bag (OOB) error:

Each tree is validated on the roughly 37% of samples (about $1/e$) that were not drawn into its bootstrap sample.
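The 37% figure follows from the probability that a given sample is never drawn in $n$ draws with replacement, $(1 - 1/n)^n \to 1/e \approx 0.368$. The sketch below compares that formula with one simulated bootstrap sample; the sample size and seed are arbitrary choices.

```python
import numpy as np

def expected_oob_fraction(n):
    """Probability that one sample is missed by n draws with replacement."""
    return (1.0 - 1.0 / n) ** n

rng = np.random.default_rng(0)
n = 10_000

# Draw one bootstrap sample of indices and mark which rows were never picked.
bootstrap_idx = rng.integers(0, n, size=n)
oob_mask = np.ones(n, dtype=bool)
oob_mask[bootstrap_idx] = False

observed = float(oob_mask.mean())
print(expected_oob_fraction(n), observed)
```

Both numbers should land close to $1/e$, so each tree has a sizable held-out set without a separate validation split.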

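Variable importance:

\[VI_j = \frac{1}{B}\sum_{b=1}^{B}\sum_{t \in T_b} \Delta \hat{R}_t \cdot \mathbf{1}(j_t = j)\]

where $\Delta \hat{R}_t$ is the decrease in impurity achieved by the split at node $t$ and $j_t$ is the feature used for that split.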

References:
- Breiman, L. (2001). "Random Forests". Machine Learning, 45(1), 5-32.

Gradient Boosting¶

Adds weak learners sequentially, each one fitted to the residual errors of the current ensemble:

\[F_m(x) = F_{m-1}(x) + \eta \cdot h_m(x)\]

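Here $\eta$ is the learning rate, and $h_m$ is ideally the weak learner that most reduces the loss:

\[h_m = \arg\min_{h} \sum_{i=1}^{n} L(y_i, F_{m-1}(x_i) + h(x_i))\]

In practice this is approximated by fitting $h_m$ to the negative gradient of the loss (the pseudo-residuals) evaluated at $F_{m-1}$:

\[r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F = F_{m-1}}\]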
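For the squared loss $L = \frac{1}{2}(y - F)^2$, the negative gradient is the ordinary residual $y - F_{m-1}(x)$, so gradient boosting reduces to repeatedly fitting a weak learner to residuals. The sketch below illustrates the update $F_m = F_{m-1} + \eta h_m$ with one-split regression stumps on a single feature. It is a toy under stated assumptions, not how XGBoost or LightGBM are implemented; `fit_stump`, `gradient_boost`, and the sine data are hypothetical.

```python
import numpy as np

def fit_stump(x, r):
    """Fit a one-split regression stump to targets r on a 1-D feature x."""
    best = None
    for t in np.unique(x)[:-1]:          # skip the max so both sides are non-empty
        left = x <= t
        pred_l, pred_r = r[left].mean(), r[~left].mean()
        sse = np.sum((r[left] - pred_l) ** 2) + np.sum((r[~left] - pred_r) ** 2)
        if best is None or sse < best[0]:
            best = (sse, t, pred_l, pred_r)
    _, t, pred_l, pred_r = best
    return lambda z: np.where(z <= t, pred_l, pred_r)

def gradient_boost(x, y, n_rounds=50, eta=0.1):
    """Boosting for squared loss: each stump is fitted to the current residuals."""
    f0 = float(y.mean())                 # F_0 is the best constant predictor
    pred = np.full_like(y, f0, dtype=float)
    learners = []
    for _ in range(n_rounds):
        residual = y - pred              # negative gradient of 0.5 * (y - F)^2
        h = fit_stump(x, residual)
        pred = pred + eta * h(x)         # F_m = F_{m-1} + eta * h_m
        learners.append(h)
    return lambda z: f0 + eta * sum(h(z) for h in learners)

# Hypothetical 1-D regression problem.
x = np.linspace(0.0, 1.0, 40)
y = np.sin(2.0 * np.pi * x)

booster = gradient_boost(x, y)
mse_const = float(np.mean((y - y.mean()) ** 2))
mse_boost = float(np.mean((y - booster(x)) ** 2))
print(mse_const, mse_boost)
```

A smaller `eta` needs more rounds to reach the same training error, which is the usual trade-off between learning rate and number of boosting rounds.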

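Implementation	Key features	Strengths
XGBoost	Regularized objective, histogram-based split finding	Strong track record on tabular-data benchmarks
LightGBM	Leaf-wise growth, GOSS	Fast on large datasets
CatBoost	Ordered boosting, native categorical support	Handles categorical features automatically
HistGradientBoosting	Built into scikit-learn	Easy to use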

XGBoost objective function:

\[\mathcal{L}(\phi) = \sum_{i} l(y_i, \hat{y}_i) + \sum_{k} \Omega(f_k)\]

\[\Omega(f) = \gamma T + \frac{1}{2}\lambda \sum_{j=1}^{T} w_j^2\]

where $T$ is the number of leaves in the tree and $w_j$ is the weight (output value) of leaf $j$.

References:
- Friedman, J.H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine". Annals of Statistics.
- Chen, T. & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System". KDD.
- Ke, G. et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". NeurIPS.
- Prokhorenkova, L. et al. (2018). "CatBoost: Unbiased Boosting with Categorical Features". NeurIPS.

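Algorithm Comparison and Selection Guide¶

Algorithm	Training time complexity	Interpretability	Nonlinearity	Large-scale data	Missing-value handling
Logistic Regression	$O(npk)$	High	Low	Good	No
SVM (RBF)	$O(n^2p)$ ~ $O(n^3)$	Low	High	Limited	No
Decision Tree	$O(np \log n)$	High	Medium	Good	Yes (implementation-dependent)
Random Forest	$O(B \cdot n \sqrt{p} \log n)$	Medium	High	Good	Yes (implementation-dependent)
XGBoost	$O(nBp)$	Medium	High	Excellent	Yes
LightGBM	$O(nBp)$	Medium	High	Excellent	Yes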

Selection flowchart:

Start
│
├── Is interpretability important?
│   ├── Yes → Logistic Regression / Decision Tree
│   └── No →
│       │
│       ├── Dataset size < 10K?
│       │   ├── Yes → SVM (RBF), Random Forest
│       │   └── No →
│       │       │
│       │       ├── Dataset size < 100K?
│       │       │   └── XGBoost, Random Forest
│       │       │
│       │       └── Dataset size > 100K?
│       │           └── LightGBM, CatBoost
│       │
│       └── Many categorical features?
│           └── CatBoost

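Regression¶

Overview¶

Regression is the problem of predicting a continuous target variable $Y \in \mathbb{R}$. Under squared-error loss, the optimal prediction is the conditional mean: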

\[\hat{y} = \mathbb{E}[Y|X=x]\]

Algorithm Taxonomy¶

Regression Algorithms
├── Linear Models
│   ├── Ordinary Least Squares (OLS)
│   ├── Ridge Regression (L2)
│   ├── Lasso Regression (L1)
│   ├── Elastic Net
│   └── Bayesian Linear Regression
├── Non-linear Models
│   ├── Polynomial Regression
│   ├── Support Vector Regression (SVR)
│   ├── Kernel Ridge Regression
│   └── Gaussian Process Regression
├── Tree-based Models
│   ├── Decision Tree Regressor
│   ├── Random Forest Regressor
│   ├── Gradient Boosting Regressor
│   └── Extra Trees Regressor
├── Instance-based
│   └── k-Nearest Neighbors Regressor
└── Neural Networks
    ├── Multi-layer Perceptron
    └── Deep Learning Models

Linear Regression Models¶

Ordinary Least Squares (OLS)¶

\[\hat{w} = \arg\min_{w} \|y - Xw\|_2^2 = (X^T X)^{-1} X^T y\]

Assumptions (Gauss-Markov):
1. Linearity: $E[Y|X] = Xw$
2. Strict exogeneity: $E[\epsilon|X] = 0$
3. Spherical errors: $Var(\epsilon|X) = \sigma^2 I$
4. Full column rank: $rank(X) = p$

BLUE (Best Linear Unbiased Estimator):

Under the Gauss-Markov conditions, OLS has the smallest variance among all linear unbiased estimators.
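The closed-form OLS solution can be checked numerically. The sketch below generates synthetic data from hypothetical coefficients `w_true`, solves the normal equations, and compares the result with `np.linalg.lstsq`, which is generally the more numerically stable route; the sample size, noise level, and seed are arbitrary.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p = 200, 3
X = rng.normal(size=(n, p))
w_true = np.array([1.5, -2.0, 0.5])          # hypothetical ground-truth weights
y = X @ w_true + 0.1 * rng.normal(size=n)    # linear signal plus Gaussian noise

# Normal equations: solve (X^T X) w = X^T y without forming an explicit inverse.
w_normal = np.linalg.solve(X.T @ X, X.T @ y)

# Least-squares solver based on the SVD.
w_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# At the optimum the residual is orthogonal to every column of X.
residual = y - X @ w_normal
print(w_normal, w_lstsq, X.T @ residual)
```

Both routes agree when $X$ has full column rank. When columns are nearly collinear, $X^T X$ becomes ill-conditioned and the normal-equation route loses accuracy first.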

Ridge Regression (Tikhonov Regularization)¶

L2 regularization mitigates multicollinearity:

\[\hat{w} = \arg\min_{w} \|y - Xw\|_2^2 + \lambda\|w\|_2^2 = (X^T X + \lambda I)^{-1} X^T y\]

Geometric interpretation:
- Shrinks coefficients toward zero (shrinkage)
- Retains all features (coefficients are generally not exactly zero)
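The shrinkage effect is easy to observe with the closed-form ridge solution. The sketch below builds two nearly collinear features and tracks the coefficient norm as $\lambda$ grows; `ridge_fit`, the synthetic data, and the grid of $\lambda$ values are assumptions made for illustration.

```python
import numpy as np

def ridge_fit(X, y, lam):
    """Closed-form ridge solution (X^T X + lam * I)^{-1} X^T y."""
    p = X.shape[1]
    return np.linalg.solve(X.T @ X + lam * np.eye(p), X.T @ y)

# Two nearly collinear features (hypothetical data).
rng = np.random.default_rng(1)
x1 = rng.normal(size=100)
x2 = x1 + 0.01 * rng.normal(size=100)
X = np.column_stack([x1, x2])
y = x1 + x2 + 0.1 * rng.normal(size=100)

lams = [0.0, 1.0, 10.0, 100.0]
coefs = [ridge_fit(X, y, lam) for lam in lams]
norms = [float(np.linalg.norm(w)) for w in coefs]

for lam, w, nrm in zip(lams, coefs, norms):
    print(f"lambda={lam:6.1f}  w={w}  ||w||={nrm:.4f}")
```

With $\lambda = 0$ the solution is plain OLS, whose coefficients can be unstable under collinearity. Increasing $\lambda$ shrinks the coefficient norm monotonically while keeping both features in the model.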

Lasso Regression¶

L1 regularization induces a sparse solution:

\[\hat{w} = \arg\min_{w} \frac{1}{2n}\|y - Xw\|_2^2 + \lambda\|w\|_1\]

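Properties:
- Automatic variable selection (some coefficients become exactly zero)
- Produces an interpretable model
- Can be optimized efficiently with coordinate descent

References:
- Tibshirani, R. (1996). "Regression Shrinkage and Selection via the Lasso". JRSS-B.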
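Coordinate descent for the Lasso objective above updates one coefficient at a time with a soft-thresholding step, which is what produces exact zeros. The sketch below is a minimal, unoptimized version with a fixed number of sweeps and no convergence check; the function names, synthetic data, and the choice `lam=0.1` are assumptions.

```python
import numpy as np

def soft_threshold(rho, lam):
    """S(rho, lam) = sign(rho) * max(|rho| - lam, 0)."""
    return np.sign(rho) * np.maximum(np.abs(rho) - lam, 0.0)

def lasso_cd(X, y, lam, n_iters=200):
    """Minimize (1/2n) * ||y - Xw||^2 + lam * ||w||_1 by cyclic coordinate descent."""
    n, p = X.shape
    w = np.zeros(p)
    col_sq = (X ** 2).sum(axis=0) / n            # (1/n) * ||x_j||^2 for each column
    for _ in range(n_iters):
        for j in range(p):
            # Residual with feature j's current contribution added back.
            partial_resid = y - X @ w + X[:, j] * w[j]
            rho = X[:, j] @ partial_resid / n
            w[j] = soft_threshold(rho, lam) / col_sq[j]
    return w

# Hypothetical sparse ground truth: only features 0 and 3 matter.
rng = np.random.default_rng(0)
n = 200
X = rng.normal(size=(n, 5))
w_true = np.array([3.0, 0.0, 0.0, -2.0, 0.0])
y = X @ w_true + 0.1 * rng.normal(size=n)

w_hat = lasso_cd(X, y, lam=0.1)
print(w_hat)
```

The irrelevant coefficients are set exactly to zero, while the two active coefficients are recovered with a small bias toward zero of roughly `lam`.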

Elastic Net¶

Combines the L1 and L2 penalties:

\[\hat{w} = \arg\min_{w} \frac{1}{2n}\|y - Xw\|_2^2 + \lambda_1\|w\|_1 + \lambda_2\|w\|_2^2\]

Advantages:
- Selects groups of correlated features together
- Combines the variable selection of Lasso with the stability of Ridge

References:
- Zou, H. & Hastie, T. (2005). "Regularization and Variable Selection via the Elastic Net". JRSS-B.

Bayesian Linear Regression¶

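Places a prior distribution on the weights and estimates their posterior distribution:

Prior: $w \sim \mathcal{N}(0, \sigma_w^2 I)$

Posterior: $p(w|X, y) \propto p(y|X, w) p(w)$

MAP (maximum a posteriori) estimate, assuming Gaussian observation noise with variance $\sigma^2$:

\[\hat{w}_{MAP} = (X^T X + \frac{\sigma^2}{\sigma_w^2} I)^{-1} X^T y\]

This is identical to Ridge Regression with $\lambda = \frac{\sigma^2}{\sigma_w^2}$.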

Nonlinear Regression Models¶

Support Vector Regression (SVR)¶

Uses the $\epsilon$-insensitive loss function:

\[\min_{w, b, \xi, \xi^*} \frac{1}{2}\|w\|^2 + C\sum_{i=1}^{n}(\xi_i + \xi_i^*)\]

\[\begin{aligned}\text{s.t. } \quad & y_i - w^T\phi(x_i) - b \leq \epsilon + \xi_i \\ & w^T\phi(x_i) + b - y_i \leq \epsilon + \xi_i^* \\ & \xi_i, \xi_i^* \geq 0\end{aligned}\]

Gaussian Process Regression¶

A nonparametric Bayesian approach:

\[f(x) \sim \mathcal{GP}(m(x), k(x, x'))\]

Predictive distribution:

\[p(f_*|X, y, x_*) = \mathcal{N}(\bar{f}_*, \text{cov}(f_*))\]

\[\bar{f}_* = K_*^T(K + \sigma_n^2 I)^{-1}y\]

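Advantages:
- Built-in uncertainty estimates
- Different kernels express different classes of functions

Disadvantages:
- $O(n^3)$ computational complexity
- Unsuitable for large datasets without approximation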
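The predictive mean above, together with the standard predictive covariance $K_{**} - K_*^T (K + \sigma_n^2 I)^{-1} K_*$, can be computed with a Cholesky factorization, which is where the $O(n^3)$ cost comes from. The sketch below assumes a zero mean function and an RBF kernel with unit signal variance; the helper names, length scale, noise level, and data points are hypothetical.

```python
import numpy as np

def rbf(a, b, length_scale=1.0):
    """RBF kernel matrix for 1-D inputs a and b."""
    d2 = (a[:, None] - b[None, :]) ** 2
    return np.exp(-0.5 * d2 / length_scale ** 2)

def gp_posterior(x_train, y_train, x_test, noise_var=1e-2):
    """Posterior mean and covariance of a zero-mean GP at x_test."""
    K = rbf(x_train, x_train) + noise_var * np.eye(len(x_train))
    K_s = rbf(x_train, x_test)                  # shape (n_train, n_test)
    K_ss = rbf(x_test, x_test)
    L = np.linalg.cholesky(K)                   # the O(n^3) step
    alpha = np.linalg.solve(L.T, np.linalg.solve(L, y_train))
    mean = K_s.T @ alpha                        # K_*^T (K + sigma_n^2 I)^{-1} y
    v = np.linalg.solve(L, K_s)
    cov = K_ss - v.T @ v
    return mean, cov

x_train = np.array([-2.0, -1.0, 0.0, 1.0, 2.0])
y_train = np.sin(x_train)
x_test = np.array([0.0, 10.0])                  # one observed point, one far away

mean, cov = gp_posterior(x_train, y_train, x_test)
print(mean, np.diag(cov))
```

At the observed location the predictive variance collapses to below the noise level, while far from the data the prediction reverts to the prior (mean 0, variance 1). That behavior is the built-in uncertainty estimate mentioned above.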

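References:
- Rasmussen, C.E. & Williams, C.K.I. (2006). "Gaussian Processes for Machine Learning". MIT Press.

Tree-Based Regression¶

The structure is the same as for classification trees, but the splitting criterion and the prediction rule differ:

Splitting criterion: reduction in MSE (mean squared error), where $\bar{y}$ is the mean target value of the samples in the node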

\[MSE = \frac{1}{n}\sum_{i=1}^{n}(y_i - \bar{y})^2\]

Leaf prediction: the mean of the samples in the leaf region $R_m$

\[\hat{y}_{leaf} = \frac{1}{|R_m|}\sum_{x_i \in R_m} y_i\]

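Evaluation Metrics¶

Metric	Formula	Interpretation
MSE	$\frac{1}{n}\sum_i(y_i - \hat{y}_i)^2$	Mean squared error (sensitive to outliers)
RMSE	$\sqrt{MSE}$	Interpretable in the original units
MAE	$\frac{1}{n}\sum_i \lvert y_i - \hat{y}_i \rvert$	Mean absolute error (less sensitive to outliers)
MAPE	$\frac{100}{n}\sum_i \left\lvert \frac{y_i - \hat{y}_i}{y_i} \right\rvert$	Mean absolute percentage error (undefined when $y_i = 0$)
$R^2$	$1 - \frac{SS_{res}}{SS_{tot}}$	Proportion of variance explained
Adjusted $R^2$	$1 - (1-R^2)\frac{n-1}{n-p-1}$	Corrects for the number of features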
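All six metrics in the table can be computed from the residuals in a few lines. The sketch below is one possible helper; the function name, the dictionary keys, and the four-point example are assumptions, and MAPE is only meaningful here because no true value is zero.

```python
import numpy as np

def regression_metrics(y_true, y_pred, n_features):
    """Compute MSE, RMSE, MAE, MAPE (%), R^2 and adjusted R^2."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    n = len(y_true)
    resid = y_true - y_pred
    mse = float(np.mean(resid ** 2))
    ss_res = float(np.sum(resid ** 2))
    ss_tot = float(np.sum((y_true - y_true.mean()) ** 2))
    r2 = 1.0 - ss_res / ss_tot
    return {
        "mse": mse,
        "rmse": float(np.sqrt(mse)),
        "mae": float(np.mean(np.abs(resid))),
        "mape": float(100.0 * np.mean(np.abs(resid / y_true))),
        "r2": r2,
        "adj_r2": 1.0 - (1.0 - r2) * (n - 1) / (n - n_features - 1),
    }

# Hypothetical predictions from a one-feature model.
metrics = regression_metrics(
    y_true=[2.0, 4.0, 6.0, 8.0],
    y_pred=[2.5, 3.5, 6.5, 7.5],
    n_features=1,
)
print(metrics)
```

Every residual has magnitude 0.5, so MAE and RMSE coincide at 0.5, while MAPE weights the same absolute error more heavily for the smaller targets.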

Practical Application Guide¶

Data Preprocessing¶

import pandas as pd
import numpy as np
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.compose import ColumnTransformer
from sklearn.pipeline import Pipeline

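# Build the preprocessing pipeline (the column names are illustrative)
numeric_features = ['age', 'income', 'tenure']
categorical_features = ['gender', 'region', 'segment']

# Numeric columns: fill missing values with the median, then standardize
numeric_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
])

# Categorical columns: fill missing values with a placeholder, then one-hot encode
# (the sparse_output argument requires scikit-learn >= 1.2)
categorical_transformer = Pipeline(steps=[
    ('imputer', SimpleImputer(strategy='constant', fill_value='missing')),
    ('onehot', OneHotEncoder(handle_unknown='ignore', sparse_output=False))
])

# Apply each transformer to its own subset of columns
preprocessor = ColumnTransformer(
    transformers=[
        ('num', numeric_transformer, numeric_features),
        ('cat', categorical_transformer, categorical_features)
    ])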

Model Training and Evaluation¶

from sklearn.model_selection import cross_val_score, GridSearchCV
from sklearn.ensemble import RandomForestClassifier, GradientBoostingClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import classification_report, roc_auc_score
import lightgbm as lgb

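# Assumes X_train (a DataFrame containing the columns above) and y_train
# (binary labels) are already defined.

# Add the model to the pipeline
clf_pipeline = Pipeline(steps=[
    ('preprocessor', preprocessor),
    ('classifier', LogisticRegression(max_iter=1000))
])

# Cross-validation
cv_scores = cross_val_score(clf_pipeline, X_train, y_train, cv=5, scoring='roc_auc')
print(f"CV AUC: {cv_scores.mean():.4f} (+/- {cv_scores.std():.4f})")

# Hyperparameter tuning (the saga solver supports both L1 and L2 penalties)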
param_grid = {
    'classifier__C': [0.01, 0.1, 1, 10],
    'classifier__penalty': ['l1', 'l2'],
    'classifier__solver': ['saga']
}

grid_search = GridSearchCV(clf_pipeline, param_grid, cv=5, scoring='roc_auc', n_jobs=-1)
grid_search.fit(X_train, y_train)

print(f"Best params: {grid_search.best_params_}")
print(f"Best CV score: {grid_search.best_score_:.4f}")

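LightGBM Optimization Example¶

import lightgbm as lgb
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split
import optuna

# Assumes X_train and y_train are already defined and in a form LightGBM accepts.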

def objective(trial):
    params = {
        'objective': 'binary',
        'metric': 'auc',
        'boosting_type': 'gbdt',
        'num_leaves': trial.suggest_int('num_leaves', 20, 300),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'feature_fraction': trial.suggest_float('feature_fraction', 0.5, 1.0),
        'bagging_fraction': trial.suggest_float('bagging_fraction', 0.5, 1.0),
        'bagging_freq': trial.suggest_int('bagging_freq', 1, 7),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'lambda_l1': trial.suggest_float('lambda_l1', 1e-8, 10.0, log=True),
        'lambda_l2': trial.suggest_float('lambda_l2', 1e-8, 10.0, log=True),
        'verbosity': -1,
        'random_state': 42
    }

    X_tr, X_val, y_tr, y_val = train_test_split(X_train, y_train, test_size=0.2, random_state=42)

    train_data = lgb.Dataset(X_tr, label=y_tr)
    valid_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

    model = lgb.train(
        params,
        train_data,
        num_boost_round=1000,
        valid_sets=[valid_data],
        callbacks=[lgb.early_stopping(50), lgb.log_evaluation(0)]
    )

    y_pred = model.predict(X_val)
    auc = roc_auc_score(y_val, y_pred)

    return auc

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, timeout=3600)

print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

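Model Interpretation¶

import shap

# Assumes `model` is a fitted tree-based model (for example, a LightGBM booster
# retrained with study.best_params), X_test is a pandas DataFrame, and
# feature_names = list(X_test.columns).

# Compute SHAP values
explainer = shap.TreeExplainer(model)
shap_values = explainer.shap_values(X_test)
# Note: for binary classifiers, some shap versions return one array per class.
# In that case, select the positive class first, e.g. shap_values = shap_values[1].

# Summary plot
shap.summary_plot(shap_values, X_test, feature_names=feature_names)

# Dependence plot
shap.dependence_plot('income', shap_values, X_test, feature_names=feature_names)

# Explain a single prediction
shap.force_plot(explainer.expected_value, shap_values[0], X_test.iloc[0], feature_names=feature_names)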

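Sub-documents¶

Classification¶

Technique	Description	Link
Logistic Regression	The baseline linear classifier, outputs probabilities	logistic-regression.md
Decision Tree	Rule-based classification, easy to interpret	decision-tree.md
Random Forest	Bagging ensemble, general purpose	random-forest.md
XGBoost	Gradient boosting, high performance	xgboost.md
LightGBM	Large datasets, fast training	lightgbm.md

Regression¶

Technique	Description	Link
Linear Regression	The baseline linear regression model	linear-regression.md
Ridge/Lasso	Regularized regression	ridge-lasso.md
SVR	Support vector regression	svr.md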

References¶

Textbooks¶

Hastie, T., Tibshirani, R., & Friedman, J. (2009). "The Elements of Statistical Learning". Springer.
Bishop, C.M. (2006). "Pattern Recognition and Machine Learning". Springer.
Murphy, K.P. (2012). "Machine Learning: A Probabilistic Perspective". MIT Press.

Key Papers¶

Breiman, L. (2001). "Random Forests". Machine Learning, 45(1), 5-32.
Chen, T. & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System". KDD.
Ke, G. et al. (2017). "LightGBM: A Highly Efficient Gradient Boosting Decision Tree". NeurIPS.
Lundberg, S.M. & Lee, S.I. (2017). "A Unified Approach to Interpreting Model Predictions". NeurIPS.

Practical Guides¶

scikit-learn Documentation: https://scikit-learn.org/stable/
XGBoost Documentation: https://xgboost.readthedocs.io/
LightGBM Documentation: https://lightgbm.readthedocs.io/
