Boosting¶

논문 정보¶

모델	논문	저자	연도
AdaBoost	A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting	Yoav Freund, Robert Schapire	1997
Gradient Boosting	Greedy Function Approximation: A Gradient Boosting Machine	Jerome Friedman	2001

개요¶

핵심 아이디어¶

"약한 학습기를 순차적으로 결합하여 강한 학습기를 만든다"

이전 모델이 틀린 샘플에 집중
순차적 학습으로 편향(Bias) 감소

Bagging vs Boosting¶

특성	Bagging	Boosting
학습 방식	병렬 (독립)	순차 (의존)
샘플 가중치	동일	오분류 샘플에 높음
목표	Variance 감소	Bias 감소
과적합 위험	낮음	있음
기반 모델	복잡한 모델	단순한 모델 (Weak Learner)

AdaBoost (Adaptive Boosting)¶

알고리즘¶

Algorithm: AdaBoost

Input: 학습 데이터 {(x_i, y_i)}, y_i in {-1, +1}, 반복 횟수 T
Output: 최종 분류기 H(x)

# 초기화: 모든 샘플 동일 가중치
w_i = 1/n for all i

for t = 1 to T:
    # 1. 가중치 분포로 약한 학습기 학습
    h_t = train_weak_learner(X, y, w)

    # 2. 가중 오분류율 계산
    err_t = sum(w_i * I(y_i != h_t(x_i))) / sum(w_i)

    # 3. 학습기 가중치 계산
    alpha_t = 0.5 * ln((1 - err_t) / err_t)

    # 4. 샘플 가중치 업데이트
    w_i = w_i * exp(-alpha_t * y_i * h_t(x_i))

    # 5. 가중치 정규화
    w_i = w_i / sum(w)

# 최종 분류기
H(x) = sign(sum_{t=1}^{T} alpha_t * h_t(x))

핵심 수식¶

학습기 가중치:

\[\alpha_t = \frac{1}{2} \ln\left(\frac{1 - \epsilon_t}{\epsilon_t}\right)\]

$\epsilon_t < 0.5$: 양수 가중치 (정확한 분류기)
$\epsilon_t = 0.5$: 가중치 0 (랜덤 추측)
$\epsilon_t > 0.5$: 음수 가중치 (반대로 사용)

샘플 가중치 업데이트:

\[w_i^{(t+1)} = w_i^{(t)} \cdot e^{-\alpha_t y_i h_t(x_i)}\]

맞춘 샘플: 가중치 감소 ($y_i h_t(x_i) = +1$)
틀린 샘플: 가중치 증가 ($y_i h_t(x_i) = -1$)

시각화¶

Round 1:                Round 2:                Round 3:
 o o o o o               o o o o o               o o o o o
 o o x o o   ────►      o o X o o   ────►       o o X o o
 o x x x o              o X X X o               o X X X o
 o o o o o              o o o o o               o o o o o

x: 오분류              X: 가중치 증가          최종: 세 경계 결합

scikit-learn 구현¶

from sklearn.ensemble import AdaBoostClassifier
from sklearn.tree import DecisionTreeClassifier
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split, cross_val_score
import matplotlib.pyplot as plt
import numpy as np

# 데이터
X, y = make_classification(n_samples=1000, n_features=20, n_informative=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# AdaBoost with Decision Stump
ada = AdaBoostClassifier(
    estimator=DecisionTreeClassifier(max_depth=1),  # Stump
    n_estimators=100,
    learning_rate=1.0,
    algorithm='SAMME',  # 'SAMME.R' for probability-based
    random_state=42
)

ada.fit(X_train, y_train)

print(f"Train Accuracy: {ada.score(X_train, y_train):.4f}")
print(f"Test Accuracy: {ada.score(X_test, y_test):.4f}")

# Staged prediction (각 라운드별 성능)
train_errors = []
test_errors = []

for y_pred_train, y_pred_test in zip(
    ada.staged_predict(X_train), 
    ada.staged_predict(X_test)
):
    train_errors.append(1 - np.mean(y_pred_train == y_train))
    test_errors.append(1 - np.mean(y_pred_test == y_test))

plt.figure(figsize=(10, 5))
plt.plot(train_errors, label='Train Error')
plt.plot(test_errors, label='Test Error')
plt.xlabel('Number of Estimators')
plt.ylabel('Error Rate')
plt.legend()
plt.title('AdaBoost Learning Curve')
plt.savefig('adaboost_curve.png', dpi=150)
plt.show()

Gradient Boosting¶

핵심 아이디어¶

손실 함수의 기울기(Gradient) 방향으로 모델을 순차 개선

일반적인 함수 근사:

\[F(x) = \sum_{m=1}^{M} \gamma_m h_m(x)\]

각 단계에서 잔차(Residual) 를 학습:

\[r_{im} = -\left[\frac{\partial L(y_i, F(x_i))}{\partial F(x_i)}\right]_{F=F_{m-1}}\]

알고리즘¶

Algorithm: Gradient Boosting

Input: 학습 데이터, 손실 함수 L, 반복 횟수 M
Output: 최종 모델 F_M(x)

# 초기화
F_0(x) = argmin_c sum L(y_i, c)

for m = 1 to M:
    # 1. Pseudo-residuals 계산
    r_im = -[dL(y_i, F(x_i)) / dF(x_i)] at F=F_{m-1}

    # 2. 잔차에 대해 기반 학습기 학습
    h_m = fit_base_learner(X, r)

    # 3. 최적 step size 계산 (line search)
    gamma_m = argmin_gamma sum L(y_i, F_{m-1}(x_i) + gamma * h_m(x_i))

    # 4. 모델 업데이트
    F_m(x) = F_{m-1}(x) + learning_rate * gamma_m * h_m(x)

return F_M

손실 함수별 Pseudo-Residual¶

손실 함수	수식	Pseudo-Residual
MSE (회귀)	$\frac{1}{2}(y - F)^2$	$y - F$ (실제 잔차)
MAE (회귀)	$	y - F
Log Loss (분류)	$-[y\log p + (1-y)\log(1-p)]$	$y - p$

scikit-learn 구현¶

from sklearn.ensemble import GradientBoostingClassifier, GradientBoostingRegressor
from sklearn.datasets import load_breast_cancer, fetch_california_housing
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, mean_squared_error
import numpy as np

# === 분류 ===
X, y = load_breast_cancer(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gb_clf = GradientBoostingClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=3,
    min_samples_split=2,
    min_samples_leaf=1,
    subsample=0.8,          # Stochastic Gradient Boosting
    max_features='sqrt',
    random_state=42
)

gb_clf.fit(X_train, y_train)
print(f"GBM Classification Accuracy: {gb_clf.score(X_test, y_test):.4f}")

# === 회귀 ===
X, y = fetch_california_housing(return_X_y=True)
X, y = X[:5000], y[:5000]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

gb_reg = GradientBoostingRegressor(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=4,
    subsample=0.8,
    loss='squared_error',  # 'absolute_error', 'huber', 'quantile'
    random_state=42
)

gb_reg.fit(X_train, y_train)
y_pred = gb_reg.predict(X_test)
print(f"GBM Regression RMSE: {np.sqrt(mean_squared_error(y_test, y_pred)):.4f}")

하이퍼파라미터 가이드¶

주요 파라미터¶

파라미터	설명	권장 범위
n_estimators	부스팅 라운드 수	100 ~ 1000+
learning_rate	각 트리의 기여도	0.01 ~ 0.3
max_depth	트리 최대 깊이	3 ~ 8
subsample	샘플 샘플링 비율	0.5 ~ 1.0
min_samples_split	분할 최소 샘플	2 ~ 20
min_samples_leaf	리프 최소 샘플	1 ~ 10

n_estimators vs learning_rate¶

두 파라미터는 트레이드오프 관계:

n_estimators * learning_rate ≈ constant (총 학습량)

- 낮은 learning_rate + 높은 n_estimators: 느리지만 안정적
- 높은 learning_rate + 낮은 n_estimators: 빠르지만 불안정

# 권장: 낮은 learning_rate, 많은 n_estimators
gb = GradientBoostingClassifier(
    n_estimators=500,
    learning_rate=0.05,
    max_depth=4
)

Early Stopping¶

from sklearn.model_selection import train_test_split

# 검증 세트 분리
X_train_sub, X_val, y_train_sub, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

gb = GradientBoostingClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    max_depth=4,
    validation_fraction=0.2,
    n_iter_no_change=10,  # Early stopping patience
    tol=1e-4,
    random_state=42
)

gb.fit(X_train, y_train)
print(f"Best n_estimators: {gb.n_estimators_}")

XGBoost와 LightGBM¶

Gradient Boosting의 발전된 구현:

특성	sklearn GBM	XGBoost	LightGBM
속도	느림	빠름	가장 빠름
메모리	높음	중간	낮음
정규화	제한적	L1, L2	L1, L2
결측치	전처리 필요	자동 처리	자동 처리
GPU	미지원	지원	지원
범주형	인코딩 필요	인코딩 필요	직접 지원

자세한 내용: - XGBoost - LightGBM

AdaBoost vs Gradient Boosting¶

특성	AdaBoost	Gradient Boosting
가중치 대상	샘플	잔차 (기울기)
손실 함수	지수 손실 고정	자유 선택
유연성	낮음	높음
이상치 민감도	높음	손실 함수에 따름
일반적 성능	보통	우수

언제 쓰나?¶

적합한 상황: - 테이블 데이터 (정형 데이터) - 높은 예측 성능 필요 - 특성 중요도 분석 - 대회 / 프로덕션 모델

주의 사항: - 과적합 위험 (regularization 필요) - 학습 속도 느림 (순차 학습) - 이상치에 민감할 수 있음

참고¶

Freund, Y. & Schapire, R.E. (1997). "A Decision-Theoretic Generalization of On-Line Learning and an Application to Boosting"
Friedman, J.H. (2001). "Greedy Function Approximation: A Gradient Boosting Machine"
Chen, T. & Guestrin, C. (2016). "XGBoost: A Scalable Tree Boosting System"
scikit-learn Gradient Boosting: https://scikit-learn.org/stable/modules/ensemble.html#gradient-boosting

주제	링크
앙상블 개요	README.md
Bagging	bagging.md
Stacking	stacking.md
XGBoost	../supervised/classification/xgboost.md
LightGBM	../supervised/classification/lightgbm.md
Random Forest	../supervised/classification/random-forest.md

손실 함수	수식	Pseudo-Residual
MSE (회귀)	\(\frac{1}{2}(y - F)^2\)	\(y - F\) (실제 잔차)
MAE (회귀)	$	y - F
Log Loss (분류)	\(-[y\log p + (1-y)\log(1-p)]\)	\(y - p\)

Boosting¶

논문 정보¶

개요¶

핵심 아이디어¶

Bagging vs Boosting¶

AdaBoost (Adaptive Boosting)¶

알고리즘¶

핵심 수식¶

시각화¶

scikit-learn 구현¶

Gradient Boosting¶

핵심 아이디어¶

알고리즘¶

손실 함수별 Pseudo-Residual¶

scikit-learn 구현¶

하이퍼파라미터 가이드¶

주요 파라미터¶

n_estimators vs learning_rate¶

Early Stopping¶

XGBoost와 LightGBM¶

AdaBoost vs Gradient Boosting¶

언제 쓰나?¶

관련 문서¶

참고¶