XGBoost (eXtreme Gradient Boosting)

Paper Information

Item	Details
Title	XGBoost: A Scalable Tree Boosting System
Authors	Tianqi Chen, Carlos Guestrin
Venue	KDD (ACM SIGKDD)
Year	2016
Link	https://arxiv.org/abs/1603.02754

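Overview

Problem Definition

XGBoost addresses the following limitations of earlier gradient boosting implementations:

Computational efficiency: training is slow on large datasets
Memory efficiency: the entire dataset must be loaded into memory
Regularization: few built-in mechanisms to prevent overfitting
Missing values: a separate preprocessing step is required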

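Key Ideas

Regularized objective: L1/L2 regularization controls model complexity
Second-order approximation: uses a second-order Taylor expansion of the loss function
Sparsity-aware algorithm: handles missing values and sparse data efficiently
System optimization: cache-aware access, parallel/distributed processing, out-of-core learning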

Algorithm and Formulas

Objective Function

Obj = sum_{i=1}^{n} L(y_i, y_hat_i) + sum_{t=1}^{T} Omega(f_t)

where:
- L: loss function (e.g., MSE, log loss)
- Omega: regularization term
- T: number of trees

Regularization term:

Omega(f) = gamma * |leaves| + (1/2) * lambda * sum_{j=1}^{|leaves|} w_j^2

gamma: penalty on the number of leaves
lambda: L2 regularization on the leaf weights
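
As a quick illustration of the regularization term, the sketch below evaluates Omega for a hand-written list of leaf weights. The function name `omega` and the example weights are hypothetical and chosen only for illustration.

```python
def omega(leaf_weights, gamma, lam):
    """Omega(f) = gamma * |leaves| + (1/2) * lambda * sum_j w_j^2."""
    n_leaves = len(leaf_weights)
    l2_term = sum(w * w for w in leaf_weights)
    return gamma * n_leaves + 0.5 * lam * l2_term


# A hypothetical tree with three leaves.
weights = [0.5, -0.2, 0.1]
print(omega(weights, gamma=1.0, lam=1.0))
```

Raising gamma penalizes every extra leaf by a fixed amount, while raising lambda penalizes large leaf weights.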

Additive Training

At iteration t:

y_hat_i^(t) = y_hat_i^(t-1) + f_t(x_i)
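
The update rule can be sketched in a few lines. This is only an illustration: the "trees" are hypothetical plain functions of a single feature, and `predict_additive` and `base_score` are names introduced here, not part of any library.

```python
def predict_additive(trees, x, base_score=0.0):
    """Return the running prediction after each boosting round.

    `trees` is a list of callables f_t(x); each round adds one of them
    to the previous prediction: y_hat^(t) = y_hat^(t-1) + f_t(x).
    """
    y_hat = base_score
    history = []
    for f_t in trees:
        y_hat = y_hat + f_t(x)
        history.append(y_hat)
    return history


# Two hypothetical "trees" written as plain functions of a single feature.
trees = [
    lambda x: 0.5 if x < 1.0 else -0.5,  # f_1: a depth-1 stump
    lambda x: 0.25,                      # f_2: a constant correction
]
print(predict_additive(trees, x=0.0))  # prints [0.5, 0.75]
```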

Approximating the objective (second-order Taylor expansion):

Obj^(t) ≈ sum_{i=1}^{n} [g_i * f_t(x_i) + (1/2) * h_i * f_t(x_i)^2] + Omega(f_t) + constant

where:
- g_i = dL/d(y_hat_i^(t-1)): first-order gradient of the loss
- h_i = d^2L/d(y_hat_i^(t-1))^2: second-order gradient (Hessian) of the loss
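
To make g_i and h_i concrete, here is a minimal sketch for two common losses. It assumes the squared error is scaled as L = 0.5 * (y - y_hat)^2 and that the log loss is differentiated with respect to the raw score (margin) before the sigmoid; both conventions are assumptions of this sketch. The analytic values are checked against finite differences.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def squared_error_grad_hess(y, y_hat):
    # Assumes L = 0.5 * (y - y_hat)^2, so g = y_hat - y and h = 1.
    return y_hat - y, 1.0


def logloss(y, margin):
    p = sigmoid(margin)
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))


def logloss_grad_hess(y, margin):
    # Derivatives with respect to the raw score: g = p - y, h = p * (1 - p).
    p = sigmoid(margin)
    return p - y, p * (1.0 - p)


def numeric_derivative(f, x, eps=1e-5):
    # Central finite difference, used only as a sanity check.
    return (f(x + eps) - f(x - eps)) / (2 * eps)


y, margin = 1, 0.3
g, h = logloss_grad_hess(y, margin)
g_check = numeric_derivative(lambda m: logloss(y, m), margin)
h_check = numeric_derivative(lambda m: logloss_grad_hess(y, m)[0], margin)
print(abs(g - g_check) < 1e-6, abs(h - h_check) < 1e-6)
```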

Optimal Leaf Weights and Split Score

Optimal weight of leaf j:

w_j* = -G_j / (H_j + lambda)

where:
- G_j = sum_{i in leaf_j} g_i
- H_j = sum_{i in leaf_j} h_i

Optimal objective value (measures the quality of a tree structure; lower is better):

Obj* = -(1/2) * sum_{j=1}^{|leaves|} G_j^2 / (H_j + lambda) + gamma * |leaves|
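
The two formulas can be evaluated directly from per-leaf gradient sums. The following is a small sketch with hypothetical (G_j, H_j) values; the function names are introduced here for illustration only.

```python
def leaf_weight(G, H, lam):
    """w* = -G / (H + lambda)."""
    return -G / (H + lam)


def structure_score(leaf_stats, lam, gamma):
    """Obj* = -(1/2) * sum_j G_j^2 / (H_j + lambda) + gamma * |leaves|."""
    quality = sum(G * G / (H + lam) for G, H in leaf_stats)
    return -0.5 * quality + gamma * len(leaf_stats)


# Hypothetical (G_j, H_j) sums for a tree with two leaves.
leaf_stats = [(-4.0, 3.0), (2.0, 1.0)]
weights = [leaf_weight(G, H, lam=1.0) for G, H in leaf_stats]
score = structure_score(leaf_stats, lam=1.0, gamma=0.5)
print(weights, score)  # prints [1.0, -1.0] -2.0
```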

Split Finding

Split gain:

Gain = (1/2) * [G_L^2/(H_L + lambda) + G_R^2/(H_R + lambda) - (G_L + G_R)^2/(H_L + H_R + lambda)] - gamma

If Gain > 0, the split is made; otherwise the node remains a leaf.
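
A minimal sketch of exact greedy split finding on a single feature, using the gain formula above. It is an illustration under simplifying assumptions (one dense feature, no missing values, midpoint thresholds); `best_split` is a name introduced here and not a library function.

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    """Gain of splitting a node into left/right children."""
    def score(G, H):
        return G * G / (H + lam)

    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma


def best_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Scan the sorted feature values and return (best_gain, threshold).

    Returns (0.0, None) when no candidate has positive gain,
    meaning the node stays a leaf.
    """
    order = sorted(range(len(values)), key=lambda i: values[i])
    G, H = sum(grads), sum(hess)
    G_L = H_L = 0.0
    best_gain, best_thr = 0.0, None
    for pos in range(len(order) - 1):
        i, nxt = order[pos], order[pos + 1]
        G_L += grads[i]
        H_L += hess[i]
        if values[i] == values[nxt]:
            continue  # cannot place a threshold between equal values
        gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if gain > best_gain:
            best_gain = gain
            best_thr = 0.5 * (values[i] + values[nxt])
    return best_gain, best_thr


values = [1.0, 2.0, 3.0, 4.0]
grads = [-1.0, -1.0, 1.0, 1.0]
hess = [1.0, 1.0, 1.0, 1.0]
print(best_split(values, grads, hess, lam=1.0, gamma=0.0))
print(best_split(values, grads, hess, lam=1.0, gamma=2.0))  # gamma too large: no split
```

With gamma = 0 the scan picks the threshold 2.5, which separates the negative from the positive gradients; with gamma = 2 every candidate gain is negative and the node is kept as a leaf.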

Handling Missing Values

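Algorithm: Sparsity-aware Split Finding

for each feature:
    # Only non-missing entries are enumerated as split candidates.

    # Case 1: send missing values to the left child
    Gain_left = calculate_gain(missing -> left)

    # Case 2: send missing values to the right child
    Gain_right = calculate_gain(missing -> right)

    # Keep the direction that yields the higher gain
    default_direction = argmax(Gain_left, Gain_right)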
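
The pseudocode above can be turned into a small runnable sketch. This version is a simplification, not the paper's exact procedure: it represents missing entries as `None`, makes one pass over the sorted non-missing values, evaluates both default directions at each threshold, and does not consider the split that separates all present values from the missing ones. The name `sparse_best_split` is hypothetical.

```python
def _score(G, H, lam):
    return G * G / (H + lam)


def sparse_best_split(values, grads, hess, lam=1.0, gamma=0.0):
    """Return (gain, threshold, default_direction) for one feature.

    `values` may contain None for missing entries. Returns
    (0.0, None, None) when no split has positive gain.
    """
    present = sorted(
        (i for i, v in enumerate(values) if v is not None),
        key=lambda i: values[i],
    )
    G, H = sum(grads), sum(hess)
    G_miss = sum(g for g, v in zip(grads, values) if v is None)
    H_miss = sum(h for h, v in zip(hess, values) if v is None)
    parent = _score(G, H, lam)

    best = (0.0, None, None)
    G_acc = H_acc = 0.0  # sums over present rows at or below the threshold
    for pos in range(len(present) - 1):
        i, nxt = present[pos], present[pos + 1]
        G_acc += grads[i]
        H_acc += hess[i]
        if values[i] == values[nxt]:
            continue
        thr = 0.5 * (values[i] + values[nxt])
        candidates = (
            ("left", G_acc + G_miss, H_acc + H_miss),  # missing -> left
            ("right", G_acc, H_acc),                   # missing -> right
        )
        for direction, G_L, H_L in candidates:
            children = _score(G_L, H_L, lam) + _score(G - G_L, H - H_L, lam)
            gain = 0.5 * (children - parent) - gamma
            if gain > best[0]:
                best = (gain, thr, direction)
    return best


values = [1.0, 2.0, 3.0, 4.0, None, None]
hess = [1.0] * 6
# Missing rows have positive gradients, like the rows with large values.
print(sparse_best_split(values, [-1.0, -1.0, 1.0, 1.0, 1.0, 1.0], hess))
# Missing rows have negative gradients, like the rows with small values.
print(sparse_best_split(values, [-1.0, -1.0, 1.0, 1.0, -1.0, -1.0], hess))
```

In the first call the learned default direction is "right", in the second it is "left"; in both cases the threshold is 2.5.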

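Time Complexity

Step	Complexity
Exact greedy	O(D * n * m * log(n)) per tree when features are re-sorted at each level; O(D * n * m) per tree after a one-time O(n * m * log(n)) pre-sort
Histogram-based	O(D * n * m) per tree to build histograms, plus O(m * bins) per node to scan candidate splits

Here n is the number of samples, m the number of features, and D the maximum tree depth. For sparse data, n * m is replaced by the number of non-missing entries.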
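
To show where the histogram method saves work, the sketch below buckets one feature into a fixed number of bins, accumulates gradient statistics per bin, and scans only the bin boundaries. It uses equal-width bins purely for simplicity; that choice, and the name `histogram_best_split`, are assumptions of this sketch (the paper proposes quantile-based candidates instead).

```python
def split_gain(G_L, H_L, G_R, H_R, lam, gamma):
    def score(G, H):
        return G * G / (H + lam)

    parent = score(G_L + G_R, H_L + H_R)
    return 0.5 * (score(G_L, H_L) + score(G_R, H_R) - parent) - gamma


def histogram_best_split(values, grads, hess, n_bins=4, lam=1.0, gamma=0.0):
    """Return (best_gain, threshold) using per-bin gradient histograms."""
    lo, hi = min(values), max(values)
    width = (hi - lo) / n_bins
    if width == 0:
        return 0.0, None  # constant feature: nothing to split on

    # One pass over the data: O(n) to fill the histograms.
    G_bins = [0.0] * n_bins
    H_bins = [0.0] * n_bins
    for v, g, h in zip(values, grads, hess):
        b = min(int((v - lo) / width), n_bins - 1)
        G_bins[b] += g
        H_bins[b] += h

    # One pass over the bins: O(n_bins) candidate thresholds.
    G, H = sum(G_bins), sum(H_bins)
    G_L = H_L = 0.0
    best_gain, best_thr = 0.0, None
    for b in range(n_bins - 1):
        G_L += G_bins[b]
        H_L += H_bins[b]
        if H_L == 0 or H - H_L == 0:
            continue  # one side would be empty
        gain = split_gain(G_L, H_L, G - G_L, H - H_L, lam, gamma)
        if gain > best_gain:
            best_gain, best_thr = gain, lo + (b + 1) * width
    return best_gain, best_thr


values = [1.0, 2.0, 3.0, 4.0]
grads = [-1.0, -1.0, 1.0, 1.0]
hess = [1.0, 1.0, 1.0, 1.0]
print(histogram_best_split(values, grads, hess, n_bins=4))
```

The number of candidate thresholds is bounded by the number of bins rather than by the number of distinct feature values, which is the source of the speed-up on large datasets.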

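Hyperparameter Guide

Core Parameters

Parameter	Description	Recommended range	Default
n_estimators	Number of boosting rounds	100 ~ 1000	100
learning_rate (eta)	Learning rate (shrinkage)	0.01 ~ 0.3	0.3
max_depth	Maximum tree depth	3 ~ 10	6
min_child_weight	Minimum sum of Hessians in a child node	1 ~ 10	1
subsample	Row sampling ratio	0.5 ~ 1.0	1.0
colsample_bytree	Column sampling ratio per tree	0.5 ~ 1.0	1.0
colsample_bylevel	Column sampling ratio per level	0.5 ~ 1.0	1.0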

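Regularization Parameters

Parameter	Description	Recommended range	Default
gamma	Minimum loss reduction required to make a split	0 ~ 5	0
lambda (reg_lambda)	L2 regularization on leaf weights	0 ~ 10	1
alpha (reg_alpha)	L1 regularization on leaf weights	0 ~ 10	0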
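
The leaf-weight formula earlier in this document covers only lambda. One common way to add an L1 penalty is to soft-threshold the gradient sum before dividing. The sketch below shows that rule; it is an assumption about how alpha enters the leaf weight, introduced here for illustration and not taken from the paper.

```python
def leaf_weight_l1(G, H, lam, alpha):
    """Leaf weight with L2 (lam) and L1 (alpha) penalties.

    Soft-thresholding: gradient sums with |G| <= alpha give a zero weight.
    With alpha = 0 this reduces to w* = -G / (H + lam).
    """
    if G > alpha:
        shrunk = G - alpha
    elif G < -alpha:
        shrunk = G + alpha
    else:
        return 0.0
    return -shrunk / (H + lam)


for alpha in (0.0, 1.0, 5.0):
    print(alpha, leaf_weight_l1(G=3.0, H=2.0, lam=1.0, alpha=alpha))
```

Under this rule, lambda shrinks every leaf weight smoothly toward zero, while alpha sets weights with a small gradient sum exactly to zero.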

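Tuning Strategy

Step 1: Fix the learning rate and determine the number of trees
   - learning_rate = 0.1
   - Use early stopping to find the best n_estimators

Step 2: Tune the tree-structure parameters
   - max_depth: [3, 5, 7, 9]
   - min_child_weight: [1, 3, 5]

Step 3: Tune the sampling parameters
   - subsample: [0.6, 0.8, 1.0]
   - colsample_bytree: [0.6, 0.8, 1.0]

Step 4: Tune the regularization parameters
   - gamma: [0, 0.1, 0.2]
   - lambda: [0, 1, 5]

Step 5: Lower the learning rate and increase the number of trees
   - learning_rate = 0.01 ~ 0.05
   - Increase n_estimators in proportion to the reduction in learning rate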
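
Steps 2 to 4 amount to a staged search: tune one parameter group at a time while keeping the earlier winners fixed. The sketch below shows only that control flow. `staged_search` is a hypothetical helper, and `toy_score` is a stand-in for a real cross-validated score (for example, mean ROC-AUC of a model trained with the candidate parameters).

```python
from itertools import product


def staged_search(score_fn, base_params, stages):
    """Tune one parameter group per stage, keeping earlier winners fixed."""
    best = dict(base_params)
    best_score = score_fn(best)
    for grid in stages:
        names = list(grid)
        for combo in product(*(grid[name] for name in names)):
            candidate = {**best, **dict(zip(names, combo))}
            score = score_fn(candidate)
            if score > best_score:
                best, best_score = candidate, score
    return best, best_score


def toy_score(p):
    # Stand-in for cross-validation: peaks at depth 5, subsample 0.8, gamma 0.1.
    return (-abs(p["max_depth"] - 5)
            - abs(p["subsample"] - 0.8)
            - abs(p["gamma"] - 0.1))


base = {"learning_rate": 0.1, "max_depth": 6, "min_child_weight": 1,
        "subsample": 1.0, "colsample_bytree": 1.0, "gamma": 0, "reg_lambda": 1}
stages = [
    {"max_depth": [3, 5, 7, 9], "min_child_weight": [1, 3, 5]},   # step 2
    {"subsample": [0.6, 0.8, 1.0], "colsample_bytree": [0.6, 0.8, 1.0]},  # step 3
    {"gamma": [0, 0.1, 0.2], "reg_lambda": [0, 1, 5]},            # step 4
]
best_params, best_score = staged_search(toy_score, base, stages)
print(best_params, best_score)
```

A staged search evaluates far fewer combinations than a full grid over all parameters, at the cost of ignoring interactions between groups.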

Python Code Examples

Basic Usage

import numpy as np
import pandas as pd
import xgboost as xgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score, classification_report, roc_auc_score
import matplotlib.pyplot as plt

# Load the dataset
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# XGBoost classifier
model = xgb.XGBClassifier(
    n_estimators=100,
    learning_rate=0.1,
    max_depth=5,
    min_child_weight=1,
    subsample=0.8,
    colsample_bytree=0.8,
    gamma=0,
    reg_lambda=1,
    reg_alpha=0,
    objective='binary:logistic',
    eval_metric='logloss',
    random_state=42,
    n_jobs=-1
)

# Train
model.fit(X_train, y_train)

# Predict and evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("=== XGBoost Classification ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")

Early Stopping

# Train/Validation split
X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

model_es = xgb.XGBClassifier(
    n_estimators=1000,
    learning_rate=0.1,
    max_depth=5,
    early_stopping_rounds=50,
    eval_metric='logloss',
    random_state=42
)

model_es.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    verbose=False
)

print(f"Best iteration: {model_es.best_iteration}")
print(f"Best score: {model_es.best_score:.4f}")
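
The stopping rule itself can be sketched without XGBoost. The function below is a simplified illustration of patience-based early stopping over a list of per-round validation scores; `early_stopping_round` and the example scores are hypothetical, and the library's own callback may differ in details.

```python
def early_stopping_round(scores, patience, minimize=True):
    """Return (best_iteration, last_iteration_run) for per-round scores.

    Training stops once `patience` consecutive rounds fail to improve
    on the best score seen so far.
    """
    best_iter, best = 0, scores[0]
    for i, s in enumerate(scores[1:], start=1):
        improved = s < best if minimize else s > best
        if improved:
            best_iter, best = i, s
        elif i - best_iter >= patience:
            return best_iter, i
    return best_iter, len(scores) - 1


# Hypothetical validation log loss per boosting round.
val_logloss = [0.60, 0.50, 0.45, 0.46, 0.47, 0.44, 0.48, 0.49, 0.50]
print(early_stopping_round(val_logloss, patience=3))  # prints (5, 8)
print(early_stopping_round(val_logloss, patience=2))  # prints (2, 4)
```

A larger patience tolerates temporary plateaus (here it finds the better round 5), while a smaller one stops sooner and may miss later improvements.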

Feature Importance

# Compare several importance types
importance_types = ['weight', 'gain', 'cover']

fig, axes = plt.subplots(1, 3, figsize=(18, 6))

for ax, imp_type in zip(axes, importance_types):
    importance = model.get_booster().get_score(importance_type=imp_type)
    importance_df = pd.DataFrame({
        'feature': list(importance.keys()),
        'importance': list(importance.values())
    }).sort_values('importance', ascending=True).tail(15)

    ax.barh(importance_df['feature'], importance_df['importance'])
    ax.set_title(f'Feature Importance ({imp_type})')

plt.tight_layout()
plt.savefig('xgb_importance.png', dpi=150)
plt.show()

Learning Curve Visualization

# Track metrics on both the training and validation sets

model_curve = xgb.XGBClassifier(
    n_estimators=200,
    learning_rate=0.1,
    max_depth=5,
    eval_metric=['logloss', 'auc'],
    random_state=42
)

model_curve.fit(
    X_tr, y_tr,
    eval_set=[(X_tr, y_tr), (X_val, y_val)],
    verbose=False
)

results = model_curve.evals_result()

# Plot
fig, axes = plt.subplots(1, 2, figsize=(14, 5))

# Log Loss
axes[0].plot(results['validation_0']['logloss'], label='Train')
axes[0].plot(results['validation_1']['logloss'], label='Validation')
axes[0].set_xlabel('Boosting Round')
axes[0].set_ylabel('Log Loss')
axes[0].set_title('Learning Curve - Log Loss')
axes[0].legend()

# AUC
axes[1].plot(results['validation_0']['auc'], label='Train')
axes[1].plot(results['validation_1']['auc'], label='Validation')
axes[1].set_xlabel('Boosting Round')
axes[1].set_ylabel('AUC')
axes[1].set_title('Learning Curve - AUC')
axes[1].legend()

plt.tight_layout()
plt.savefig('xgb_learning_curve.png', dpi=150)
plt.show()

Hyperparameter Tuning (Optuna)

import optuna
from sklearn.model_selection import cross_val_score

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'max_depth': trial.suggest_int('max_depth', 3, 10),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'min_child_weight': trial.suggest_int('min_child_weight', 1, 10),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'gamma': trial.suggest_float('gamma', 0, 5),
        'reg_lambda': trial.suggest_float('reg_lambda', 0.01, 10, log=True),
        'reg_alpha': trial.suggest_float('reg_alpha', 0.01, 10, log=True),
        'eval_metric': 'logloss',
        'random_state': 42,
        'n_jobs': -1
    }

    model = xgb.XGBClassifier(**params)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    return scores.mean()

# Run the optimization
study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, show_progress_bar=True)

print("\nBest trial:")
print(f"  AUC: {study.best_trial.value:.4f}")
print(f"  Params: {study.best_trial.params}")

Using the Native API

# Build DMatrix objects from the train/validation split created in the early-stopping section
dtrain = xgb.DMatrix(X_tr, label=y_tr, feature_names=list(data.feature_names))
dval = xgb.DMatrix(X_val, label=y_val, feature_names=list(data.feature_names))

# Parameters
params = {
    'objective': 'binary:logistic',
    'eval_metric': ['logloss', 'auc'],
    'max_depth': 5,
    'eta': 0.1,
    'subsample': 0.8,
    'colsample_bytree': 0.8,
    'seed': 42
}

# Train. Early stopping monitors the last metric ('auc') on the last eval set,
# so a validation set is used here and the test set stays untouched.
evals = [(dtrain, 'train'), (dval, 'validation')]
bst = xgb.train(
    params,
    dtrain,
    num_boost_round=200,
    evals=evals,
    early_stopping_rounds=50,
    verbose_eval=False
)

print(f"Best iteration: {bst.best_iteration}")
print(f"Best score: {bst.best_score:.4f}")

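GPU Training

import subprocess

# Use the GPU (requires CUDA). XGBoost >= 2.0 selects the GPU with device='cuda';
# older releases used tree_method='gpu_hist', predictor='gpu_predictor', gpu_id=0.
model_gpu = xgb.XGBClassifier(
    n_estimators=100,
    tree_method='hist',  # histogram-based tree construction
    device='cuda',
    random_state=42
)

# Check that a GPU is available before training
try:
    subprocess.check_output(['nvidia-smi'])
    gpu_available = True
except (subprocess.CalledProcessError, FileNotFoundError):
    gpu_available = False

if gpu_available:
    model_gpu.fit(X_train, y_train)
    print("GPU training completed")
else:
    print("GPU not available, skipping")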

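When to Use It?

Good fit:
- Competitions such as Kaggle (where top predictive performance matters)
- Classification/regression on tabular data
- Large datasets (parallel/distributed training)
- Data with many missing values
- Feature importance analysis

Poor fit:
- Real-time prediction where inference latency is critical
- Very small datasets (risk of overfitting)
- Unstructured data such as images or text
- Cases where interpretability is the top priority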

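Pros and Cons

Pros	Cons
Top-tier performance on tabular data	Many hyperparameters
Built-in regularization	Hard to interpret
Automatic handling of missing values	Boosting rounds are sequential (limits parallelism)
GPU support	Overfits on small datasets
Early stopping	High memory usage
Distributed training support	Relatively long training time
Wide choice of loss functions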