LightGBM (Light Gradient Boosting Machine)¶

Paper Information¶

Item	Details
Title	LightGBM: A Highly Efficient Gradient Boosting Decision Tree
Authors	Guolin Ke, Qi Meng, Thomas Finley, Taifeng Wang, Wei Chen, Weidong Ma, Qiwei Ye, Tie-Yan Liu
Venue	NeurIPS (NIPS)
Year	2017
Link	https://proceedings.neurips.cc/paper/2017/hash/6449f44a102fde848669bdd9eb6b76fa-Abstract.html

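Overview¶

Problem Definition¶

Efficiency problems with conventional GBDT (Gradient Boosting Decision Tree):

Training takes too long on large datasets
Split finding scans every data instance for every feature, which is wasteful
Memory usage is excessive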

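Key Ideas¶

Two core techniques:

GOSS (Gradient-based One-Side Sampling):
Keep all samples with large gradients
Randomly sample from the samples with small gradients
Reduce the data size while minimizing the loss of information
EFB (Exclusive Feature Bundling):
Bundle mutually exclusive features into a single feature
Greatly reduce the number of features on sparse data
Formulated as a graph coloring problem and solved approximately with a greedy algorithm

Additional optimizations:

Leaf-wise tree growth (vs. level-wise)
Histogram-based split finding
Optimized parallel/distributed training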

Algorithms and Formulas¶

GOSS (Gradient-based One-Side Sampling)¶

Algorithm: GOSS

Input: training data, top-gradient ratio a, small-gradient sampling ratio b
Output: sampled data

1. Sort samples by |gradient| in descending order

2. Select the top a * 100% of samples (high gradient) -> set A

3. From the remaining (1-a) * 100% of samples, randomly draw b * 100% of the full dataset -> set B

4. Amplify the weight of samples in B by (1-a)/b
   (correction factor: compensates for under-sampling the small-gradient samples)

5. Return A union B with adjusted weights

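Why it works:
- Samples with large gradients contribute more information to training
- Samples with small gradients are already well fit, so only a subset is needed
- The weight correction minimizes the bias introduced by sampling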
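The GOSS steps above can be sketched in plain Python. This is an illustrative sketch, not LightGBM's implementation: the function name `goss_sample` and the toy gradients are hypothetical, and it assumes that b is a fraction of the full dataset, which is what makes (1-a)/b the matching correction factor.

```python
import random


def goss_sample(gradients, a=0.2, b=0.1, seed=0):
    """Return (indices, weights) chosen by a GOSS-style sampling step.

    a: fraction of all rows kept because they have the largest |gradient|
    b: fraction of all rows drawn at random from the remaining rows
    """
    n = len(gradients)
    top_k = round(a * n)
    rand_k = round(b * n)

    # Step 1: order row indices by |gradient|, largest first.
    order = sorted(range(n), key=lambda i: abs(gradients[i]), reverse=True)

    # Step 2: keep every large-gradient row (set A).
    set_a = order[:top_k]

    # Step 3: draw rand_k rows uniformly from the rest (set B).
    rng = random.Random(seed)
    set_b = rng.sample(order[top_k:], rand_k)

    # Step 4: up-weight set B so it stands in for all small-gradient rows.
    weights = {i: 1.0 for i in set_a}
    weights.update({i: (1.0 - a) / b for i in set_b})

    # Step 5: A union B with adjusted weights.
    return set_a + set_b, weights


grads = [0.9, -0.05, 0.02, -0.8, 0.01, 0.03, -0.02, 0.04, 0.7, -0.01]
idx, w = goss_sample(grads, a=0.3, b=0.2)
```

With n = 10, a = 0.3 and b = 0.2, the three rows with the largest |gradient| (indices 0, 3, 8) are always kept, two of the other seven are drawn at random, and the weights sum back to roughly 10, the original row count.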

EFB (Exclusive Feature Bundling)¶

Background: sparse features rarely take non-zero values at the same time

Algorithm: EFB

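1. Build a conflict graph between features
   - Node: each feature
   - Edge weight: number of samples where both features are non-zero (conflicts)

2. Create bundles via graph coloring
   - Put features with few conflicts into the same bundle
   - NP-hard -> greedy approximation

3. Merge the features in each bundle into one
   - Separate value ranges with an offset
   - feature_1: [0, 10], feature_2: [0, 5]
   - bundle: feature_1 keeps its values in [0, 10]; non-zero feature_2 values are stored as feature_2 + 10, i.e. in (10, 15]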
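A small pure-Python sketch of the bundling and merging steps may make the offset trick concrete. The helper names (`conflicts`, `greedy_bundle`, `merge_bundle`), the ordering heuristic, and the toy columns are assumptions made for illustration; LightGBM's actual implementation works on binned data and differs in detail.

```python
def conflicts(col_a, col_b):
    """Count rows where both features are non-zero."""
    return sum(1 for x, y in zip(col_a, col_b) if x != 0 and y != 0)


def greedy_bundle(columns, max_conflicts=0):
    """Put each feature into the first bundle it (almost) never conflicts with."""
    bundles = []  # each bundle is a list of feature indices
    # Visit features with more non-zero entries first (a simple heuristic).
    order = sorted(range(len(columns)),
                   key=lambda j: sum(v != 0 for v in columns[j]),
                   reverse=True)
    for j in order:
        for bundle in bundles:
            total = sum(conflicts(columns[j], columns[k]) for k in bundle)
            if total <= max_conflicts:
                bundle.append(j)
                break
        else:
            bundles.append([j])  # no compatible bundle found
    return bundles


def merge_bundle(columns, bundle, ranges):
    """Merge bundled features into one column, shifting each by an offset."""
    merged = [0] * len(columns[0])
    offset = 0
    for j in bundle:
        for i, v in enumerate(columns[j]):
            if v != 0:
                merged[i] = v + offset
        offset += ranges[j]  # the next feature starts above this one's range
    return merged


f0 = [3, 0, 0, 7, 0, 0]   # values in [0, 10]
f1 = [0, 2, 0, 0, 5, 0]   # values in [0, 5]
f2 = [1, 0, 4, 0, 0, 0]   # conflicts with f0 on the first row
columns = [f0, f1, f2]

bundles = greedy_bundle(columns)
merged = merge_bundle(columns, bundles[0], ranges=[10, 5, 4])
```

Here f0 and f1 never overlap, so they share a bundle and f1's values are shifted by 10 (giving `[3, 12, 0, 7, 15, 0]`), while f2 conflicts with f0 and ends up in a bundle of its own.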

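Leaf-wise vs Level-wise Tree Growth¶

Level-wise (e.g. XGBoost's default):
    Level 1:    [    root    ]
    Level 2:  [  L  ]    [  R  ]
    Level 3: [a][b]     [c][d]
    -> Balanced tree, may perform unnecessary splits

Leaf-wise (LightGBM):
    1. Select the leaf with the largest loss reduction
    2. Split only that leaf
    3. Repeat
    -> Asymmetric tree, reaches the same loss reduction with fewer splits
    -> Watch for overfitting (limit max_depth or num_leaves)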
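The difference between the two growth strategies can be illustrated with a toy tree whose split gains are fixed in advance. Everything here is hypothetical: the gains list, the heap-style node numbering (children of node i are 2i+1 and 2i+2), and the function names are assumptions chosen only to show best-first selection, not how LightGBM stores trees.

```python
import heapq


def leaf_wise(gains, num_leaves):
    """Best-first growth: always split the leaf with the largest gain."""
    heap = [(-gains[0], 0)]            # max-heap via negated gains
    splits, total, leaves = [], 0.0, 1
    while heap and leaves < num_leaves:
        neg_gain, node = heapq.heappop(heap)
        splits.append(node)
        total += -neg_gain
        leaves += 1                    # one leaf becomes two
        for child in (2 * node + 1, 2 * node + 2):
            if child < len(gains):     # nodes without a listed gain stay leaves
                heapq.heappush(heap, (-gains[child], child))
    return splits, total


def level_wise(gains, max_depth):
    """Depth-first-by-level growth: split every node down to max_depth."""
    n_internal = min(2 ** max_depth - 1, len(gains))
    return list(range(n_internal)), sum(gains[:n_internal])


# Hypothetical loss reduction obtained by splitting each node.
gains = [10.0, 6.0, 1.0, 4.0, 0.5, 0.2, 0.1]

leaf_splits, leaf_total = leaf_wise(gains, num_leaves=4)
level_splits, level_total = level_wise(gains, max_depth=2)
```

Both trees end with four leaves, but the leaf-wise tree splits nodes 0, 1, 3 for a total gain of 20.0, while the level-wise tree splits nodes 0, 1, 2 for 17.0, spending one split on a low-gain node.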

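Histogram-based Splitting¶

1. Discretize continuous features into K bins (max_bin defaults to 255)
2. Accumulate gradient statistics per bin; split candidates per feature drop from O(n) to O(K) (building the histogram is still one O(n) pass)
3. Histogram subtraction:
   child_hist = parent_hist - sibling_hist
   -> Build the histogram only for the smaller child; obtain the larger child by subtraction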
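The three steps can be sketched for a single, already-binned feature. This is a simplified illustration under stated assumptions: the function names are hypothetical, each bin stores only the gradient sum and the row count, and the split score (squared gradient sum divided by count on each side) is a simplified variance-style gain rather than LightGBM's exact formula.

```python
def build_histogram(bin_ids, grads, n_bins):
    """One O(n) pass: accumulate [gradient sum, row count] per bin."""
    hist = [[0.0, 0.0] for _ in range(n_bins)]
    for b, g in zip(bin_ids, grads):
        hist[b][0] += g
        hist[b][1] += 1
    return hist


def best_split(hist):
    """O(K) scan over bin boundaries instead of over all rows."""
    total_g = sum(g for g, _ in hist)
    total_n = sum(n for _, n in hist)
    best_bin, best_gain = None, float("-inf")
    left_g = left_n = 0.0
    for b in range(len(hist) - 1):     # split between bin b and bin b + 1
        left_g += hist[b][0]
        left_n += hist[b][1]
        right_g, right_n = total_g - left_g, total_n - left_n
        if left_n == 0 or right_n == 0:
            continue
        gain = left_g ** 2 / left_n + right_g ** 2 / right_n
        if gain > best_gain:
            best_bin, best_gain = b, gain
    return best_bin, best_gain


def subtract(parent, sibling):
    """Histogram subtraction: child_hist = parent_hist - sibling_hist."""
    return [[pg - sg, pn - sn] for (pg, pn), (sg, sn) in zip(parent, sibling)]


bin_ids = [0, 0, 1, 1, 2, 2, 3, 3]
grads = [-1.0, -1.0, -1.0, -1.0, 1.0, 1.0, 1.0, 1.0]

parent = build_histogram(bin_ids, grads, n_bins=4)
split_bin, split_gain = best_split(parent)

# Rows with bin <= split_bin go left; build only that child directly.
left = build_histogram(bin_ids[:4], grads[:4], n_bins=4)
right = subtract(parent, left)
```

On this toy data the best boundary is after bin 1, and the right child's histogram obtained by subtraction matches what a direct pass over the right-hand rows would produce.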

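Time Complexity¶

Method	Complexity
Pre-sorted (XGBoost exact)	O(n * d) to scan split candidates
Histogram (LightGBM)	O(n * d) to build histograms, O(bins * d) to find splits
With GOSS	O((a + b) * n * d) to build histograms
With EFB	O(n * d') to build histograms, where d' << d is the number of bundles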

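Hyperparameter Guide¶

Key Parameters¶

Parameter	Description	Recommended range	Default
num_leaves	Number of leaves (tree complexity)	20 ~ 150	31
max_depth	Maximum depth	-1 (unlimited) or 5 ~ 15	-1
learning_rate	Learning rate	0.01 ~ 0.3	0.1
n_estimators	Boosting rounds	100 ~ 1000	100
min_child_samples	Minimum samples per leaf	10 ~ 100	20
subsample (bagging_fraction)	Row sampling (takes effect only when subsample_freq / bagging_freq > 0)	0.5 ~ 1.0	1.0
colsample_bytree (feature_fraction)	Column sampling	0.5 ~ 1.0	1.0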

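Regularization Parameters¶

Parameter	Description	Recommended range	Default
reg_alpha (lambda_l1)	L1 regularization	0 ~ 10	0
reg_lambda (lambda_l2)	L2 regularization	0 ~ 10	0
min_gain_to_split	Minimum gain required to split	0 ~ 1	0
min_child_weight	Minimum sum of Hessians per leaf	0.001 ~ 10	0.001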

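GOSS/EFB Parameters¶

Parameter	Description	Recommended value
boosting	Boosting type	'gbdt', 'goss', 'dart' (LightGBM 4.x prefers data_sample_strategy='goss' with boosting='gbdt')
top_rate	GOSS top ratio (a)	0.2
other_rate	GOSS sampling ratio for small gradients (b)	0.1
enable_bundle	Enable EFB	True
max_conflict_rate	Maximum EFB conflict rate	0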

num_leaves vs max_depth¶

Relationship: num_leaves <= 2^max_depth

Recommendations:
- Tune num_leaves first
- Use max_depth as a safety limit
- Avoid num_leaves = 2^max_depth (a balanced tree gives up the advantage of leaf-wise growth)

Example:
- With max_depth=7, set num_leaves to roughly 70~100 (below 128)
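The relationship can be checked with a couple of lines of arithmetic. The helper names and the 0.6 fraction below are arbitrary assumptions chosen to land inside the range suggested above, not an official LightGBM rule.

```python
def num_leaves_cap(max_depth):
    """Largest number of leaves a tree of this depth can have."""
    return 2 ** max_depth


def suggest_num_leaves(max_depth, fraction=0.6):
    """Hypothetical starting point: a fraction of the cap, at least 2."""
    return max(2, int(num_leaves_cap(max_depth) * fraction))


cap = num_leaves_cap(7)
start = suggest_num_leaves(7)
```

For max_depth=7 the cap is 128 and the suggested starting value is 76, which stays well below the balanced-tree limit.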

Python Code Examples¶

Basic Usage¶

import numpy as np
import pandas as pd
import lightgbm as lgb
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, roc_auc_score, classification_report

# Load the data
data = load_breast_cancer()
X, y = data.data, data.target

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# LightGBM classifier
model = lgb.LGBMClassifier(
    n_estimators=100,
    learning_rate=0.1,
    num_leaves=31,
    max_depth=-1,
    min_child_samples=20,
    subsample=0.8,
    subsample_freq=1,  # row sampling is only applied when subsample_freq > 0
    colsample_bytree=0.8,
    reg_alpha=0,
    reg_lambda=1,
    random_state=42,
    n_jobs=-1,
    verbose=-1
)

model.fit(X_train, y_train)

# Evaluate
y_pred = model.predict(X_test)
y_prob = model.predict_proba(X_test)[:, 1]

print("=== LightGBM Classification ===")
print(f"Accuracy: {accuracy_score(y_test, y_pred):.4f}")
print(f"ROC-AUC: {roc_auc_score(y_test, y_prob):.4f}")

Early Stopping¶

X_tr, X_val, y_tr, y_val = train_test_split(
    X_train, y_train, test_size=0.2, random_state=42
)

model_es = lgb.LGBMClassifier(
    n_estimators=1000,
    learning_rate=0.05,
    num_leaves=31,
    random_state=42,
    verbose=-1
)

model_es.fit(
    X_tr, y_tr,
    eval_set=[(X_val, y_val)],
    eval_metric='logloss',
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),
        lgb.log_evaluation(period=0)
    ]
)

print(f"Best iteration: {model_es.best_iteration_}")
print(f"Best score: {model_es.best_score_['valid_0']['binary_logloss']:.4f}")

Using the Native API¶

# Create the Datasets
train_data = lgb.Dataset(X_tr, label=y_tr)
valid_data = lgb.Dataset(X_val, label=y_val, reference=train_data)

# Parameters
params = {
    'objective': 'binary',
    'metric': ['binary_logloss', 'auc'],
    'boosting': 'gbdt',
    'num_leaves': 31,
    'learning_rate': 0.05,
    'feature_fraction': 0.8,
    'bagging_fraction': 0.8,
    'bagging_freq': 5,
    'verbose': -1,
    'seed': 42
}

# Train
bst = lgb.train(
    params,
    train_data,
    num_boost_round=1000,
    valid_sets=[train_data, valid_data],
    valid_names=['train', 'valid'],
    callbacks=[
        lgb.early_stopping(stopping_rounds=50),
        lgb.log_evaluation(period=100)
    ]
)

print(f"Best iteration: {bst.best_iteration}")

Feature Importance¶

# Two kinds of importance: number of splits and total gain
importance_split = model.booster_.feature_importance(importance_type='split')
importance_gain = model.booster_.feature_importance(importance_type='gain')

importance_df = pd.DataFrame({
    'feature': data.feature_names,
    'split': importance_split,
    'gain': importance_gain
}).sort_values('gain', ascending=False)

print("\nFeature Importance (Top 10):")
print(importance_df.head(10).to_string(index=False))

# Visualization
import matplotlib.pyplot as plt  # not imported earlier in this example

fig, axes = plt.subplots(1, 2, figsize=(14, 6))

lgb.plot_importance(model, importance_type='split', ax=axes[0], 
                    title='Split Importance', max_num_features=15)
lgb.plot_importance(model, importance_type='gain', ax=axes[1],
                    title='Gain Importance', max_num_features=15)

plt.tight_layout()
plt.savefig('lgbm_importance.png', dpi=150)
plt.show()

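Using GOSS¶

# GOSS (Gradient-based One-Side Sampling)
# Note: LightGBM 4.x also accepts data_sample_strategy='goss' and keeps
# boosting_type='goss' for backward compatibility.
model_goss = lgb.LGBMClassifier(
    n_estimators=100,
    boosting_type='goss',  # enable GOSS
    top_rate=0.2,          # keep the top 20% by |gradient|
    other_rate=0.1,        # sample another 10% of the data from the rest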
    learning_rate=0.1,
    num_leaves=31,
    random_state=42,
    verbose=-1
)

model_goss.fit(X_train, y_train)
y_pred_goss = model_goss.predict(X_test)
print(f"GOSS Accuracy: {accuracy_score(y_test, y_pred_goss):.4f}")

Handling Categorical Features¶

# When the data has categorical features
# LightGBM can handle them directly (no one-hot encoding needed)

# Generate example data
df = pd.DataFrame({
    'num_feature': np.random.randn(1000),
    'cat_feature': np.random.choice(['A', 'B', 'C'], 1000)
})
y_cat = np.random.randint(0, 2, 1000)

# Convert the categorical column to the pandas category dtype
df['cat_feature'] = df['cat_feature'].astype('category')

model_cat = lgb.LGBMClassifier(verbose=-1)
model_cat.fit(
    df, y_cat,
    categorical_feature=['cat_feature']
)

print("Categorical feature handled natively")

Hyperparameter Tuning (Optuna)¶

import optuna

def objective(trial):
    params = {
        'n_estimators': trial.suggest_int('n_estimators', 100, 500),
        'num_leaves': trial.suggest_int('num_leaves', 20, 150),
        'max_depth': trial.suggest_int('max_depth', 3, 12),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'min_child_samples': trial.suggest_int('min_child_samples', 5, 100),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'subsample_freq': 1,  # bagging is only applied when subsample_freq > 0
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
        'random_state': 42,
        'verbose': -1,
        'n_jobs': -1
    }

    model = lgb.LGBMClassifier(**params)

    # Hold out a validation split for early stopping
    X_tr, X_val, y_tr, y_val = train_test_split(
        X_train, y_train, test_size=0.2, random_state=42
    )

    model.fit(
        X_tr, y_tr,
        eval_set=[(X_val, y_val)],
        callbacks=[
            lgb.early_stopping(50),
            lgb.log_evaluation(0)
        ]
    )

    y_prob = model.predict_proba(X_val)[:, 1]
    return roc_auc_score(y_val, y_prob)

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=50, show_progress_bar=True)

print(f"\nBest AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

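When to Use It?¶

Good fit:
- Large datasets (millions of rows)
- High-dimensional sparse data (e.g. text features)
- When fast training is required
- Many categorical features
- Memory-constrained environments

Poor fit:
- Very small datasets (< 1000 rows)
- Cases where the overfitting risk from leaf-wise growth is a concern
- When balanced trees are required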

LightGBM vs XGBoost¶

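Item	LightGBM	XGBoost
Tree growth	Leaf-wise	Level-wise by default (grow_policy='lossguide' is also available)
Speed	Fast	Moderate
Memory	Low	High
Categorical handling	Native support	One-hot encoding, or native support via enable_categorical in newer releases
Small data	Overfitting risk	Stable
Large data	Well suited	Slower
GPU support	Yes	Yes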

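Pros and Cons¶

Pros	Cons
Very fast training	Overfits on small data
Low memory usage	num_leaves tuning is important
Direct handling of categorical features	Can be less stable than level-wise growth
Efficient on sparse data
GOSS/EFB optimizations
High accuracy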