
AutoML Overview

AutoML (Automated Machine Learning) automates the design of machine learning pipelines, including hyperparameter tuning and feature engineering, making it possible to build high-quality models without deep ML expertise.


Core Concepts

Scope of AutoML

AutoML Pipeline
├── Automated Data Preprocessing
│   ├── Missing value imputation
│   ├── Outlier detection
│   └── Feature scaling
├── Automated Feature Engineering
│   ├── Feature generation
│   ├── Feature selection
│   └── Feature transformation
├── Model Selection (CASH)
│   └── Combined Algorithm Selection and Hyperparameter optimization
├── Hyperparameter Optimization (HPO)
│   ├── Grid Search
│   ├── Random Search
│   ├── Bayesian Optimization
│   └── Evolutionary Algorithms
├── Neural Architecture Search (NAS)
│   ├── Architecture search space
│   ├── Search strategy
│   └── Performance estimation
└── Model Ensembling
    └── Stacking, Blending

Hyperparameter Optimization (HPO)

Grid Search evaluates every combination exhaustively (inefficient):

from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 500]
}

grid_search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")

Random Search samples configurations at random (more efficient):

from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
from xgboost import XGBClassifier

param_dist = {
    'max_depth': randint(3, 15),
    'learning_rate': uniform(0.01, 0.3),
    'n_estimators': randint(100, 1000)
}

random_search = RandomizedSearchCV(
    estimator=XGBClassifier(),
    param_distributions=param_dist,
    n_iter=100,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)

Bayesian Optimization

Efficient search that exploits previous evaluation results: a probabilistic surrogate model is fit to the observed trials, and an acquisition function chooses the next candidate to evaluate.

\[\text{surrogate model: } f(x) \sim GP(\mu(x), k(x, x'))\]

Acquisition functions: Expected Improvement (EI), Upper Confidence Bound (UCB), etc.
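How the surrogate and the acquisition function interact can be shown on a 1-D toy problem. A minimal sketch assuming scikit-learn's GaussianProcessRegressor; `expected_improvement` is an illustrative helper written here, not a library function:

```python
# Expected Improvement under a GP surrogate (maximization form).
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import RBF

def expected_improvement(X_cand, gp, y_best, xi=0.01):
    """EI(x) = E[max(f(x) - y_best - xi, 0)] under the GP posterior."""
    mu, sigma = gp.predict(X_cand, return_std=True)
    sigma = np.maximum(sigma, 1e-9)  # avoid division by zero at observed points
    z = (mu - y_best - xi) / sigma
    return (mu - y_best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Toy objective to maximize: f(x) = -(x - 2)^2 on [0, 5]
f = lambda x: -(x - 2.0) ** 2
X_obs = np.array([[0.5], [4.0], [4.5]])       # evaluations so far
y_obs = f(X_obs).ravel()

gp = GaussianProcessRegressor(kernel=RBF(length_scale=1.0)).fit(X_obs, y_obs)

# Score a grid of candidates and pick the EI maximizer as the next query
X_cand = np.linspace(0, 5, 101).reshape(-1, 1)
ei = expected_improvement(X_cand, gp, y_obs.max())
x_next = X_cand[np.argmax(ei)][0]
print(f"next query point: {x_next:.2f}")
```

EI is high where the posterior mean is promising or the posterior uncertainty is large, which is exactly the explore/exploit trade-off the text describes.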

import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
    }

    model = XGBClassifier(**params, random_state=42)

    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, timeout=3600)

print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")

References:

  • Bergstra, J. & Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization". JMLR.
  • Snoek, J. et al. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms". NeurIPS.


AutoML Frameworks

Auto-sklearn

Built on the scikit-learn ecosystem:

import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,  # 1 hour
    per_run_time_limit=300,
    ensemble_size=50,
    n_jobs=-1
)

automl.fit(X_train, y_train)
predictions = automl.predict(X_test)

# Model info
print(automl.leaderboard())
print(automl.sprint_statistics())

H2O AutoML

import h2o
from h2o.automl import H2OAutoML

h2o.init()

train = h2o.H2OFrame(df_train)
test = h2o.H2OFrame(df_test)

aml = H2OAutoML(
    max_runtime_secs=3600,
    max_models=20,
    seed=42
)

aml.train(x=features, y=target, training_frame=train)

# Leaderboard
lb = aml.leaderboard
print(lb.head())

# Predict with the best model
predictions = aml.leader.predict(test)

FLAML (Microsoft)

A lightweight, fast AutoML library:

from flaml import AutoML

automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=600,  # 10 minutes
    metric="roc_auc",
    estimator_list=["lgbm", "xgboost", "rf", "extra_tree"]
)

print(f"Best model: {automl.best_estimator}")
print(f"Best loss (1 - roc_auc): {automl.best_loss}")  # FLAML reports a loss, not the metric itself
predictions = automl.predict(X_test)

PyCaret

A low-code ML library:

from pycaret.classification import *

# Set up the environment
clf = setup(data=df, target='target', session_id=42)

# Compare all candidate models
best_model = compare_models()

# Tune the best model's hyperparameters
tuned_model = tune_model(best_model)

# Ensemble (bind to a new name so the ensemble_model function is not shadowed)
bagged_model = ensemble_model(tuned_model)

# Predict on new data
predictions = predict_model(bagged_model, data=test_df)

Neural Architecture Search (NAS)

Search Space

Search Space
├── Layer types (Conv, FC, RNN, Attention)
├── Layer connections (skip connections)
├── Hyperparameters (filters, kernel size)
└── Activation functions

Search Strategies

Strategy                   Description                               Example
Random Search              Randomly sample architectures             -
Reinforcement Learning     A controller generates architectures      NASNet
Evolutionary               Evolutionary algorithms                   AmoebaNet
One-shot / Weight Sharing  Train a supernet with shared weights      DARTS, ENAS
Differentiable             Continuous relaxation, gradient descent   DARTS

# DARTS-style search space example (NNI + PyTorch)
import nni.retiarii.nn.pytorch as nn
from nni.retiarii import model_wrapper

@model_wrapper
class SearchSpace(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.LayerChoice([
            nn.Conv2d(3, 32, 3, padding=1),
            nn.Conv2d(3, 32, 5, padding=2),
            nn.Identity()
        ])
        # ...

References:

  • Zoph, B. & Le, Q.V. (2017). "Neural Architecture Search with Reinforcement Learning". ICLR.
  • Liu, H. et al. (2019). "DARTS: Differentiable Architecture Search". ICLR.


Practical Guide

Choosing an AutoML Tool

Tool          Strengths                            Best suited for
Auto-sklearn  scikit-learn compatible, ensembling  Tabular data, research
H2O AutoML    Scalability, many algorithms         Large data, enterprise settings
FLAML         Fast, resource-efficient             Quick prototyping
PyCaret       Low-code, visualization              Fast experiments, teaching
Optuna        Flexible HPO                         Custom pipelines

Caveats

  1. Overfitting risk: always hold out a separate validation set
  2. Time/resource limits: set an appropriate budget
  3. Interpretability: you still need to understand the final model
  4. Domain knowledge: AutoML complements it, it does not replace it
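Point 1 above made concrete: keep a final test split that the AutoML/HPO search never touches, and score it exactly once after the search is finished. A minimal sketch, using scikit-learn's bundled breast-cancer dataset as a stand-in for real data:

```python
# 60% search / 20% validation / 20% untouched final test split.
from sklearn.datasets import load_breast_cancer
from sklearn.model_selection import train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Carve off the final test set first; the search must never see it
X_search, X_test, y_search, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
# Split the remainder into what the search trains on and validates against
X_train, X_valid, y_train, y_valid = train_test_split(
    X_search, y_search, test_size=0.25, random_state=42, stratify=y_search
)

# The AutoML/HPO tool (Optuna, FLAML, ...) only ever sees X_train / X_valid.
print(len(X_train), len(X_valid), len(X_test))
```

Because the search optimizes against the validation set, validation scores are optimistically biased; only the untouched test split gives an honest estimate of the chosen model.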

References

Key Papers

  • Feurer, M. et al. (2015). "Efficient and Robust Automated Machine Learning". NeurIPS.
  • Zoph, B. & Le, Q.V. (2017). "Neural Architecture Search with Reinforcement Learning". ICLR.
  • Liu, H. et al. (2019). "DARTS: Differentiable Architecture Search". ICLR.

Frameworks

  • Auto-sklearn: https://automl.github.io/auto-sklearn/
  • H2O AutoML: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
  • FLAML: https://microsoft.github.io/FLAML/
  • Optuna: https://optuna.org/
  • PyCaret: https://pycaret.org/