AutoML Overview¶
AutoML (Automated Machine Learning) automates the design of machine learning pipelines, hyperparameter tuning, feature engineering, and more, making it possible to build high-quality models without deep ML expertise.
Core Concepts¶
Scope of AutoML¶
AutoML Pipeline
├── Automated Data Preprocessing
│   ├── Missing value imputation
│   ├── Outlier detection
│   └── Feature scaling
├── Automated Feature Engineering
│   ├── Feature generation
│   ├── Feature selection
│   └── Feature transformation
├── Model Selection (CASH)
│   └── Combined Algorithm Selection and Hyperparameter optimization
├── Hyperparameter Optimization (HPO)
│   ├── Grid Search
│   ├── Random Search
│   ├── Bayesian Optimization
│   └── Evolutionary Algorithms
├── Neural Architecture Search (NAS)
│   ├── Architecture search space
│   ├── Search strategy
│   └── Performance estimation
└── Model Ensembling
    └── Stacking, Blending
Hyperparameter Optimization (HPO)¶
Grid Search¶
Exhaustively evaluates every combination (inefficient):
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 500]
}
grid_search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
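The cost of Grid Search grows multiplicatively with each parameter: the grid above has 4 × 3 × 3 = 36 combinations, and with cv=5 each combination is fit 5 times, for 180 fits in total. A quick check:

```python
from itertools import product

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 500]
}

# Every combination is evaluated once per CV fold
n_combos = len(list(product(*param_grid.values())))
n_fits = n_combos * 5  # cv=5
print(n_combos, n_fits)  # 36 180
```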
Random Search¶
Samples randomly from parameter distributions (more efficient):
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
from xgboost import XGBClassifier

param_dist = {
    'max_depth': randint(3, 15),
    'learning_rate': uniform(0.01, 0.3),
    'n_estimators': randint(100, 1000)
}
random_search = RandomizedSearchCV(
    estimator=XGBClassifier(),
    param_distributions=param_dist,
    n_iter=100,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
Bayesian Optimization¶
Efficient search that uses the results of previous evaluations to decide what to try next:
\[\text{surrogate model: } f(x) \sim GP(\mu(x), k(x, x'))\]
Acquisition functions: Expected Improvement (EI), Upper Confidence Bound (UCB), etc.
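Under a GP surrogate, Expected Improvement has a closed form. For maximization, EI(x) = (μ(x) − f⁺ − ξ)Φ(z) + σ(x)φ(z), where z = (μ(x) − f⁺ − ξ)/σ(x) and f⁺ is the best value observed so far. A toy 1-D sketch with scikit-learn's GaussianProcessRegressor; the objective function and candidate grid are illustrative:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

# Toy objective (to maximize) and a few observed points
def f(x):
    return -(x - 2.0) ** 2

X_obs = np.array([[0.0], [1.0], [3.0], [4.0]])
y_obs = f(X_obs).ravel()

# GP surrogate fitted to the observations
gp = GaussianProcessRegressor(normalize_y=True).fit(X_obs, y_obs)

def expected_improvement(X, xi=0.01):
    mu, sigma = gp.predict(X, return_std=True)
    best = y_obs.max()
    z = (mu - best - xi) / np.maximum(sigma, 1e-9)
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Pick the next evaluation point by maximizing EI over a candidate grid
X_cand = np.linspace(0.0, 4.0, 401).reshape(-1, 1)
ei = expected_improvement(X_cand)
x_next = X_cand[np.argmax(ei), 0]
print(x_next)
```

EI is highest where the posterior mean is promising or the uncertainty is large, so the suggested point lands near the unexplored optimum around x = 2.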
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
    }
    model = XGBClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, timeout=3600)

print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
References:
- Bergstra, J. & Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization". JMLR.
- Snoek, J. et al. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms". NeurIPS.
AutoML Frameworks¶
Auto-sklearn¶
Built on the scikit-learn ecosystem:
import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,  # 1 hour total budget
    per_run_time_limit=300,
    ensemble_size=50,
    n_jobs=-1
)
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)

# Model information
print(automl.leaderboard())
print(automl.sprint_statistics())
H2O AutoML¶
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.H2OFrame(df_train)
test = h2o.H2OFrame(df_test)

aml = H2OAutoML(
    max_runtime_secs=3600,
    max_models=20,
    seed=42
)
aml.train(x=features, y=target, training_frame=train)

# Leaderboard
lb = aml.leaderboard
print(lb.head())

# Predict with the best model
predictions = aml.leader.predict(test)
FLAML (Microsoft)¶
A lightweight, fast AutoML library:
from flaml import AutoML

automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=600,  # 10 minutes
    metric="roc_auc",
    estimator_list=["lgbm", "xgboost", "rf", "extra_tree"]
)

print(f"Best model: {automl.best_estimator}")
print(f"Best loss: {automl.best_loss}")  # FLAML minimizes loss, i.e. 1 - roc_auc
predictions = automl.predict(X_test)
PyCaret¶
A low-code ML library:
from pycaret.classification import *

# Set up the environment
clf = setup(data=df, target='target', session_id=42)

# Compare all available models
best_model = compare_models()

# Tune the best model's hyperparameters
tuned_model = tune_model(best_model)

# Bagging ensemble (use a distinct name to avoid shadowing the ensemble_model function)
bagged_model = ensemble_model(tuned_model)

# Predict on new data
predictions = predict_model(bagged_model, data=test_df)
Neural Architecture Search (NAS)¶
Search Space¶
Search Space
├── Layer types (Conv, FC, RNN, Attention)
├── Layer connections (skip connections)
├── Hyperparameters (filters, kernel size)
└── Activation functions
Search Strategies¶
| Strategy | Description | Examples |
|---|---|---|
| Random Search | Randomly sample architectures | - |
| Reinforcement Learning | A controller network generates architectures | NASNet |
| Evolutionary | Evolutionary algorithms mutate and select architectures | AmoebaNet |
| One-shot / Weight Sharing | Train a supernet with shared weights | DARTS, ENAS |
| Differentiable | Continuous relaxation, then gradient-based optimization | DARTS |
# DARTS-style search space example (NNI Retiarii / PyTorch)
import nni.retiarii.nn.pytorch as nn
from nni.retiarii import model_wrapper

@model_wrapper
class SearchSpace(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.LayerChoice([
            nn.Conv2d(3, 32, 3, padding=1),
            nn.Conv2d(3, 32, 5, padding=2),
            nn.Identity()
        ])
        # ...
References:
- Zoph, B. & Le, Q.V. (2017). "Neural Architecture Search with Reinforcement Learning". ICLR.
- Liu, H. et al. (2019). "DARTS: Differentiable Architecture Search". ICLR.
Practical Guide¶
Choosing an AutoML Tool¶
| Tool | Strengths | Best For |
|---|---|---|
| Auto-sklearn | scikit-learn compatible, built-in ensembling | Tabular data, research |
| H2O AutoML | Scalability, wide algorithm coverage | Large datasets, enterprise |
| FLAML | Fast, resource-efficient | Quick prototyping |
| PyCaret | Low-code, built-in visualization | Rapid experiments, education |
| Optuna | Flexible HPO | Custom pipelines |
Caveats¶
- Overfitting risk: always hold out a separate validation/test set
- Time/compute limits: set an appropriate search budget
- Interpretability: make sure the final model can be understood
- Domain knowledge: AutoML complements expertise, it does not replace it
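The overfitting caveat above is worth making concrete: report performance on a test set the search never touches, not the (optimistic) CV score used for tuning. A minimal sketch; the dataset and grid are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that the hyperparameter search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# HPO runs cross-validation on the training portion only
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"max_depth": [3, 6], "n_estimators": [100, 200]},
    cv=5,
    scoring="roc_auc",
).fit(X_train, y_train)

# Report the untouched test set, not the CV score the search optimized
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(f"CV AUC: {search.best_score_:.3f}, Test AUC: {test_auc:.3f}")
```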
References¶
Key Papers¶
- Feurer, M. et al. (2015). "Efficient and Robust Automated Machine Learning". NeurIPS.
- Zoph, B. & Le, Q.V. (2017). "Neural Architecture Search with Reinforcement Learning". ICLR.
- Liu, H. et al. (2019). "DARTS: Differentiable Architecture Search". ICLR.
Frameworks¶
- Auto-sklearn: https://automl.github.io/auto-sklearn/
- H2O AutoML: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
- FLAML: https://microsoft.github.io/FLAML/
- Optuna: https://optuna.org/
- PyCaret: https://pycaret.org/