AutoML Overview¶
AutoML (Automated Machine Learning) automates the design of machine learning pipelines, hyperparameter tuning, feature engineering, and more, making it possible to build high-quality models without deep ML expertise.
Core Concepts¶
Scope of AutoML¶
AutoML Pipeline
├── Automated Data Preprocessing
│   ├── Missing value imputation
│   ├── Outlier detection
│   └── Feature scaling
├── Automated Feature Engineering
│   ├── Feature generation
│   ├── Feature selection
│   └── Feature transformation
├── Model Selection (CASH)
│   └── Combined Algorithm Selection and Hyperparameter optimization
├── Hyperparameter Optimization (HPO)
│   ├── Grid Search
│   ├── Random Search
│   ├── Bayesian Optimization
│   └── Evolutionary Algorithms
├── Neural Architecture Search (NAS)
│   ├── Architecture search space
│   ├── Search strategy
│   └── Performance estimation
└── Model Ensembling
    └── Stacking, Blending
Hyperparameter Optimization (HPO)¶
Grid Search¶
Exhaustively evaluates every combination (inefficient):
from sklearn.model_selection import GridSearchCV
from xgboost import XGBClassifier

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 500]
}
grid_search = GridSearchCV(
    estimator=XGBClassifier(),
    param_grid=param_grid,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1
)
grid_search.fit(X_train, y_train)
print(f"Best params: {grid_search.best_params_}")
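The cost of Grid Search grows multiplicatively with each parameter: the grid above has 4 × 3 × 3 = 36 combinations, and with cv=5 each combination is fit 5 times, for 180 fits in total. A quick check:

```python
from itertools import product

param_grid = {
    'max_depth': [3, 5, 7, 10],
    'learning_rate': [0.01, 0.05, 0.1],
    'n_estimators': [100, 200, 500]
}

# Every combination is evaluated once per CV fold
n_combos = len(list(product(*param_grid.values())))
n_fits = n_combos * 5  # cv=5
print(n_combos, n_fits)  # 36 180
```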
Random Search¶
Samples randomly from parameter distributions (more efficient):
from sklearn.model_selection import RandomizedSearchCV
from scipy.stats import uniform, randint
from xgboost import XGBClassifier

param_dist = {
    'max_depth': randint(3, 15),
    'learning_rate': uniform(0.01, 0.3),
    'n_estimators': randint(100, 1000)
}
random_search = RandomizedSearchCV(
    estimator=XGBClassifier(),
    param_distributions=param_dist,
    n_iter=100,
    cv=5,
    scoring='roc_auc',
    n_jobs=-1,
    random_state=42
)
random_search.fit(X_train, y_train)
Bayesian Optimization¶
Efficient search that uses the results of previous evaluations to decide what to try next:
\[\text{surrogate model: } f(x) \sim GP(\mu(x), k(x, x'))\]
Acquisition functions: Expected Improvement (EI), Upper Confidence Bound (UCB), etc.
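Under a GP surrogate, Expected Improvement has a closed form. For maximization, EI(x) = (μ(x) − f⁺ − ξ)Φ(z) + σ(x)φ(z), where z = (μ(x) − f⁺ − ξ)/σ(x) and f⁺ is the best value observed so far. A toy 1-D sketch with scikit-learn's GaussianProcessRegressor; the objective function and candidate grid are illustrative:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor

# Toy objective (to maximize) and a few observed points
def f(x):
    return -(x - 2.0) ** 2

X_obs = np.array([[0.0], [1.0], [3.0], [4.0]])
y_obs = f(X_obs).ravel()

# GP surrogate fitted to the observations
gp = GaussianProcessRegressor(normalize_y=True).fit(X_obs, y_obs)

def expected_improvement(X, xi=0.01):
    mu, sigma = gp.predict(X, return_std=True)
    best = y_obs.max()
    z = (mu - best - xi) / np.maximum(sigma, 1e-9)
    return (mu - best - xi) * norm.cdf(z) + sigma * norm.pdf(z)

# Pick the next evaluation point by maximizing EI over a candidate grid
X_cand = np.linspace(0.0, 4.0, 401).reshape(-1, 1)
ei = expected_improvement(X_cand)
x_next = X_cand[np.argmax(ei), 0]
print(x_next)
```

EI is highest where the posterior mean is promising or the uncertainty is large, so the suggested point lands near the unexplored optimum around x = 2.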
import optuna
from sklearn.model_selection import cross_val_score
from xgboost import XGBClassifier

def objective(trial):
    params = {
        'max_depth': trial.suggest_int('max_depth', 3, 15),
        'learning_rate': trial.suggest_float('learning_rate', 0.01, 0.3, log=True),
        'n_estimators': trial.suggest_int('n_estimators', 100, 1000),
        'subsample': trial.suggest_float('subsample', 0.5, 1.0),
        'colsample_bytree': trial.suggest_float('colsample_bytree', 0.5, 1.0),
        'reg_alpha': trial.suggest_float('reg_alpha', 1e-8, 10.0, log=True),
        'reg_lambda': trial.suggest_float('reg_lambda', 1e-8, 10.0, log=True),
    }
    model = XGBClassifier(**params, random_state=42)
    scores = cross_val_score(model, X_train, y_train, cv=5, scoring='roc_auc')
    return scores.mean()

study = optuna.create_study(direction='maximize')
study.optimize(objective, n_trials=100, timeout=3600)

print(f"Best AUC: {study.best_value:.4f}")
print(f"Best params: {study.best_params}")
References:
- Bergstra, J. & Bengio, Y. (2012). "Random Search for Hyper-Parameter Optimization". JMLR.
- Snoek, J. et al. (2012). "Practical Bayesian Optimization of Machine Learning Algorithms". NeurIPS.
AutoML Frameworks¶
Auto-sklearn¶
Built on the scikit-learn ecosystem:
import autosklearn.classification

automl = autosklearn.classification.AutoSklearnClassifier(
    time_left_for_this_task=3600,  # 1 hour total budget
    per_run_time_limit=300,
    ensemble_size=50,
    n_jobs=-1
)
automl.fit(X_train, y_train)
predictions = automl.predict(X_test)

# Model information
print(automl.leaderboard())
print(automl.sprint_statistics())
H2O AutoML¶
import h2o
from h2o.automl import H2OAutoML

h2o.init()
train = h2o.H2OFrame(df_train)
test = h2o.H2OFrame(df_test)

aml = H2OAutoML(
    max_runtime_secs=3600,
    max_models=20,
    seed=42
)
aml.train(x=features, y=target, training_frame=train)

# Leaderboard
lb = aml.leaderboard
print(lb.head())

# Predict with the best model
predictions = aml.leader.predict(test)
FLAML (Microsoft)¶
A lightweight, fast AutoML library:
from flaml import AutoML

automl = AutoML()
automl.fit(
    X_train, y_train,
    task="classification",
    time_budget=600,  # 10 minutes
    metric="roc_auc",
    estimator_list=["lgbm", "xgboost", "rf", "extra_tree"]
)

print(f"Best model: {automl.best_estimator}")
print(f"Best loss: {automl.best_loss}")  # FLAML minimizes loss, i.e. 1 - roc_auc
predictions = automl.predict(X_test)
PyCaret¶
A low-code ML library:
from pycaret.classification import *

# Set up the environment
clf = setup(data=df, target='target', session_id=42)

# Compare all available models
best_model = compare_models()

# Tune the best model's hyperparameters
tuned_model = tune_model(best_model)

# Bagging ensemble (use a distinct name to avoid shadowing the ensemble_model function)
bagged_model = ensemble_model(tuned_model)

# Predict on new data
predictions = predict_model(bagged_model, data=test_df)
Neural Architecture Search (NAS)¶
Search Space¶
Search Space
├── Layer types (Conv, FC, RNN, Attention)
├── Layer connections (skip connections)
├── Hyperparameters (filters, kernel size)
└── Activation functions
Search Strategies¶
| Strategy | Description | Examples |
|---|---|---|
| Random Search | Randomly sample architectures | - |
| Reinforcement Learning | A controller network generates architectures | NASNet |
| Evolutionary | Evolutionary algorithms mutate and select architectures | AmoebaNet |
| One-shot / Weight Sharing | Train a supernet with shared weights | DARTS, ENAS |
| Differentiable | Continuous relaxation, then gradient-based optimization | DARTS |
# DARTS-style search space example (NNI Retiarii / PyTorch)
import nni.retiarii.nn.pytorch as nn
from nni.retiarii import model_wrapper

@model_wrapper
class SearchSpace(nn.Module):
    def __init__(self):
        super().__init__()
        self.conv1 = nn.LayerChoice([
            nn.Conv2d(3, 32, 3, padding=1),
            nn.Conv2d(3, 32, 5, padding=2),
            nn.Identity()
        ])
        # ...
References:
- Zoph, B. & Le, Q.V. (2017). "Neural Architecture Search with Reinforcement Learning". ICLR.
- Liu, H. et al. (2019). "DARTS: Differentiable Architecture Search". ICLR.
Practical Guide¶
Choosing an AutoML Tool¶
| Tool | Strengths | Best For |
|---|---|---|
| Auto-sklearn | scikit-learn compatible, built-in ensembling | Tabular data, research |
| H2O AutoML | Scalability, wide algorithm coverage | Large datasets, enterprise |
| FLAML | Fast, resource-efficient | Quick prototyping |
| PyCaret | Low-code, built-in visualization | Rapid experiments, education |
| Optuna | Flexible HPO | Custom pipelines |
Caveats¶
- Overfitting risk: always hold out a separate validation/test set
- Time/compute limits: set an appropriate search budget
- Interpretability: make sure the final model can be understood
- Domain knowledge: AutoML complements expertise, it does not replace it
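The overfitting caveat above is worth making concrete: report performance on a test set the search never touches, not the (optimistic) CV score used for tuning. A minimal sketch; the dataset and grid are illustrative:

```python
from sklearn.datasets import load_breast_cancer
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

X, y = load_breast_cancer(return_X_y=True)

# Hold out a test set that the hyperparameter search never sees
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# HPO runs cross-validation on the training portion only
search = GridSearchCV(
    RandomForestClassifier(random_state=42),
    {"max_depth": [3, 6], "n_estimators": [100, 200]},
    cv=5,
    scoring="roc_auc",
).fit(X_train, y_train)

# Report the untouched test set, not the CV score the search optimized
test_auc = roc_auc_score(y_test, search.predict_proba(X_test)[:, 1])
print(f"CV AUC: {search.best_score_:.3f}, Test AUC: {test_auc:.3f}")
```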
References¶
Key Papers¶
- Feurer, M. et al. (2015). "Efficient and Robust Automated Machine Learning". NeurIPS.
- Zoph, B. & Le, Q.V. (2017). "Neural Architecture Search with Reinforcement Learning". ICLR.
- Liu, H. et al. (2019). "DARTS: Differentiable Architecture Search". ICLR.
Frameworks¶
- Auto-sklearn: https://automl.github.io/auto-sklearn/
- H2O AutoML: https://docs.h2o.ai/h2o/latest-stable/h2o-docs/automl.html
- FLAML: https://microsoft.github.io/FLAML/
- Optuna: https://optuna.org/
- PyCaret: https://pycaret.org/