# Optimization Overview

Optimization is the mathematical discipline of finding parameters that minimize (or maximize) an objective function under given constraints. It is central to machine learning training, hyperparameter tuning, and business decision-making.
## Core Concepts

### Optimization Problem Definition
\[\min_{x \in \mathbb{R}^n} f(x)\]
\[\text{s.t. } g_i(x) \leq 0, \quad i = 1, \ldots, m\]
\[h_j(x) = 0, \quad j = 1, \ldots, p\]
| Component | Description |
|---|---|
| \(f(x)\) | Objective function |
| \(g_i(x)\) | Inequality constraints |
| \(h_j(x)\) | Equality constraints |
### Problem Classification

| Class | Key Property | Examples |
|---|---|---|
| Convex | Global optimum guaranteed | Linear, quadratic programming |
| Non-convex | May settle in local optima | Neural network training |
| Constrained | Subject to constraints | Portfolio optimization |
| Unconstrained | No constraints | Regression |
| Discrete | Integer variables | Combinatorial optimization |
| Continuous | Continuous variables | Gradient descent |
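The convex vs. non-convex distinction in the table can be probed numerically with the midpoint form of the convexity definition, \(f\!\left(\frac{x+y}{2}\right) \leq \frac{f(x)+f(y)}{2}\). The functions, sampling range, and trial count below are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_midpoint_convex(f, trials=1000, scale=10.0):
    # Randomly probes f((x+y)/2) <= (f(x)+f(y))/2; a single violation
    # certifies non-convexity (passing all trials is only evidence).
    for _ in range(trials):
        x, y = rng.uniform(-scale, scale, size=2)
        if f((x + y) / 2) > (f(x) + f(y)) / 2 + 1e-9:
            return False
    return True

square_convex = is_midpoint_convex(lambda x: x**2)  # convex: never violated
sin_convex = is_midpoint_convex(np.sin)             # violated on concave regions
```

Note that this test can only disprove convexity; a proof requires checking the definition (or the second-derivative condition) analytically.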
### Algorithm Taxonomy

```
Optimization Algorithms
├── Gradient-based (First-order)
│   ├── Gradient Descent
│   │   ├── Batch GD
│   │   ├── Stochastic GD (SGD)
│   │   └── Mini-batch GD
│   ├── Momentum Methods
│   │   ├── SGD with Momentum
│   │   └── Nesterov Accelerated Gradient
│   └── Adaptive Learning Rate
│       ├── AdaGrad
│       ├── RMSprop
│       ├── Adam, AdamW
│       └── LAMB, LARS
├── Second-order Methods
│   ├── Newton's Method
│   ├── Quasi-Newton (BFGS, L-BFGS)
│   └── Natural Gradient
├── Derivative-free
│   ├── Evolutionary Algorithms
│   │   ├── Genetic Algorithm
│   │   ├── Evolution Strategies
│   │   └── CMA-ES
│   ├── Particle Swarm Optimization
│   └── Simulated Annealing
├── Constrained Optimization
│   ├── Lagrange Multipliers
│   ├── KKT Conditions
│   ├── Interior Point Methods
│   └── Augmented Lagrangian
└── Discrete Optimization
    ├── Integer Programming
    ├── Branch and Bound
    └── Dynamic Programming
```
## Gradient Descent

### Basic Algorithm
\[\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)\]
| Variant | Characteristics |
|---|---|
| Batch GD | Full dataset per step; slow but stable |
| SGD | Single sample per step; fast but noisy |
| Mini-batch | Practical middle ground |
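The update rule \(\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)\) takes only a few lines of NumPy. The quadratic objective, step size, and iteration count below are illustrative:

```python
import numpy as np

def gradient_descent(grad_fn, x0, lr=0.1, steps=100):
    # Repeatedly applies theta <- theta - eta * grad(theta).
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad_fn(x)
    return x

# Gradient of f(x) = ||x - 3||^2 is 2(x - 3); minimum at x = (3, 3).
x_opt = gradient_descent(lambda x: 2 * (x - 3.0), x0=[0.0, 0.0])
```

With this learning rate the error contracts by a factor of 0.8 per step, so 100 steps suffice for high accuracy on this toy problem.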
### Momentum
\[v_t = \gamma v_{t-1} + \eta \nabla_\theta L(\theta_t)\]
\[\theta_{t+1} = \theta_t - v_t\]

Adding inertia damps oscillations and accelerates convergence.
### Adam (Adaptive Moment Estimation)

Adam maintains estimates of the first moment (mean) and second moment (uncentered variance) of the gradient:

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t\]
\[v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\]

Bias correction:

\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]

Update:

\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\]
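The three stages above translate directly into NumPy; the learning rate, target, and iteration count in this sketch are illustrative, and the \(\beta\) values are Adam's usual defaults:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Moment estimates
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    # Bias correction (t starts at 1)
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    # Parameter update
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([5.0])
m = v = np.zeros_like(theta)
for t in range(1, 1001):
    g = 2 * (theta - 3.0)  # gradient of (theta - 3)^2
    theta, m, v = adam_step(theta, g, m, v, t)
```

Because \(\hat{m}_t / \sqrt{\hat{v}_t}\) normalizes the gradient scale, the effective step size stays close to `lr` regardless of the gradient magnitude, which is the practical appeal of the method.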
```python
import torch.optim as optim

# SGD with Momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Adam
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# AdamW (decoupled weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
```

Reference: Kingma, D.P. & Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ICLR.
## Learning-Rate Schedulers
```python
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau

# Step decay
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# One-cycle (super-convergence)
scheduler = OneCycleLR(optimizer, max_lr=0.01, total_steps=1000)

# Reduce on plateau
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=10)

# Training loop
for epoch in range(epochs):
    train()
    val_loss = validate()
    scheduler.step(val_loss)  # ReduceLROnPlateau takes the monitored metric
    # scheduler.step()        # the other schedulers take no argument
```
### Warmup

```python
import torch

def get_linear_warmup_scheduler(optimizer, warmup_steps, total_steps):
    # Linear warmup to the base LR, then linear decay to zero
    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```
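The multiplier this schedule produces can be checked without any framework: it climbs linearly from 0 to 1 over the warmup phase, then decays linearly back to 0. A standalone restatement of the same `lr_lambda` shape:

```python
def lr_lambda(step, warmup_steps=100, total_steps=1000):
    # Linear warmup, then linear decay, as in the scheduler above.
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Multipliers at the start, mid-warmup, peak, mid-decay, and end
factors = [lr_lambda(s) for s in (0, 50, 100, 550, 1000)]
```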
## Evolutionary Algorithms

### Genetic Algorithm
```python
import numpy as np

class GeneticAlgorithm:
    def __init__(self, fitness_func, n_genes, pop_size=100, mutation_rate=0.01):
        self.fitness_func = fitness_func
        self.n_genes = n_genes
        self.pop_size = pop_size
        self.mutation_rate = mutation_rate

    def initialize_population(self):
        return np.random.randn(self.pop_size, self.n_genes)

    def selection(self, population, fitness):
        # Tournament selection (size 2)
        idx = np.random.choice(len(population), size=2)
        return population[idx[np.argmax(fitness[idx])]]

    def crossover(self, parent1, parent2):
        # Single-point crossover
        point = np.random.randint(0, self.n_genes)
        child = np.concatenate([parent1[:point], parent2[point:]])
        return child

    def mutate(self, individual):
        mask = np.random.random(self.n_genes) < self.mutation_rate
        individual[mask] += np.random.randn(mask.sum())
        return individual

    def evolve(self, generations=100):
        population = self.initialize_population()
        for gen in range(generations):
            fitness = np.array([self.fitness_func(ind) for ind in population])
            new_population = []
            for _ in range(self.pop_size):
                parent1 = self.selection(population, fitness)
                parent2 = self.selection(population, fitness)
                child = self.crossover(parent1, parent2)
                child = self.mutate(child)
                new_population.append(child)
            population = np.array(new_population)
            if gen % 10 == 0:
                print(f"Gen {gen}: Best fitness = {fitness.max():.4f}")
        best_idx = np.argmax([self.fitness_func(ind) for ind in population])
        return population[best_idx]
```
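A compact end-to-end run of the same select/crossover/mutate loop, with elitism added so the best individual is never lost. The toy fitness function, population size, and mutation parameters are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, pop_size = 5, 60

def fitness(ind):
    # Toy fitness: maximal (= 0) when every gene equals 3
    return -np.sum((ind - 3.0) ** 2)

pop = rng.standard_normal((pop_size, n_genes))
for gen in range(300):
    fit = np.array([fitness(ind) for ind in pop])
    elite = pop[np.argmax(fit)].copy()      # elitism: carry over the best as-is
    children = [elite]
    while len(children) < pop_size:
        # Tournament selection (size 2) for each parent
        i = rng.choice(pop_size, size=2)
        j = rng.choice(pop_size, size=2)
        p1 = pop[i[np.argmax(fit[i])]]
        p2 = pop[j[np.argmax(fit[j])]]
        # Single-point crossover
        point = rng.integers(0, n_genes)
        child = np.concatenate([p1[:point], p2[point:]])
        # Gaussian mutation on ~20% of genes
        mask = rng.random(n_genes) < 0.2
        child[mask] += 0.2 * rng.standard_normal(mask.sum())
        children.append(child)
    pop = np.array(children)

best = pop[np.argmax([fitness(ind) for ind in pop])]
```

Elitism guarantees the best fitness never degrades between generations, which is often worth the small loss of diversity on smooth problems like this one.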
### CMA-ES

Covariance Matrix Adaptation Evolution Strategy:

```python
import cma

def objective(x):
    return sum((x - 3) ** 2)

es = cma.CMAEvolutionStrategy([0] * 10, 0.5)  # initial mean, initial sigma
es.optimize(objective)
print(f"Best solution: {es.result.xbest}")
```
## Constrained Optimization

### Lagrange Multipliers
\[\mathcal{L}(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)\]
### KKT Conditions
- Stationarity: \(\nabla f(x^*) + \sum_i \lambda_i \nabla g_i(x^*) = 0\)
- Primal feasibility: \(g_i(x^*) \leq 0\)
- Dual feasibility: \(\lambda_i \geq 0\)
- Complementary slackness: \(\lambda_i g_i(x^*) = 0\)
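A one-dimensional worked example makes the conditions concrete: minimize \(f(x) = x^2\) subject to \(g(x) = 1 - x \leq 0\) (i.e. \(x \geq 1\)).

\[\mathcal{L}(x, \lambda) = x^2 + \lambda(1 - x)\]

Stationarity gives \(2x - \lambda = 0\). If \(\lambda = 0\), then \(x = 0\), which violates primal feasibility (\(x \geq 1\)); so complementary slackness forces the constraint to be active. Then \(x^* = 1\) and \(\lambda^* = 2 \geq 0\), and all four KKT conditions hold at the constrained minimum.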
### Optimization with SciPy

```python
from scipy.optimize import minimize

def objective(x):
    return (x[0] - 1)**2 + (x[1] - 2.5)**2

def constraint1(x):
    return x[0] - 2 * x[1] + 2   # >= 0

def constraint2(x):
    return -x[0] - 2 * x[1] + 6  # >= 0

constraints = [
    {'type': 'ineq', 'fun': constraint1},
    {'type': 'ineq', 'fun': constraint2},
]
bounds = [(0, None), (0, None)]

result = minimize(objective, x0=[0, 0], method='SLSQP',
                  bounds=bounds, constraints=constraints)
print(f"Optimal: {result.x}, Value: {result.fun}")
```
## Combinatorial Optimization

### OR-Tools Example (TSP)
```python
from ortools.constraint_solver import routing_enums_pb2
from ortools.constraint_solver import pywrapcp

def create_data_model():
    data = {}
    data['distance_matrix'] = [
        [0, 2451, 713, 1018],
        [2451, 0, 1745, 1524],
        [713, 1745, 0, 355],
        [1018, 1524, 355, 0],
    ]
    data['num_vehicles'] = 1
    data['depot'] = 0
    return data

def solve_tsp():
    data = create_data_model()
    manager = pywrapcp.RoutingIndexManager(
        len(data['distance_matrix']), data['num_vehicles'], data['depot'])
    routing = pywrapcp.RoutingModel(manager)

    def distance_callback(from_index, to_index):
        from_node = manager.IndexToNode(from_index)
        to_node = manager.IndexToNode(to_index)
        return data['distance_matrix'][from_node][to_node]

    transit_callback_index = routing.RegisterTransitCallback(distance_callback)
    routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)

    search_parameters = pywrapcp.DefaultRoutingSearchParameters()
    search_parameters.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)

    solution = routing.SolveWithParameters(search_parameters)
    return solution
```
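For an instance this small, the optimal tour can also be brute-forced in pure Python, which makes a handy sanity check on the solver's answer. The distance matrix is the one from `create_data_model`:

```python
from itertools import permutations

dist = [
    [0, 2451, 713, 1018],
    [2451, 0, 1745, 1524],
    [713, 1745, 0, 355],
    [1018, 1524, 355, 0],
]

def tour_cost(order):
    # Cost of the closed tour depot -> order... -> depot
    route = [0, *order, 0]
    return sum(dist[a][b] for a, b in zip(route, route[1:]))

best_order = min(permutations([1, 2, 3]), key=tour_cost)
best_cost = tour_cost(best_order)  # 5000 for this matrix
```

Brute force is \(O(n!)\) and collapses immediately as cities are added, which is exactly why routing solvers like OR-Tools rely on construction heuristics and local search instead.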
## References

### Textbooks

- Boyd, S. & Vandenberghe, L. (2004). "Convex Optimization". Cambridge University Press. (free online)
- Nocedal, J. & Wright, S. (2006). "Numerical Optimization". Springer.
### Key Papers

- Kingma, D.P. & Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ICLR.
- Loshchilov, I. & Hutter, F. (2019). "Decoupled Weight Decay Regularization". ICLR.
### Libraries
- SciPy optimize: https://docs.scipy.org/doc/scipy/reference/optimize.html
- Google OR-Tools: https://developers.google.com/optimization
- CMA-ES: https://github.com/CMA-ES/pycma