# Optimization Overview

Optimization is the mathematical discipline of finding parameters that minimize (or maximize) an objective function under given constraints. It is central to machine learning training, hyperparameter tuning, and business decision-making.
## Core Concepts

### Optimization Problem Definition
\[\min_{x \in \mathbb{R}^n} f(x)\]
\[\text{s.t. } g_i(x) \leq 0, \quad i = 1, \ldots, m\]
\[h_j(x) = 0, \quad j = 1, \ldots, p\]
| Component | Description |
|---|---|
| \(f(x)\) | Objective function |
| \(g_i(x)\) | Inequality constraints |
| \(h_j(x)\) | Equality constraints |
### Problem Classification

| Class | Key Property | Examples |
|---|---|---|
| Convex | Global optimum guaranteed | Linear, quadratic programming |
| Non-convex | May settle in local optima | Neural network training |
| Constrained | Subject to constraints | Portfolio optimization |
| Unconstrained | No constraints | Regression |
| Discrete | Integer variables | Combinatorial optimization |
| Continuous | Continuous variables | Gradient descent |
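The convex vs. non-convex distinction in the table can be probed numerically with the midpoint form of the convexity definition, \(f\!\left(\frac{x+y}{2}\right) \leq \frac{f(x)+f(y)}{2}\). The functions, sampling range, and trial count below are illustrative choices, not from the text:

```python
import numpy as np

rng = np.random.default_rng(0)

def is_midpoint_convex(f, trials=1000, scale=10.0):
    # Randomly probes f((x+y)/2) <= (f(x)+f(y))/2; a single violation
    # certifies non-convexity (passing all trials is only evidence).
    for _ in range(trials):
        x, y = rng.uniform(-scale, scale, size=2)
        if f((x + y) / 2) > (f(x) + f(y)) / 2 + 1e-9:
            return False
    return True

square_convex = is_midpoint_convex(lambda x: x**2)  # convex: never violated
sin_convex = is_midpoint_convex(np.sin)             # violated on concave regions
```

Note that this test can only disprove convexity; a proof requires checking the definition (or the second-derivative condition) analytically.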
### Algorithm Taxonomy

```
Optimization Algorithms
├── Gradient-based (First-order)
│   ├── Gradient Descent
│   │   ├── Batch GD
│   │   ├── Stochastic GD (SGD)
│   │   └── Mini-batch GD
│   ├── Momentum Methods
│   │   ├── SGD with Momentum
│   │   └── Nesterov Accelerated Gradient
│   └── Adaptive Learning Rate
│       ├── AdaGrad
│       ├── RMSprop
│       ├── Adam, AdamW
│       └── LAMB, LARS
├── Second-order Methods
│   ├── Newton's Method
│   ├── Quasi-Newton (BFGS, L-BFGS)
│   └── Natural Gradient
├── Derivative-free
│   ├── Evolutionary Algorithms
│   │   ├── Genetic Algorithm
│   │   ├── Evolution Strategies
│   │   └── CMA-ES
│   ├── Particle Swarm Optimization
│   └── Simulated Annealing
├── Constrained Optimization
│   ├── Lagrange Multipliers
│   ├── KKT Conditions
│   ├── Interior Point Methods
│   └── Augmented Lagrangian
└── Discrete Optimization
    ├── Integer Programming
    ├── Branch and Bound
    └── Dynamic Programming
```
## Gradient Descent

### Basic Algorithm
\[\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)\]
| Variant | Characteristics |
|---|---|
| Batch GD | Full dataset per step; slow but stable |
| SGD | Single sample per step; fast but noisy |
| Mini-batch | Practical middle ground |
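The update rule \(\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)\) takes only a few lines of NumPy. The quadratic objective, step size, and iteration count below are illustrative:

```python
import numpy as np

def gradient_descent(grad_fn, x0, lr=0.1, steps=100):
    # Repeatedly applies theta <- theta - eta * grad(theta).
    x = np.asarray(x0, dtype=float)
    for _ in range(steps):
        x = x - lr * grad_fn(x)
    return x

# Gradient of f(x) = ||x - 3||^2 is 2(x - 3); minimum at x = (3, 3).
x_opt = gradient_descent(lambda x: 2 * (x - 3.0), x0=[0.0, 0.0])
```

With this learning rate the error contracts by a factor of 0.8 per step, so 100 steps suffice for high accuracy on this toy problem.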
### Momentum
\[v_t = \gamma v_{t-1} + \eta \nabla_\theta L(\theta_t)\]
\[\theta_{t+1} = \theta_t - v_t\]

Adding inertia damps oscillations and accelerates convergence.
### Adam (Adaptive Moment Estimation)

Adam maintains estimates of the first moment (mean) and second moment (uncentered variance) of the gradient:

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t\]
\[v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\]

Bias correction:

\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]

Update:

\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\]
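The three stages above translate directly into NumPy; the learning rate, target, and iteration count in this sketch are illustrative, and the \(\beta\) values are Adam's usual defaults:

```python
import numpy as np

def adam_step(theta, g, m, v, t, lr=0.01, b1=0.9, b2=0.999, eps=1e-8):
    # Moment estimates
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g**2
    # Bias correction (t starts at 1)
    m_hat = m / (1 - b1**t)
    v_hat = v / (1 - b2**t)
    # Parameter update
    theta = theta - lr * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v

theta = np.array([5.0])
m = v = np.zeros_like(theta)
for t in range(1, 1001):
    g = 2 * (theta - 3.0)  # gradient of (theta - 3)^2
    theta, m, v = adam_step(theta, g, m, v, t)
```

Because \(\hat{m}_t / \sqrt{\hat{v}_t}\) normalizes the gradient scale, the effective step size stays close to `lr` regardless of the gradient magnitude, which is the practical appeal of the method.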
```python
import torch.optim as optim

# SGD with Momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Adam
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# AdamW (decoupled weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
```

Reference: Kingma, D.P. & Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ICLR.
## Learning-Rate Schedulers
```python
from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau

# Step decay
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# One-cycle (super-convergence)
scheduler = OneCycleLR(optimizer, max_lr=0.01, total_steps=1000)

# Reduce on plateau
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=10)

# Training loop
for epoch in range(epochs):
    train()
    val_loss = validate()
    scheduler.step(val_loss)  # ReduceLROnPlateau takes the monitored metric
    # scheduler.step()        # the other schedulers take no argument
```
### Warmup

```python
import torch

def get_linear_warmup_scheduler(optimizer, warmup_steps, total_steps):
    # Linear warmup to the base LR, then linear decay to zero
    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
```
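The multiplier this schedule produces can be checked without any framework: it climbs linearly from 0 to 1 over the warmup phase, then decays linearly back to 0. A standalone restatement of the same `lr_lambda` shape:

```python
def lr_lambda(step, warmup_steps=100, total_steps=1000):
    # Linear warmup, then linear decay, as in the scheduler above.
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

# Multipliers at the start, mid-warmup, peak, mid-decay, and end
factors = [lr_lambda(s) for s in (0, 50, 100, 550, 1000)]
```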
## Evolutionary Algorithms

### Genetic Algorithm
```python
import numpy as np

class GeneticAlgorithm:
    def __init__(self, fitness_func, n_genes, pop_size=100, mutation_rate=0.01):
        self.fitness_func = fitness_func
        self.n_genes = n_genes
        self.pop_size = pop_size
        self.mutation_rate = mutation_rate

    def initialize_population(self):
        return np.random.randn(self.pop_size, self.n_genes)

    def selection(self, population, fitness):
        # Tournament selection (size 2)
        idx = np.random.choice(len(population), size=2)
        return population[idx[np.argmax(fitness[idx])]]

    def crossover(self, parent1, parent2):
        # Single-point crossover
        point = np.random.randint(0, self.n_genes)
        child = np.concatenate([parent1[:point], parent2[point:]])
        return child

    def mutate(self, individual):
        mask = np.random.random(self.n_genes) < self.mutation_rate
        individual[mask] += np.random.randn(mask.sum())
        return individual

    def evolve(self, generations=100):
        population = self.initialize_population()
        for gen in range(generations):
            fitness = np.array([self.fitness_func(ind) for ind in population])
            new_population = []
            for _ in range(self.pop_size):
                parent1 = self.selection(population, fitness)
                parent2 = self.selection(population, fitness)
                child = self.crossover(parent1, parent2)
                child = self.mutate(child)
                new_population.append(child)
            population = np.array(new_population)
            if gen % 10 == 0:
                print(f"Gen {gen}: Best fitness = {fitness.max():.4f}")
        best_idx = np.argmax([self.fitness_func(ind) for ind in population])
        return population[best_idx]
```
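A compact end-to-end run of the same select/crossover/mutate loop, with elitism added so the best individual is never lost. The toy fitness function, population size, and mutation parameters are illustrative choices:

```python
import numpy as np

rng = np.random.default_rng(0)
n_genes, pop_size = 5, 60

def fitness(ind):
    # Toy fitness: maximal (= 0) when every gene equals 3
    return -np.sum((ind - 3.0) ** 2)

pop = rng.standard_normal((pop_size, n_genes))
for gen in range(300):
    fit = np.array([fitness(ind) for ind in pop])
    elite = pop[np.argmax(fit)].copy()      # elitism: carry over the best as-is
    children = [elite]
    while len(children) < pop_size:
        # Tournament selection (size 2) for each parent
        i = rng.choice(pop_size, size=2)
        j = rng.choice(pop_size, size=2)
        p1 = pop[i[np.argmax(fit[i])]]
        p2 = pop[j[np.argmax(fit[j])]]
        # Single-point crossover
        point = rng.integers(0, n_genes)
        child = np.concatenate([p1[:point], p2[point:]])
        # Gaussian mutation on ~20% of genes
        mask = rng.random(n_genes) < 0.2
        child[mask] += 0.2 * rng.standard_normal(mask.sum())
        children.append(child)
    pop = np.array(children)

best = pop[np.argmax([fitness(ind) for ind in pop])]
```

Elitism guarantees the best fitness never degrades between generations, which is often worth the small loss of diversity on smooth problems like this one.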
### CMA-ES

Covariance Matrix Adaptation Evolution Strategy:

```python
import cma

def objective(x):
    return sum((x - 3) ** 2)

es = cma.CMAEvolutionStrategy([0] * 10, 0.5)  # initial mean, initial sigma
es.optimize(objective)
print(f"Best solution: {es.result.xbest}")
```
## Constrained Optimization

### Lagrange Multipliers
\[\mathcal{L}(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)\]
### KKT Conditions
- Stationarity: \(\nabla f(x^*) + \sum_i \lambda_i \nabla g_i(x^*) = 0\)
- Primal feasibility: \(g_i(x^*) \leq 0\)
- Dual feasibility: \(\lambda_i \geq 0\)
- Complementary slackness: \(\lambda_i g_i(x^*) = 0\)
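A one-dimensional worked example makes the conditions concrete: minimize \(f(x) = x^2\) subject to \(g(x) = 1 - x \leq 0\) (i.e. \(x \geq 1\)).

\[\mathcal{L}(x, \lambda) = x^2 + \lambda(1 - x)\]

Stationarity gives \(2x - \lambda = 0\). If \(\lambda = 0\), then \(x = 0\), which violates primal feasibility (\(x \geq 1\)); so complementary slackness forces the constraint to be active. Then \(x^* = 1\) and \(\lambda^* = 2 \geq 0\), and all four KKT conditions hold at the constrained minimum.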
### Optimization with SciPy

```python
from scipy.optimize import minimize

def objective(x):
    return (x[0] - 1)**2 + (x[1] - 2.5)**2

def constraint1(x):
    return x[0] - 2 * x[1] + 2   # >= 0

def constraint2(x):
    return -x[0] - 2 * x[1] + 6  # >= 0

constraints = [
    {'type': 'ineq', 'fun': constraint1},
    {'type': 'ineq', 'fun': constraint2},
]
bounds = [(0, None), (0, None)]

result = minimize(objective, x0=[0, 0], method='SLSQP',
                  bounds=bounds, constraints=constraints)
print(f"Optimal: {result.x}, Value: {result.fun}")
```
## Combinatorial Optimization

### OR-Tools Example (TSP)
```python
from ortools.constraint_solver import routing_enums_pb2
from ortools.constraint_solver import pywrapcp

def create_data_model():
    data = {}
    data['distance_matrix'] = [
        [0, 2451, 713, 1018],
        [2451, 0, 1745, 1524],
        [713, 1745, 0, 355],
        [1018, 1524, 355, 0],
    ]
    data['num_vehicles'] = 1
    data['depot'] = 0
    return data

def solve_tsp():
    data = create_data_model()
    manager = pywrapcp.RoutingIndexManager(
        len(data['distance_matrix']), data['num_vehicles'], data['depot'])
    routing = pywrapcp.RoutingModel(manager)

    def distance_callback(from_index, to_index):
        from_node = manager.IndexToNode(from_index)
        to_node = manager.IndexToNode(to_index)
        return data['distance_matrix'][from_node][to_node]

    transit_callback_index = routing.RegisterTransitCallback(distance_callback)
    routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)

    search_parameters = pywrapcp.DefaultRoutingSearchParameters()
    search_parameters.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)

    solution = routing.SolveWithParameters(search_parameters)
    return solution
```
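For an instance this small, the optimal tour can also be brute-forced in pure Python, which makes a handy sanity check on the solver's answer. The distance matrix is the one from `create_data_model`:

```python
from itertools import permutations

dist = [
    [0, 2451, 713, 1018],
    [2451, 0, 1745, 1524],
    [713, 1745, 0, 355],
    [1018, 1524, 355, 0],
]

def tour_cost(order):
    # Cost of the closed tour depot -> order... -> depot
    route = [0, *order, 0]
    return sum(dist[a][b] for a, b in zip(route, route[1:]))

best_order = min(permutations([1, 2, 3]), key=tour_cost)
best_cost = tour_cost(best_order)  # 5000 for this matrix
```

Brute force is \(O(n!)\) and collapses immediately as cities are added, which is exactly why routing solvers like OR-Tools rely on construction heuristics and local search instead.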
## References

### Textbooks

- Boyd, S. & Vandenberghe, L. (2004). "Convex Optimization". Cambridge University Press. (free online)
- Nocedal, J. & Wright, S. (2006). "Numerical Optimization". Springer.
### Key Papers

- Kingma, D.P. & Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ICLR.
- Loshchilov, I. & Hutter, F. (2019). "Decoupled Weight Decay Regularization". ICLR.
### Libraries
- SciPy optimize: https://docs.scipy.org/doc/scipy/reference/optimize.html
- Google OR-Tools: https://developers.google.com/optimization
- CMA-ES: https://github.com/CMA-ES/pycma