
Optimization Overview

Optimization is the mathematical discipline of finding parameters that minimize (or maximize) an objective function under given constraints. It is central to machine-learning training, hyperparameter tuning, and business decision-making.


Core Concepts

Defining an Optimization Problem

\[\min_{x \in \mathbb{R}^n} f(x)\]
\[\text{s.t. } g_i(x) \leq 0, \quad i = 1, \dots, m\]
\[h_j(x) = 0, \quad j = 1, \dots, p\]
| Component | Description |
|---|---|
| \(f(x)\) | Objective function |
| \(g_i(x)\) | Inequality constraints |
| \(h_j(x)\) | Equality constraints |

Problem Classes

| Class | Characteristics | Examples |
|---|---|---|
| Convex | Global optimum guaranteed | Linear, Quadratic Programming |
| Non-convex | Local optima possible | Neural network training |
| Constrained | Subject to constraints | Portfolio optimization |
| Unconstrained | No constraints | Regression |
| Discrete | Integer variables | Combinatorial optimization |
| Continuous | Continuous variables | Gradient descent |
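The convex/non-convex distinction has practical consequences: on a non-convex function, where gradient descent ends up depends on where it starts. A minimal pure-Python sketch on the double-well \(f(x) = (x^2 - 1)^2\) (the function, step size, and iteration count are illustrative):

```python
def grad_descent(x, lr=0.05, steps=200):
    """Plain gradient descent on the non-convex f(x) = (x^2 - 1)^2."""
    for _ in range(steps):
        grad = 4 * x * (x * x - 1)  # f'(x)
        x -= lr * grad
    return x

# Two starting points converge to two different local minima (x = +1 and x = -1).
right = grad_descent(0.5)
left = grad_descent(-0.5)
```

A convex objective such as \(f(x) = x^2\) would send every starting point to the same global minimum.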

Algorithm Taxonomy

Optimization Algorithms
├── Gradient-based (First-order)
│   ├── Gradient Descent
│   │   ├── Batch GD
│   │   ├── Stochastic GD (SGD)
│   │   └── Mini-batch GD
│   ├── Momentum Methods
│   │   ├── SGD with Momentum
│   │   └── Nesterov Accelerated Gradient
│   └── Adaptive Learning Rate
│       ├── AdaGrad
│       ├── RMSprop
│       ├── Adam, AdamW
│       └── LAMB, LARS
├── Second-order Methods
│   ├── Newton's Method
│   ├── Quasi-Newton (BFGS, L-BFGS)
│   └── Natural Gradient
├── Derivative-free
│   ├── Evolutionary Algorithms
│   │   ├── Genetic Algorithm
│   │   ├── Evolution Strategies
│   │   └── CMA-ES
│   ├── Particle Swarm Optimization
│   └── Simulated Annealing
├── Constrained Optimization
│   ├── Lagrange Multipliers
│   ├── KKT Conditions
│   ├── Interior Point Methods
│   └── Augmented Lagrangian
└── Discrete Optimization
    ├── Integer Programming
    ├── Branch and Bound
    └── Dynamic Programming

Gradient Descent

Basic Algorithm

\[\theta_{t+1} = \theta_t - \eta \nabla_\theta L(\theta_t)\]
| Variant | Characteristics |
|---|---|
| Batch GD | Full dataset per step; slow but stable |
| SGD | Single sample per step; fast but noisy |
| Mini-batch | Compromise; the practical default |
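The three variants differ only in how many samples enter each gradient estimate. A minimal mini-batch SGD sketch for linear least squares (the data, batch size, and learning rate are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
w_true = np.array([2.0, -1.0])
y = X @ w_true + 0.01 * rng.normal(size=200)

w = np.zeros(2)
lr, batch_size = 0.1, 32
for epoch in range(100):
    perm = rng.permutation(len(X))                   # reshuffle each epoch
    for i in range(0, len(X), batch_size):
        idx = perm[i:i + batch_size]
        Xb, yb = X[idx], y[idx]
        grad = 2 * Xb.T @ (Xb @ w - yb) / len(idx)   # MSE gradient on the batch
        w -= lr * grad
```

Setting `batch_size` to 1 recovers SGD; setting it to `len(X)` recovers batch GD.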

Momentum

\[v_t = \gamma v_{t-1} + \eta \nabla_\theta L(\theta_t)\]
\[\theta_{t+1} = \theta_t - v_t\]

Adding inertia damps oscillations and accelerates convergence.
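On an ill-conditioned quadratic the update above is easy to verify; a pure-Python sketch (the objective, learning rate, and \(\gamma\) are illustrative):

```python
def momentum_gd(theta, lr=0.01, gamma=0.9, steps=500):
    """Heavy-ball momentum on f(t1, t2) = t1^2 + 10 * t2^2."""
    v = [0.0, 0.0]
    for _ in range(steps):
        grad = [2 * theta[0], 20 * theta[1]]              # gradient of f
        v = [gamma * v[i] + lr * grad[i] for i in range(2)]
        theta = [theta[i] - v[i] for i in range(2)]
    return theta

theta = momentum_gd([1.0, 1.0])  # converges to the minimum at the origin
```

Setting `gamma=0` recovers plain gradient descent.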

Adam (Adaptive Moment Estimation)

Estimates of the first moment (mean) and second moment (uncentered variance) of the gradient:

\[m_t = \beta_1 m_{t-1} + (1 - \beta_1) g_t\]
\[v_t = \beta_2 v_{t-1} + (1 - \beta_2) g_t^2\]

Bias correction:

\[\hat{m}_t = \frac{m_t}{1 - \beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1 - \beta_2^t}\]

Parameter update:

\[\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t\]
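The three steps above translate directly into code; a scalar sketch minimizing \((\theta - 3)^2\) (hyperparameters are the usual defaults except the learning rate, raised here for a toy problem):

```python
def adam(grad_fn, theta=0.0, lr=0.01, beta1=0.9, beta2=0.999, eps=1e-8, steps=2000):
    m, v = 0.0, 0.0
    for t in range(1, steps + 1):
        g = grad_fn(theta)
        m = beta1 * m + (1 - beta1) * g          # first-moment estimate
        v = beta2 * v + (1 - beta2) * g * g      # second-moment estimate
        m_hat = m / (1 - beta1 ** t)             # bias correction
        v_hat = v / (1 - beta2 ** t)
        theta -= lr * m_hat / (v_hat ** 0.5 + eps)
    return theta

theta = adam(lambda t: 2 * (t - 3))  # gradient of (theta - 3)^2
```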
import torch.optim as optim

# model is assumed to be a torch.nn.Module defined elsewhere

# SGD with Momentum
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, weight_decay=1e-4)

# Adam
optimizer = optim.Adam(model.parameters(), lr=0.001, betas=(0.9, 0.999))

# AdamW (decoupled weight decay)
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

Reference: Kingma, D.P. & Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ICLR.


Learning-Rate Schedulers

from torch.optim.lr_scheduler import StepLR, CosineAnnealingLR, OneCycleLR, ReduceLROnPlateau

# Step Decay
scheduler = StepLR(optimizer, step_size=30, gamma=0.1)

# Cosine Annealing
scheduler = CosineAnnealingLR(optimizer, T_max=100, eta_min=1e-6)

# One Cycle (Super Convergence)
scheduler = OneCycleLR(optimizer, max_lr=0.01, total_steps=1000)

# Reduce on Plateau
scheduler = ReduceLROnPlateau(optimizer, mode='min', factor=0.5, patience=10)

# Training loop
for epoch in range(epochs):
    train()
    val_loss = validate()
    scheduler.step(val_loss)  # ReduceLROnPlateau needs the metric
    # the other schedulers take no argument: scheduler.step()

Warmup

import torch

def get_linear_warmup_scheduler(optimizer, warmup_steps, total_steps):
    def lr_lambda(step):
        if step < warmup_steps:
            return step / warmup_steps
        return max(0.0, (total_steps - step) / (total_steps - warmup_steps))
    return torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda)
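The multiplier this lambda produces can be checked without torch; a pure-Python sketch with illustrative `warmup_steps=100` and `total_steps=1000`:

```python
def lr_factor(step, warmup_steps=100, total_steps=1000):
    """Linear warmup to 1.0, then linear decay to 0.0 (same formula as above)."""
    if step < warmup_steps:
        return step / warmup_steps
    return max(0.0, (total_steps - step) / (total_steps - warmup_steps))

factors = [lr_factor(s) for s in (0, 50, 100, 550, 1000)]
# 0.0 at the start, 0.5 mid-warmup, 1.0 at the peak, back to 0.0 at the end
```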

Evolutionary Algorithms

Genetic Algorithm

import numpy as np

class GeneticAlgorithm:
    def __init__(self, fitness_func, n_genes, pop_size=100, mutation_rate=0.01):
        self.fitness_func = fitness_func
        self.n_genes = n_genes
        self.pop_size = pop_size
        self.mutation_rate = mutation_rate

    def initialize_population(self):
        return np.random.randn(self.pop_size, self.n_genes)

    def selection(self, population, fitness):
        # Tournament selection of size 2
        idx = np.random.choice(len(population), size=2)
        return population[idx[np.argmax(fitness[idx])]]

    def crossover(self, parent1, parent2):
        # Single-point crossover
        point = np.random.randint(0, self.n_genes)
        child = np.concatenate([parent1[:point], parent2[point:]])
        return child

    def mutate(self, individual):
        mask = np.random.random(self.n_genes) < self.mutation_rate
        individual[mask] += np.random.randn(mask.sum())
        return individual

    def evolve(self, generations=100):
        population = self.initialize_population()

        for gen in range(generations):
            fitness = np.array([self.fitness_func(ind) for ind in population])

            new_population = []
            for _ in range(self.pop_size):
                parent1 = self.selection(population, fitness)
                parent2 = self.selection(population, fitness)
                child = self.crossover(parent1, parent2)
                child = self.mutate(child)
                new_population.append(child)

            population = np.array(new_population)

            if gen % 10 == 0:
                print(f"Gen {gen}: Best fitness = {fitness.max():.4f}")

        best_idx = np.argmax([self.fitness_func(ind) for ind in population])
        return population[best_idx]

CMA-ES

Covariance Matrix Adaptation Evolution Strategy:

import cma

def objective(x):
    return sum((x - 3) ** 2)

es = cma.CMAEvolutionStrategy([0] * 10, 0.5)  # initial mean, initial step size (sigma)
es.optimize(objective)
print(f"Best solution: {es.result.xbest}")

Constrained Optimization

Lagrange Multipliers

\[\mathcal{L}(x, \lambda) = f(x) + \sum_i \lambda_i g_i(x)\]

KKT Conditions

  1. Stationarity: \(\nabla f(x^*) + \sum_i \lambda_i \nabla g_i(x^*) = 0\)
  2. Primal feasibility: \(g_i(x^*) \leq 0\)
  3. Dual feasibility: \(\lambda_i \geq 0\)
  4. Complementary slackness: \(\lambda_i g_i(x^*) = 0\)
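A one-constraint worked example: minimize \(f(x) = x^2\) subject to \(g(x) = 1 - x \leq 0\). The Lagrangian is

\[\mathcal{L}(x, \lambda) = x^2 + \lambda (1 - x)\]

Stationarity gives \(2x - \lambda = 0\), and complementary slackness gives \(\lambda(1 - x) = 0\). Taking \(\lambda = 0\) forces \(x = 0\), which violates primal feasibility \(1 - x \leq 0\), so the constraint must be active: \(x^* = 1\) and \(\lambda^* = 2 \geq 0\), satisfying all four conditions.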

Optimization with SciPy

from scipy.optimize import minimize

def objective(x):
    return (x[0] - 1)**2 + (x[1] - 2.5)**2

def constraint1(x):
    return x[0] - 2 * x[1] + 2  # >= 0

def constraint2(x):
    return -x[0] - 2 * x[1] + 6  # >= 0

constraints = [
    {'type': 'ineq', 'fun': constraint1},
    {'type': 'ineq', 'fun': constraint2},
]

bounds = [(0, None), (0, None)]

result = minimize(objective, x0=[0, 0], method='SLSQP', 
                  bounds=bounds, constraints=constraints)
print(f"Optimal: {result.x}, Value: {result.fun}")

Combinatorial Optimization

OR-Tools Example (TSP)

from ortools.constraint_solver import routing_enums_pb2
from ortools.constraint_solver import pywrapcp

def create_data_model():
    data = {}
    data['distance_matrix'] = [
        [0, 2451, 713, 1018],
        [2451, 0, 1745, 1524],
        [713, 1745, 0, 355],
        [1018, 1524, 355, 0],
    ]
    data['num_vehicles'] = 1
    data['depot'] = 0
    return data

def solve_tsp():
    data = create_data_model()
    manager = pywrapcp.RoutingIndexManager(
        len(data['distance_matrix']), data['num_vehicles'], data['depot'])
    routing = pywrapcp.RoutingModel(manager)

    def distance_callback(from_index, to_index):
        from_node = manager.IndexToNode(from_index)
        to_node = manager.IndexToNode(to_index)
        return data['distance_matrix'][from_node][to_node]

    transit_callback_index = routing.RegisterTransitCallback(distance_callback)
    routing.SetArcCostEvaluatorOfAllVehicles(transit_callback_index)

    search_parameters = pywrapcp.DefaultRoutingSearchParameters()
    search_parameters.first_solution_strategy = (
        routing_enums_pb2.FirstSolutionStrategy.PATH_CHEAPEST_ARC)

    solution = routing.SolveWithParameters(search_parameters)
    if solution:
        # Walk the solution from the depot and collect the visiting order
        index = routing.Start(0)
        route = []
        while not routing.IsEnd(index):
            route.append(manager.IndexToNode(index))
            index = solution.Value(routing.NextVar(index))
        route.append(manager.IndexToNode(index))
        print(f"Route: {route}, Distance: {solution.ObjectiveValue()}")
    return solution

References

Textbooks

  • Boyd, S. & Vandenberghe, L. (2004). "Convex Optimization". Cambridge University Press. (free online)
  • Nocedal, J. & Wright, S. (2006). "Numerical Optimization". Springer.

Key Papers

  • Kingma, D.P. & Ba, J. (2015). "Adam: A Method for Stochastic Optimization". ICLR.
  • Loshchilov, I. & Hutter, F. (2017). "Decoupled Weight Decay Regularization". ICLR.

Libraries

  • SciPy optimize: https://docs.scipy.org/doc/scipy/reference/optimize.html
  • Google OR-Tools: https://developers.google.com/optimization
  • CMA-ES: https://github.com/CMA-ES/pycma