Gradient Descent

An optimization algorithm that iteratively updates parameters to minimize a loss function. It is at the core of training virtually every neural network.

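Basic Principle

Intuition

It is like walking down a mountain:
- Find the steepest direction at the current position (the gradient).
- Take one step downhill, opposite to the gradient (the update).
- Repeat until you can no longer descend (convergence).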

Mathematical Definition

\[w_{t+1} = w_t - \eta \nabla_w L(w_t)\]

Symbol	Meaning
$w$	Parameters (weights)
$\eta$	Learning rate
$\nabla_w L$	Gradient of the loss function
$t$	Iteration index

import numpy as np

def gradient_descent(X, y, learning_rate=0.01, n_iterations=1000):
    """Basic gradient descent (linear regression)."""
    n_samples, n_features = X.shape
    weights = np.zeros(n_features)
    bias = 0.0

    for _ in range(n_iterations):
        # Prediction
        y_pred = X @ weights + bias

        # Loss (MSE); computed for monitoring only, not used in the update
        loss = np.mean((y - y_pred) ** 2)

        # Gradients of the MSE with respect to the weights and the bias
        dw = -(2 / n_samples) * X.T @ (y - y_pred)
        db = -(2 / n_samples) * np.sum(y - y_pred)

        # Parameter update
        weights -= learning_rate * dw
        bias -= learning_rate * db

    return weights, bias

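Variants

Batch Gradient Descent (Batch GD)

Computes the gradient over the entire dataset at every step.

Pros: stable convergence, exact gradient
Cons: slow per update; the full dataset may not fit in memory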

def batch_gradient_descent(X, y, lr=0.01, epochs=100):
    w = np.zeros(X.shape[1])

    for _ in range(epochs):
        gradient = compute_gradient(X, y, w)  # full dataset
        w -= lr * gradient

    return w
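
The variant functions in this section call `compute_gradient`, which the text does not define. A minimal sketch, assuming a bias-free linear model with MSE loss (the same gradient as `dw` in the basic example), might look like the following. The helper body, the synthetic data, and the step count are illustrative assumptions.

```python
import numpy as np

def compute_gradient(X, y, w):
    """Gradient of the MSE loss for a bias-free linear model (assumed helper)."""
    residual = y - X @ w
    return -(2 / len(y)) * X.T @ residual

# Tiny check on noiseless data generated from known weights.
rng = np.random.default_rng(0)
X = rng.normal(size=(100, 2))
true_w = np.array([1.5, -2.0])
y = X @ true_w

w = np.zeros(2)
for _ in range(500):
    w -= 0.1 * compute_gradient(X, y, w)  # plain batch gradient descent
```

Because the helper takes whatever rows it is given, the same function serves the batch, stochastic, and mini-batch variants; only the slice of `X` and `y` changes.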

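Stochastic Gradient Descent (Stochastic GD)

Computes the gradient from a single sample.

Pros: cheap, fast updates; the gradient noise can help escape local minima
Cons: noisy, unstable updates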

def stochastic_gradient_descent(X, y, lr=0.01, epochs=100):
    w = np.zeros(X.shape[1])
    n = len(y)

    for _ in range(epochs):
        for i in np.random.permutation(n):
            gradient = compute_gradient(X[i:i+1], y[i:i+1], w)  # single sample
            w -= lr * gradient

    return w

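Mini-batch Gradient Descent (Mini-batch GD)

Computes the gradient from a small batch. This is the most widely used variant.

Pros: combines the advantages of batch GD and SGD; GPU-efficient
Cons: the batch size needs tuning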

def minibatch_gradient_descent(X, y, lr=0.01, epochs=100, batch_size=32):
    w = np.zeros(X.shape[1])
    n = len(y)

    for _ in range(epochs):
        indices = np.random.permutation(n)

        for start in range(0, n, batch_size):
            end = min(start + batch_size, n)
            batch_idx = indices[start:end]

            gradient = compute_gradient(X[batch_idx], y[batch_idx], w)
            w -= lr * gradient

    return w

Effect of batch size:

Batch size	Gradient noise	Convergence speed	Generalization	GPU utilization
Small (32)	High	Slow	Good	Low
Medium (256)	Medium	Medium	Medium	Medium
Large (4096+)	Low	Fast	Can be worse	High

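Learning Rate

Effect of the Learning Rate

Learning rate too small: convergence is very slow
Learning rate appropriate: stable, fast convergence
Learning rate too large: divergence or oscillation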
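
The three regimes can be reproduced on a toy problem. The sketch below assumes the objective f(w) = w², chosen only for illustration, whose gradient is 2w. Each update then multiplies w by (1 - 2·lr), which makes the behavior easy to reason about.

```python
def run_gd(lr, w0=1.0, steps=20):
    """Gradient descent on f(w) = w**2, whose gradient is 2*w."""
    w = w0
    for _ in range(steps):
        w -= lr * 2 * w
    return w

w_small = run_gd(lr=0.01)  # factor 0.98 per step: still far from 0 after 20 steps
w_good = run_gd(lr=0.4)    # factor 0.2 per step: converges quickly
w_large = run_gd(lr=1.1)   # factor -1.2 per step: sign flips and magnitude grows
```

For this particular function any learning rate above 1.0 diverges; the threshold on a real loss surface depends on its curvature and has to be found empirically.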

Learning Rate Scheduling

# 1. Step Decay
def step_decay(epoch, initial_lr=0.01, drop=0.5, epochs_drop=10):
    return initial_lr * (drop ** (epoch // epochs_drop))

# 2. Exponential Decay
def exponential_decay(epoch, initial_lr=0.01, decay_rate=0.95):
    return initial_lr * (decay_rate ** epoch)

# 3. Cosine Annealing
def cosine_annealing(epoch, initial_lr=0.01, T_max=100):
    return initial_lr * (1 + np.cos(np.pi * epoch / T_max)) / 2

# 4. Warmup + Decay
def warmup_decay(epoch, initial_lr=0.01, warmup_epochs=5):
    if epoch < warmup_epochs:
        # (epoch + 1) avoids a learning rate of exactly 0 in the first epoch
        return initial_lr * (epoch + 1) / warmup_epochs
    else:
        return initial_lr * (0.95 ** (epoch - warmup_epochs))

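# PyTorch learning-rate schedulers
# (model, train_loader, num_epochs, and compute_loss are assumed to be defined elsewhere)
import torch.optim as optim

optimizer = optim.Adam(model.parameters(), lr=0.001)

# Pick one scheduler; each assignment below replaces the previous one.
# StepLR
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=10, gamma=0.1)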

# CosineAnnealingLR
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)

# OneCycleLR (recommended)
scheduler = optim.lr_scheduler.OneCycleLR(
    optimizer, 
    max_lr=0.01,
    total_steps=num_epochs * len(train_loader)
)

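# Training loop
for epoch in range(num_epochs):
    for batch in train_loader:
        optimizer.zero_grad()
        loss = compute_loss(batch)
        loss.backward()
        optimizer.step()
        # OneCycleLR steps once per batch (total_steps counts batches).
        # StepLR and CosineAnnealingLR as configured above would step once per epoch instead.
        scheduler.step()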

Momentum

Remembers the direction of previous gradients, which damps oscillation and accelerates convergence.

\[
\begin{aligned}
v_t &= \gamma v_{t-1} + \eta \nabla_w L(w_t) \\
w_{t+1} &= w_t - v_t
\end{aligned}
\]

def sgd_with_momentum(X, y, lr=0.01, momentum=0.9, epochs=100):
    w = np.zeros(X.shape[1])
    v = np.zeros_like(w)  # velocity

    for _ in range(epochs):
        gradient = compute_gradient(X, y, w)
        v = momentum * v + lr * gradient
        w -= v

    return w

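Nesterov Momentum

Computes the gradient at a "look-ahead" position, the point where the current momentum alone would carry the parameters.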

\[v_t = \gamma v_{t-1} + \eta \nabla_w L(w_t - \gamma v_{t-1})\]

# PyTorch
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9, nesterov=True)
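
To make the look-ahead step concrete, here is a NumPy sketch of the update formula above. It illustrates the equation rather than PyTorch's internal implementation. The function name `nesterov_gd`, the `grad_fn` argument, and the quadratic test objective are assumptions made for this example.

```python
import numpy as np

def nesterov_gd(grad_fn, w0, lr=0.1, momentum=0.9, steps=200):
    """Nesterov momentum: evaluate the gradient at the look-ahead point."""
    w = np.asarray(w0, dtype=float).copy()
    v = np.zeros_like(w)
    for _ in range(steps):
        lookahead = w - momentum * v            # where momentum alone would carry w
        v = momentum * v + lr * grad_fn(lookahead)
        w = w - v
    return w

# Illustrative objective: f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2), gradient = scales * w
scales = np.array([1.0, 10.0])
w_final = nesterov_gd(lambda w: scales * w, w0=[5.0, 5.0])
```

The only difference from `sgd_with_momentum` is that the gradient is taken at `lookahead` instead of at `w`.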

Adaptive Learning Rate

AdaGrad

Lowers the learning rate for parameters that are updated frequently.

\[w_{t+1} = w_t - \frac{\eta}{\sqrt{G_t + \epsilon}} \nabla_w L\]

$G_t$: accumulated sum of squared gradients
Problem: the effective learning rate keeps shrinking and converges to 0
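
The shrinking step size can be observed directly. The following is a minimal NumPy sketch of the AdaGrad update above, assuming a simple quadratic objective and recording the per-parameter effective learning rate at every step. The names `adagrad`, `grad_fn`, and `step_sizes` are illustrative.

```python
import numpy as np

def adagrad(grad_fn, w0, lr=0.5, eps=1e-8, steps=500):
    """AdaGrad: scale each parameter's step by its accumulated squared gradients."""
    w = np.asarray(w0, dtype=float).copy()
    G = np.zeros_like(w)                   # running sum of squared gradients
    step_sizes = []
    for _ in range(steps):
        g = grad_fn(w)
        G += g ** 2
        scale = lr / np.sqrt(G + eps)      # per-parameter effective learning rate
        w -= scale * g
        step_sizes.append(scale.copy())
    return w, np.array(step_sizes)

# Illustrative objective: f(w) = 0.5 * (w[0]**2 + 10 * w[1]**2), gradient = scales * w
scales = np.array([1.0, 10.0])
w_final, step_sizes = adagrad(lambda w: scales * w, w0=[5.0, 5.0])
```

Since `G` can only grow, every column of `step_sizes` is non-increasing, which is the behavior RMSprop was designed to avoid.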

RMSprop

Fixes AdaGrad's problem by replacing the cumulative sum with an exponential moving average.

\[
\begin{aligned}
E[g^2]_t &= \gamma E[g^2]_{t-1} + (1-\gamma) g_t^2 \\
w_{t+1} &= w_t - \frac{\eta}{\sqrt{E[g^2]_t + \epsilon}} g_t
\end{aligned}
\]

def rmsprop(X, y, lr=0.01, decay=0.9, eps=1e-8, epochs=100):
    w = np.zeros(X.shape[1])
    cache = np.zeros_like(w)

    for _ in range(epochs):
        gradient = compute_gradient(X, y, w)
        cache = decay * cache + (1 - decay) * gradient ** 2
        w -= lr * gradient / (np.sqrt(cache) + eps)

    return w

Adam (Adaptive Moment Estimation)

Combines momentum with RMSprop. The most widely used optimizer.

\[
\begin{aligned}
m_t &= \beta_1 m_{t-1} + (1-\beta_1) g_t && \text{(first moment)} \\
v_t &= \beta_2 v_{t-1} + (1-\beta_2) g_t^2 && \text{(second moment)} \\
\hat{m}_t &= \frac{m_t}{1-\beta_1^t}, \quad \hat{v}_t = \frac{v_t}{1-\beta_2^t} && \text{(bias correction)} \\
w_{t+1} &= w_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon} \hat{m}_t
\end{aligned}
\]

def adam(X, y, lr=0.001, beta1=0.9, beta2=0.999, eps=1e-8, epochs=100):
    w = np.zeros(X.shape[1])
    m = np.zeros_like(w)  # 1st moment
    v = np.zeros_like(w)  # 2nd moment
    t = 0

    for _ in range(epochs):
        t += 1
        gradient = compute_gradient(X, y, w)

        m = beta1 * m + (1 - beta1) * gradient
        v = beta2 * v + (1 - beta2) * gradient ** 2

        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)

        w -= lr * m_hat / (np.sqrt(v_hat) + eps)

    return w

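AdamW (Adam with Weight Decay)

Decouples weight decay from the gradient-based update. In plain Adam, an L2 penalty added to the loss is rescaled by the adaptive step size, so it no longer behaves like true weight decay; AdamW shrinks the weights directly instead.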

# PyTorch
optimizer = optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)
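
What "decoupled" means is easiest to see in a single update. The sketch below is a NumPy illustration under the usual formulation of AdamW, not PyTorch's source code. The function name `adamw_step` and its argument layout are assumptions for this example.

```python
import numpy as np

def adamw_step(w, grad, m, v, t, lr=0.001, beta1=0.9, beta2=0.999,
               eps=1e-8, weight_decay=0.01):
    """One AdamW update with decoupled weight decay (illustrative sketch)."""
    w = w * (1 - lr * weight_decay)            # decay acts directly on the weights
    m = beta1 * m + (1 - beta1) * grad         # the moments see only the loss gradient
    v = beta2 * v + (1 - beta2) * grad ** 2
    m_hat = m / (1 - beta1 ** t)
    v_hat = v / (1 - beta2 ** t)
    w = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return w, m, v

w = np.array([1.0, -2.0])
m = np.zeros_like(w)
v = np.zeros_like(w)
# With a zero gradient the Adam term vanishes and only the decay acts.
w_next, m, v = adamw_step(w, np.zeros_like(w), m, v, t=1)
```

In the L2-penalty version, `weight_decay * w` would instead be added to `grad` before the moment updates, so the decay would be divided by `sqrt(v_hat)` along with everything else.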

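Optimizer Comparison

Optimizer	Characteristics	Typical use
SGD	Simple, robust	Computer vision, fine-tuning
SGD+Momentum	Fast convergence	General purpose
Adam	Adaptive, fast	NLP, Transformers
AdamW	Adam + decoupled weight decay	LLM training
LAMB	Large batches	Distributed training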

# Common choices
# CNN: SGD + Momentum + Weight Decay
optimizer = optim.SGD(model.parameters(), lr=0.1, momentum=0.9, weight_decay=1e-4)

# Transformer/LLM: AdamW
optimizer = optim.AdamW(model.parameters(), lr=1e-4, betas=(0.9, 0.999), weight_decay=0.01)

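Convergence Analysis

Convergence on Convex Functions

Convex: convergence to the global minimum is guaranteed (given a suitable learning rate)
Non-convex: may converge to a local minimum

Convergence Conditions

Decaying learning rate: $\sum \eta_t = \infty$, $\sum \eta_t^2 < \infty$
Lipschitz continuity: the gradient changes at a bounded rate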
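
As a worked example of the learning-rate condition, consider the schedule $\eta_t = \eta_0 / t$ (chosen here purely for illustration). The first sum is a harmonic series and diverges, while the second is a convergent p-series:

\[
\sum_{t=1}^{\infty} \frac{\eta_0}{t} = \infty, \qquad \sum_{t=1}^{\infty} \frac{\eta_0^2}{t^2} = \frac{\pi^2}{6}\,\eta_0^2 < \infty
\]

A constant learning rate $\eta_t = \eta_0$ satisfies the first condition but not the second, since $\sum \eta_0^2$ diverges. This is one way to see why stochastic updates with a fixed learning rate keep fluctuating around the minimum instead of settling on it.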

Convergence Monitoring

import numpy as np
import torch

def train_with_monitoring(model, optimizer, train_loader, val_loader, epochs=100, patience=10):
    """Assumes model(batch) returns a scalar loss and evaluate() returns the validation loss."""
    history = {'train_loss': [], 'val_loss': [], 'grad_norm': []}

    for epoch in range(epochs):
        model.train()
        train_loss = 0.0
        grad_norms = []

        for batch in train_loader:
            optimizer.zero_grad()
            loss = model(batch)
            loss.backward()

            # Gradient clipping; clip_grad_norm_ returns the total norm before clipping,
            # so the same call also provides the value to record.
            total_norm = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
            grad_norms.append(float(total_norm))

            optimizer.step()
            train_loss += loss.item()

        # Validation
        model.eval()
        val_loss = evaluate(model, val_loader)

        history['train_loss'].append(train_loss / len(train_loader))
        history['val_loss'].append(val_loss)
        history['grad_norm'].append(np.mean(grad_norms))

        # Early stopping: stop when the last `patience` epochs did not beat the best earlier loss
        val_losses = history['val_loss']
        if len(val_losses) > patience and min(val_losses[-patience:]) >= min(val_losses[:-patience]):
            print("Early stopping")
            break

    return history

Gradient Clipping

Prevents exploding gradients.

# Norm clipping
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)

# Value clipping
torch.nn.utils.clip_grad_value_(model.parameters(), clip_value=0.5)
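
The two modes behave differently: norm clipping rescales all gradients by one shared factor and so preserves their direction, while value clipping clamps each element independently. A NumPy sketch of both ideas follows. The helper names `clip_by_global_norm` and `clip_by_value` are made up for this illustration and are not PyTorch functions.

```python
import numpy as np

def clip_by_global_norm(grads, max_norm):
    """Rescale all gradients together so their combined L2 norm is at most max_norm."""
    total_norm = np.sqrt(sum(np.sum(g ** 2) for g in grads))
    scale = min(1.0, max_norm / (total_norm + 1e-6))  # never scale gradients up
    return [g * scale for g in grads], total_norm

def clip_by_value(grads, clip_value):
    """Clamp every gradient element into [-clip_value, clip_value]."""
    return [np.clip(g, -clip_value, clip_value) for g in grads]

grads = [np.array([3.0, 0.0]), np.array([0.0, 4.0])]  # combined norm is 5
clipped, norm_before = clip_by_global_norm(grads, max_norm=1.0)
norm_after = np.sqrt(sum(np.sum(g ** 2) for g in clipped))
value_clipped = clip_by_value([np.array([3.0, -0.2])], clip_value=0.5)
```

Norm clipping is the more common choice for RNNs and Transformers because it leaves the update direction unchanged.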

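Scikit-learn Example

Many scikit-learn estimators are trained with an iterative optimizer internally; SGDClassifier, SGDRegressor, and MLPClassifier expose the learning-rate settings directly.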

from sklearn.linear_model import SGDClassifier, SGDRegressor
from sklearn.neural_network import MLPClassifier

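# SGD-based linear classifier (the loss selects the model: SVM, logistic regression, etc.)
# Note: with learning_rate='optimal', eta0 would be ignored.
clf = SGDClassifier(
    loss='log_loss',          # logistic regression
    learning_rate='adaptive', # holds eta0, dividing it by 5 when progress stalls
    eta0=0.01,                # initial learning rate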
    max_iter=1000,
    early_stopping=True,
    validation_fraction=0.1,
    n_iter_no_change=5,
    random_state=42
)
clf.fit(X_train, y_train)

# MLP (neural network)
mlp = MLPClassifier(
    hidden_layer_sizes=(100, 50),
    solver='adam',            # 'sgd', 'adam', 'lbfgs'
    learning_rate='adaptive', # 'constant', 'invscaling', 'adaptive'; only used when solver='sgd'
    learning_rate_init=0.001,
    max_iter=500,
    early_stopping=True,
    random_state=42
)
mlp.fit(X_train, y_train)

Hyperparameter Tuning Tips

Learning Rate Finder

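def lr_finder(model, optimizer, train_loader, start_lr=1e-7, end_lr=10, num_iter=100):
    """Learning-rate range test.

    train_step is assumed to run one optimization step and return the loss as a float.
    """
    lrs = np.logspace(np.log10(start_lr), np.log10(end_lr), num_iter)
    losses = []
    data_iter = iter(train_loader)

    for lr in lrs:
        for param_group in optimizer.param_groups:
            param_group['lr'] = lr

        try:
            batch = next(data_iter)
        except StopIteration:  # restart once the loader is exhausted
            data_iter = iter(train_loader)
            batch = next(data_iter)
        loss = train_step(model, optimizer, batch)
        losses.append(loss)

        if loss > 4 * min(losses):  # divergence detected
            break

    # Choose the learning rate where the loss falls fastest;
    # in practice, a value about 10x smaller than the one at the minimum loss is common.
    return lrs[:len(losses)], losses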

Batch Size vs. Learning Rate

When increasing the batch size, it is common to scale the learning rate proportionally.

Batch size	Learning rate	Note
32	0.001	Baseline
64	0.002	2x
256	0.008	8x (linear scaling rule)

# Linear scaling rule
base_lr = 0.001
base_batch_size = 32
new_batch_size = 256

new_lr = base_lr * (new_batch_size / base_batch_size)

Optimizer Selection Strategy

Starting point: Adam (lr=0.001, betas=(0.9, 0.999))
CNN: SGD + Momentum (lr=0.1, momentum=0.9) + Cosine Annealing
Transformer: AdamW (lr=1e-4, weight_decay=0.01) + Warmup
Fine-tuning: a low learning rate (1/10 to 1/100 of the original)

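Common Mistakes

1. Fixing the learning rate and never tuning it

The learning rate is the most important hyperparameter, so it should always be searched.

# Bad
optimizer = optim.Adam(model.parameters(), lr=0.001)  # default value, never tuned

# Good: grid search or an LR finder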
for lr in [1e-5, 1e-4, 1e-3, 1e-2]:
    model = create_model()
    optimizer = optim.Adam(model.parameters(), lr=lr)
    val_loss = train_and_evaluate(model, optimizer)
    print(f"LR: {lr}, Val Loss: {val_loss}")

2. Training without a learning-rate scheduler

Lowering the learning rate late in training helps the model converge.

# Bad: no scheduler
for epoch in range(100):
    train(model, optimizer)

# Good: use a scheduler
scheduler = optim.lr_scheduler.CosineAnnealingLR(optimizer, T_max=100)
for epoch in range(100):
    train(model, optimizer)
    scheduler.step()

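3. Training large models without gradient clipping

RNNs, Transformers, and similar models need clipping to prevent exploding gradients.

# Good: clip after backward() and before optimizer.step()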
torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=1.0)
optimizer.step()

4. Increasing the batch size without adjusting the learning rate

# Bad: only the batch size changes
batch_size = 256  # increased from 32
optimizer = optim.Adam(model.parameters(), lr=0.001)  # learning rate unchanged

# Good: scale the learning rate proportionally, or add warmup
optimizer = optim.Adam(model.parameters(), lr=0.008)  # 8x larger
# or use learning-rate warmup

5. Not diagnosing the cause when the loss becomes NaN

# Debugging checklist
def debug_nan_loss(model, batch, target, criterion):
    # 1. Check the input
    if torch.isnan(batch).any():
        print("NaN in input")

    # 2. Check the output
    output = model(batch)
    if torch.isnan(output).any():
        print("NaN in output")

    # 3. Check the gradients
    model.zero_grad()
    loss = criterion(output, target)
    loss.backward()
    for name, param in model.named_parameters():
        if param.grad is not None and torch.isnan(param.grad).any():
            print(f"NaN in gradient of {name}")

    # Common causes: learning rate too large, log(0), division by zero

6. Ignoring the momentum parameter

Momentum makes a large difference for SGD.

# Bad
optimizer = optim.SGD(model.parameters(), lr=0.01)  # no momentum

# Good
optimizer = optim.SGD(model.parameters(), lr=0.01, momentum=0.9)
