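Regularization Techniques¶

Techniques that prevent overfitting and improve a model's ability to generalize. They constrain the model so that it does not fit the training data too closely.

What Is Overfitting¶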

[Figure: regularization diagram 1]

Dropout¶

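Randomly deactivates neurons during training. One of the most widely used regularization techniques.

How It Works¶

During training:
[x1] --0.8-- [h1] --X--(dropped)---- [output]
[x2] --0.5-- [h2] --1.2-------------
[x3] --0.3-- [h3] --0.9-------------

At inference:
All neurons are active, and the weights are scaled by (1-p).
Alternatively, scale the surviving activations by 1/(1-p) during training (inverted dropout), so no scaling is needed at inference.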

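Formula¶

\[y = \begin{cases} \frac{x \cdot m}{1-p} & \text{(training)} \\ x & \text{(inference)} \end{cases}\]

$m$: Bernoulli mask ($m_i \sim \text{Bernoulli}(1-p)$), so each unit is kept with probability $1-p$
$p$: dropout probability (the probability of dropping a unit)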
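
The division by $1-p$ keeps the expected activation unchanged, since $E[m_i] = 1-p$. A minimal pure-Python sketch that checks this empirically (the helper `inverted_dropout`, the seed, and the sample size are illustrative choices, not part of the original implementation):

```python
import random

def inverted_dropout(values, p, rng):
    """Keep each value with probability 1 - p and rescale survivors by 1 / (1 - p)."""
    keep = 1.0 - p
    return [v / keep if rng.random() < keep else 0.0 for v in values]

rng = random.Random(0)          # fixed seed for reproducibility
x = [1.0] * 100_000
y = inverted_dropout(x, p=0.3, rng=rng)

mean_y = sum(y) / len(y)                            # should stay close to the input mean of 1.0
kept_fraction = sum(v != 0.0 for v in y) / len(y)   # should be close to 1 - p = 0.7
print(f"mean={mean_y:.3f}, kept={kept_fraction:.3f}")
```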

Implementation¶

import torch
import torch.nn as nn
import torch.nn.functional as F

class Dropout(nn.Module):
    def __init__(self, p=0.5):
        super().__init__()
        self.p = p

    def forward(self, x):
        if self.training and self.p > 0:
            # Inverted dropout: rescale during training so inference needs no scaling
            mask = (torch.rand_like(x) > self.p).to(x.dtype)
            return x * mask / (1 - self.p)
        return x

# Built-in PyTorch layer
dropout = nn.Dropout(p=0.5)

# Dropout is applied only in training mode (`model` and `x` are defined elsewhere)
model.train()   # enables dropout
output = model(x)

model.eval()    # disables dropout
output = model(x)
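
The effect of the two modes can be seen on a bare `nn.Dropout` layer. A small sketch, assuming an all-ones input so that surviving entries are easy to recognize:

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout = nn.Dropout(p=0.5)
x = torch.ones(4, 8)

dropout.train()                 # training mode: entries are zeroed or scaled by 1 / (1 - p)
y_train = dropout(x)

dropout.eval()                  # evaluation mode: the layer acts as the identity
y_eval = dropout(x)

print(y_train.unique())         # a subset of {0., 2.}
print(torch.equal(y_eval, x))
```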

Dropout Variants¶

Dropout2d (Spatial Dropout)¶

Drops entire channels in a CNN.

# Effective for feature maps with strong spatial correlation
dropout2d = nn.Dropout2d(p=0.2)

# Input: (N, C, H, W)
# Whole channels are dropped (not individual pixels)
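
A short check of the channel-wise behavior, assuming an all-ones input (the shapes and `p=0.5` are arbitrary choices for illustration):

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
dropout2d = nn.Dropout2d(p=0.5)
dropout2d.train()

x = torch.ones(2, 6, 4, 4)      # (N, C, H, W)
y = dropout2d(x)

# Collapse the spatial dimensions: every (sample, channel) map is constant,
# either all zeros (dropped) or all 1 / (1 - p) = 2 (kept and rescaled).
per_channel_min = y.amin(dim=(2, 3))
per_channel_max = y.amax(dim=(2, 3))
print(torch.equal(per_channel_min, per_channel_max))
print(per_channel_max)
```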

DropConnect¶

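Drops individual weights rather than whole neurons.

class DropConnect(nn.Module):
    """Wraps a layer that has `weight` and `bias` (assumed here to be nn.Linear)."""
    def __init__(self, module, p=0.5):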
        super().__init__()
        self.module = module
        self.p = p

    def forward(self, x):
        if self.training:
            mask = torch.rand_like(self.module.weight) > self.p
            masked_weight = self.module.weight * mask / (1 - self.p)
            return F.linear(x, masked_weight, self.module.bias)
        return self.module(x)

DropPath (Stochastic Depth)¶

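Drops an entire layer or block, usually the residual branch. Used in ViT, EfficientNet, and similar architectures.

class DropPath(nn.Module):
    """Per-sample drop of a whole layer (stochastic depth)."""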
    def __init__(self, drop_prob=0.0):
        super().__init__()
        self.drop_prob = drop_prob

    def forward(self, x):
        if self.drop_prob == 0. or not self.training:
            return x

        keep_prob = 1 - self.drop_prob
        # Drop independently for each sample in the batch
        shape = (x.shape[0],) + (1,) * (x.ndim - 1)
        random_tensor = keep_prob + torch.rand(shape, dtype=x.dtype, device=x.device)
        random_tensor.floor_()  # binarize: 1 with probability keep_prob, else 0

        return x.div(keep_prob) * random_tensor

# Usage
class ResBlock(nn.Module):
    def __init__(self, dim, drop_path=0.1):
        super().__init__()
        self.conv = nn.Conv2d(dim, dim, 3, padding=1)
        self.drop_path = DropPath(drop_path) if drop_path > 0 else nn.Identity()

    def forward(self, x):
        return x + self.drop_path(self.conv(x))
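
The `floor` step in `DropPath` works because `keep_prob + u`, with `u` uniform on [0, 1), reaches 1 exactly when `u >= 1 - keep_prob`, which happens with probability `keep_prob`. A pure-Python sketch of that sampling step (the function name `sample_keep_flags` is made up for illustration):

```python
import math
import random

def sample_keep_flags(batch_size, drop_prob, rng):
    """Return one 0/1 flag per sample; 1 means the residual branch is kept."""
    keep_prob = 1.0 - drop_prob
    return [math.floor(keep_prob + rng.random()) for _ in range(batch_size)]

rng = random.Random(0)
flags = sample_keep_flags(100_000, drop_prob=0.1, rng=rng)
kept_fraction = sum(flags) / len(flags)     # should be close to keep_prob = 0.9
print(f"kept fraction: {kept_fraction:.3f}")
```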

Where to Place Dropout¶

# In an MLP
class MLP(nn.Module):
    def __init__(self, dim, hidden_dim, dropout=0.1):
        super().__init__()
        self.fc1 = nn.Linear(dim, hidden_dim)
        self.fc2 = nn.Linear(hidden_dim, dim)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x):
        x = F.relu(self.fc1(x))
        x = self.dropout(x)  # after the activation, before the next layer
        x = self.fc2(x)
        return x

# In a Transformer
class TransformerBlock(nn.Module):
    def __init__(self, d_model, nhead, dropout=0.1):
        super().__init__()
        # nn.MultiheadAttention expects (seq_len, batch, d_model) unless batch_first=True
        self.attn = nn.MultiheadAttention(d_model, nhead, dropout=dropout)
        self.ffn = MLP(d_model, d_model * 4, dropout)
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)  # residual dropout

    def forward(self, x):
        # Attention
        attn_out, _ = self.attn(x, x, x)
        x = x + self.dropout(attn_out)  # dropout on the branch before adding it to the residual
        x = self.norm1(x)

        # FFN
        x = x + self.dropout(self.ffn(x))
        x = self.norm2(x)
        return x

Weight Decay (L2 Regularization)¶

Keeps weights from growing large by adding a penalty on their magnitude to the loss function.

Formula¶

\[L_{total} = L_{data} + \frac{\lambda}{2} \sum_i w_i^2\]

Gradient:

\[\frac{\partial L_{total}}{\partial w} = \frac{\partial L_{data}}{\partial w} + \lambda w\]

Update, with learning rate $\eta$:

\[w \leftarrow w - \eta \left(\frac{\partial L_{data}}{\partial w} + \lambda w\right) = (1 - \eta\lambda)w - \eta\frac{\partial L_{data}}{\partial w}\]
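
The two forms of the update are algebraically identical, which is why the L2 penalty shows up as a multiplicative "decay" of the weight under plain gradient descent. A quick numeric check with made-up scalar values (the function names are illustrative):

```python
def sgd_l2_step(w, grad, lr, lam):
    """Gradient step on L_data + (lam / 2) * w**2."""
    return w - lr * (grad + lam * w)

def sgd_decay_step(w, grad, lr, lam):
    """Shrink the weight by (1 - lr * lam), then take the plain gradient step."""
    return (1.0 - lr * lam) * w - lr * grad

w, grad, lr, lam = 2.0, 0.5, 0.1, 0.01
a = sgd_l2_step(w, grad, lr, lam)
b = sgd_decay_step(w, grad, lr, lam)
print(a, b)   # both are approximately 1.948
```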

Weight Decay vs L2 Regularization¶

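With plain SGD the two are equivalent, but with adaptive learning-rate optimizers such as Adam they behave differently.

# L2 regularization (penalty added to the loss)
# The coefficient 0.01 plays the role of lambda / 2 in the formula above.
optimizer.zero_grad()
loss = criterion(output, target) + 0.01 * sum(p.pow(2).sum() for p in model.parameters())
loss.backward()
optimizer.step()

# Decoupled weight decay (AdamW)
# The decay is applied directly to the parameters, outside the adaptive scaling.
# Adam + L2:  g = grad + weight_decay * w;  w = w - lr * m_hat(g) / (sqrt(v_hat(g)) + eps)
# AdamW:      w = w - lr * m_hat(grad) / (sqrt(v_hat(grad)) + eps) - lr * weight_decay * w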
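
The difference can be made concrete with a scalar Adam step written by hand. This is a simplified sketch, not PyTorch's implementation: `adam_step`, its `decoupled` flag, and the numbers are assumptions for illustration. With L2 folded into the gradient, the penalty passes through Adam's normalization and the first step has size close to `lr` regardless of `weight_decay`; with decoupled decay, the shrinkage `lr * weight_decay * w` is applied on top of the Adam step.

```python
import math

def adam_step(w, grad, state, lr=1e-3, betas=(0.9, 0.999), eps=1e-8,
              weight_decay=0.0, decoupled=False):
    """One scalar Adam update; `decoupled=True` gives AdamW-style decay."""
    b1, b2 = betas
    state["t"] += 1
    if decoupled:
        w = w * (1.0 - lr * weight_decay)      # decay acts on the weight directly
    else:
        grad = grad + weight_decay * w         # L2: decay enters the moment estimates
    state["m"] = b1 * state["m"] + (1 - b1) * grad
    state["v"] = b2 * state["v"] + (1 - b2) * grad * grad
    m_hat = state["m"] / (1 - b1 ** state["t"])
    v_hat = state["v"] / (1 - b2 ** state["t"])
    return w - lr * m_hat / (math.sqrt(v_hat) + eps)

def new_state():
    return {"t": 0, "m": 0.0, "v": 0.0}

w0, g = 1.0, 0.1
w_l2 = adam_step(w0, g, new_state(), weight_decay=0.01, decoupled=False)
w_adamw = adam_step(w0, g, new_state(), weight_decay=0.01, decoupled=True)
print(f"Adam + L2: {w_l2:.6f}")
print(f"AdamW:     {w_adamw:.6f}")
```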

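PyTorch Implementation¶

# SGD + weight decay (equivalent to L2 regularization)
optimizer = torch.optim.SGD(model.parameters(), lr=0.01, weight_decay=0.0001)

# AdamW (recommended; decoupled weight decay)
optimizer = torch.optim.AdamW(model.parameters(), lr=0.001, weight_decay=0.01)

# Apply weight decay only to selected parameters.
# The patterns must match the model's own parameter names; these assume
# normalization layers registered under the name `LayerNorm`.
no_decay = ['bias', 'LayerNorm.weight', 'LayerNorm.bias']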
optimizer_grouped_parameters = [
    {
        'params': [p for n, p in model.named_parameters() 
                   if not any(nd in n for nd in no_decay)],
        'weight_decay': 0.01
    },
    {
        'params': [p for n, p in model.named_parameters() 
                   if any(nd in n for nd in no_decay)],
        'weight_decay': 0.0  # no decay for bias and norm parameters
    }
]
optimizer = torch.optim.AdamW(optimizer_grouped_parameters, lr=1e-4)
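
Because the grouping relies on substring matching, it silently misses normalization layers that are registered under another name. A pure-Python sketch with hypothetical parameter names (none of them come from a real model):

```python
no_decay = ["bias", "LayerNorm.weight", "LayerNorm.bias"]

# Hypothetical parameter names: the first three follow the `LayerNorm` naming
# convention, the last three come from a block whose norm layer is called `norm1`.
param_names = [
    "encoder.dense.weight",
    "encoder.dense.bias",
    "encoder.LayerNorm.weight",
    "block.fc1.weight",
    "block.norm1.weight",
    "block.norm1.bias",
]

decay = [n for n in param_names if not any(nd in n for nd in no_decay)]
skipped = [n for n in param_names if any(nd in n for nd in no_decay)]

print("decay:  ", decay)
print("skipped:", skipped)
```

Here `block.norm1.weight` lands in the decay group because its name does not contain `LayerNorm.weight`. Filtering on the parameter's dimensionality (for example `p.ndim <= 1`) is a common name-independent alternative.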

L1 Regularization (Lasso)¶

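Encourages sparse weights by driving some of them to zero. With plain (sub)gradient descent the weights typically end up very close to zero rather than exactly zero; exact zeros usually require a proximal (soft-thresholding) update or pruning afterward.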

\[L_{total} = L_{data} + \lambda \sum_i |w_i|\]

def l1_regularization(model, lambda_l1=0.001):
    l1_norm = sum(p.abs().sum() for p in model.parameters())
    return lambda_l1 * l1_norm

# During training
loss = criterion(output, target)
loss = loss + l1_regularization(model)
loss.backward()

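L1 vs L2 Comparison¶

Property	L1 (Lasso)	L2 (Ridge)
Weight distribution	Sparse (many zeros)	Small but nonzero
Feature selection	Yes (automatic)	No
Differentiability	Not differentiable at 0	Differentiable everywhere
Typical use	Feature selection, compression	General-purpose regularization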
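
The first row of the table can be illustrated with penalty-only updates on a single weight. This sketch uses soft-thresholding (the proximal step for the L1 penalty), which is a different update from the subgradient approach in `l1_regularization` above; the function names and constants are illustrative assumptions:

```python
def l2_shrink(w, lr, lam):
    """One penalty-only gradient step for (lam / 2) * w**2."""
    return w * (1.0 - lr * lam)

def l1_soft_threshold(w, lr, lam):
    """Proximal step for lam * |w|: move toward zero by lr * lam, stopping at zero."""
    shrink = lr * lam
    if w > shrink:
        return w - shrink
    if w < -shrink:
        return w + shrink
    return 0.0

w_l1 = w_l2 = 0.05
for _ in range(100):
    w_l1 = l1_soft_threshold(w_l1, lr=0.1, lam=0.01)
    w_l2 = l2_shrink(w_l2, lr=0.1, lam=0.01)

print(w_l1, w_l2)   # L1 reaches exactly 0.0; L2 only shrinks geometrically
```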

Early Stopping¶

Stops training when the validation loss has not improved for `patience` consecutive epochs.

class EarlyStopping:
    def __init__(self, patience=7, min_delta=0, restore_best=True):
        self.patience = patience
        self.min_delta = min_delta
        self.restore_best = restore_best

        self.counter = 0
        self.best_loss = float('inf')
        self.best_model = None
        self.should_stop = False

    def __call__(self, val_loss, model):
        if val_loss < self.best_loss - self.min_delta:
            self.best_loss = val_loss
            self.counter = 0
            if self.restore_best:
                self.best_model = {k: v.cpu().clone() for k, v in model.state_dict().items()}
        else:
            self.counter += 1
            if self.counter >= self.patience:
                self.should_stop = True
                if self.restore_best and self.best_model is not None:
                    model.load_state_dict(self.best_model)

        return self.should_stop

# Usage (train_one_epoch and evaluate are defined elsewhere)
early_stopping = EarlyStopping(patience=10)

for epoch in range(max_epochs):
    train_loss = train_one_epoch(model, train_loader)
    val_loss = evaluate(model, val_loader)

    if early_stopping(val_loss, model):
        print(f"Early stopping at epoch {epoch}")
        break
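
The patience logic can be traced without a model. A minimal sketch that replays the same counter rules on a list of validation losses (the function `simulate_early_stopping` and the loss values are made up for illustration):

```python
def simulate_early_stopping(val_losses, patience, min_delta=0.0):
    """Return (stop_epoch, best_epoch); stop_epoch is None if patience never runs out."""
    best_loss = float("inf")
    best_epoch = None
    counter = 0
    for epoch, loss in enumerate(val_losses):
        if loss < best_loss - min_delta:
            best_loss, best_epoch, counter = loss, epoch, 0   # improvement resets the counter
        else:
            counter += 1
            if counter >= patience:
                return epoch, best_epoch
    return None, best_epoch

losses = [1.0, 0.8, 0.7, 0.71, 0.72, 0.69, 0.70, 0.71, 0.72]
stop_epoch, best_epoch = simulate_early_stopping(losses, patience=3)
print(stop_epoch, best_epoch)   # 8 5
```

With `patience=3` the brief improvement at epoch 5 resets the counter, so training continues until epoch 8 and the weights from epoch 5 would be restored.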

Data Augmentation¶

Increases the diversity of the training data by applying transformations to it.

from torchvision import transforms

# Image augmentation
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    transforms.RandomHorizontalFlip(),
    transforms.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2),
    transforms.RandomRotation(15),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
    transforms.RandomErasing(p=0.5),  # Cutout variant; operates on tensors, so it follows ToTensor
])

# RandAugment (automated augmentation)
from torchvision.transforms import RandAugment
train_transform = transforms.Compose([
    transforms.RandomResizedCrop(224),
    RandAugment(num_ops=2, magnitude=9),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

Mixup¶

Creates a new sample by linearly interpolating between two samples.

import numpy as np

def mixup_data(x, y, alpha=0.2):
    """Mixup data augmentation."""
    if alpha > 0:
        lam = np.random.beta(alpha, alpha)
    else:
        lam = 1

    batch_size = x.size(0)
    index = torch.randperm(batch_size, device=x.device)

    mixed_x = lam * x + (1 - lam) * x[index]
    y_a, y_b = y, y[index]

    return mixed_x, y_a, y_b, lam

def mixup_criterion(criterion, pred, y_a, y_b, lam):
    return lam * criterion(pred, y_a) + (1 - lam) * criterion(pred, y_b)

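# During training (model, criterion, and optimizer are defined elsewhere)
for x, y in train_loader:
    x, y_a, y_b, lam = mixup_data(x, y, alpha=0.2)
    output = model(x)
    loss = mixup_criterion(criterion, output, y_a, y_b, lam)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()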
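
For cross-entropy, the weighted sum computed by `mixup_criterion` equals the loss against the mixed soft label `lam * y_a + (1 - lam) * y_b`, because cross-entropy is linear in the target. A small pure-Python check with made-up probabilities:

```python
import math

def cross_entropy(probs, target_dist):
    """Cross-entropy between predicted probabilities and a target distribution."""
    return -sum(t * math.log(p) for p, t in zip(probs, target_dist) if t > 0)

probs = [0.7, 0.2, 0.1]         # model output after softmax (made-up numbers)
y_a = [1.0, 0.0, 0.0]           # one-hot label of the original sample
y_b = [0.0, 0.0, 1.0]           # one-hot label of the shuffled partner
lam = 0.6

# Loss as computed by mixup_criterion: weighted sum of two hard-label losses
loss_split = lam * cross_entropy(probs, y_a) + (1 - lam) * cross_entropy(probs, y_b)

# Loss against the mixed soft label lam * y_a + (1 - lam) * y_b
y_mix = [lam * a + (1 - lam) * b for a, b in zip(y_a, y_b)]
loss_soft = cross_entropy(probs, y_mix)

print(loss_split, loss_soft)
```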

CutMix¶

Replaces a region of one image with the same region from another image.

def cutmix_data(x, y, alpha=1.0):
    """Paste a random box from a shuffled copy of the batch (modifies x in place)."""
    lam = np.random.beta(alpha, alpha)
    batch_size = x.size(0)
    index = torch.randperm(batch_size, device=x.device)

    # Size of the box to cut; x has shape (N, C, H, W)
    H, W = x.size(2), x.size(3)
    cut_rat = np.sqrt(1. - lam)
    cut_h = int(H * cut_rat)
    cut_w = int(W * cut_rat)

    # Random box center
    cy = np.random.randint(H)
    cx = np.random.randint(W)

    bby1 = np.clip(cy - cut_h // 2, 0, H)
    bby2 = np.clip(cy + cut_h // 2, 0, H)
    bbx1 = np.clip(cx - cut_w // 2, 0, W)
    bbx2 = np.clip(cx + cut_w // 2, 0, W)

    x[:, :, bby1:bby2, bbx1:bbx2] = x[index, :, bby1:bby2, bbx1:bbx2]

    # Adjust lambda to the actual area ratio after clipping
    lam = 1 - ((bbx2 - bbx1) * (bby2 - bby1) / (H * W))

    return x, y, y[index], lam
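
The lambda adjustment matters when the box is clipped at the image border. A deterministic pure-Python sketch of the box arithmetic (the helper `cutmix_box` takes the center as an argument instead of sampling it, which is an assumption made for illustration):

```python
import math

def cutmix_box(height, width, lam, cy, cx):
    """Return (y1, y2, x1, x2, adjusted_lam) for a box centered at (cy, cx)."""
    cut_ratio = math.sqrt(1.0 - lam)
    cut_h, cut_w = int(height * cut_ratio), int(width * cut_ratio)
    y1 = min(max(cy - cut_h // 2, 0), height)
    y2 = min(max(cy + cut_h // 2, 0), height)
    x1 = min(max(cx - cut_w // 2, 0), width)
    x2 = min(max(cx + cut_w // 2, 0), width)
    adjusted_lam = 1.0 - (y2 - y1) * (x2 - x1) / (height * width)
    return y1, y2, x1, x2, adjusted_lam

print(cutmix_box(32, 32, lam=0.75, cy=16, cx=16))   # (8, 24, 8, 24, 0.75)
print(cutmix_box(32, 32, lam=0.75, cy=0, cx=0))     # (0, 8, 0, 8, 0.9375)
```

A box centered in the image keeps the sampled `lam`, while a box centered on the corner is clipped to a quarter of its area, so the label weight of the original image rises from 0.75 to 0.9375.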

Label Smoothing¶

Softens one-hot labels to prevent overconfident predictions.

\[y_{smooth} = (1 - \epsilon) \cdot y_{hard} + \frac{\epsilon}{K}\]
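
Here $\epsilon$ is the smoothing strength and $K$ the number of classes. Note that the true class receives $1 - \epsilon + \epsilon/K$, not $1 - \epsilon$. A worked example in plain Python (the helper `smooth_labels` is illustrative):

```python
def smooth_labels(target, n_classes, eps=0.1):
    """Smoothed target distribution for a single class index."""
    off_value = eps / n_classes
    return [(1.0 - eps) + off_value if k == target else off_value
            for k in range(n_classes)]

dist = smooth_labels(target=2, n_classes=4, eps=0.1)
print([round(v, 3) for v in dist])   # [0.025, 0.025, 0.925, 0.025]
```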

class LabelSmoothingCrossEntropy(nn.Module):
    def __init__(self, smoothing=0.1):
        super().__init__()
        self.smoothing = smoothing

    def forward(self, pred, target):
        n_classes = pred.size(-1)

        # Convert to one-hot and apply smoothing
        one_hot = torch.zeros_like(pred).scatter(1, target.unsqueeze(1), 1)
        smooth_one_hot = one_hot * (1 - self.smoothing) + self.smoothing / n_classes

        log_prob = F.log_softmax(pred, dim=-1)
        loss = -(smooth_one_hot * log_prob).sum(dim=-1).mean()

        return loss

# Built into PyTorch (1.10+)
criterion = nn.CrossEntropyLoss(label_smoothing=0.1)

Combining Regularization Techniques¶

Recommended Combinations¶

Model	Recommended regularization
CNN	Dropout(0.5), Weight Decay, Data Augmentation
Transformer	Dropout(0.1), Weight Decay(0.01~0.1), Label Smoothing
LLM pretraining	Weight decay only (dropout is rarely used)
Fine-tuning	Higher dropout, small learning rate

Practical Example¶

# ViT training configuration
class ViTTrainingConfig:
    dropout = 0.1
    attention_dropout = 0.0
    drop_path = 0.1  # Stochastic depth
    weight_decay = 0.05
    label_smoothing = 0.1
    mixup_alpha = 0.8
    cutmix_alpha = 1.0

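Debugging Guide¶

Diagnosing Overfitting¶

import numpy as np

def check_overfitting(train_losses, val_losses, threshold=0.1):
    """Flag overfitting when val loss exceeds train loss and is still rising."""
    if min(len(train_losses), len(val_losses)) < 10:
        return False, "Not enough data"

    # Compare the mean of the last 5 losses
    recent_train = np.mean(train_losses[-5:])
    recent_val = np.mean(val_losses[-5:])
    gap = recent_val - recent_train

    # Slope of a linear fit to the last 10 validation losses
    val_trend = np.polyfit(range(len(val_losses[-10:])), val_losses[-10:], 1)[0]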

    if gap > threshold and val_trend > 0:
        return True, f"Gap: {gap:.4f}, Val trend: {val_trend:.4f}"
    return False, f"Gap: {gap:.4f}, Val trend: {val_trend:.4f}"

Tuning Regularization Strength¶

# Small hand-picked search grid (create_model and train_and_evaluate are placeholders)
regularization_configs = [
    {'dropout': 0.0, 'weight_decay': 0.0},      # baseline
    {'dropout': 0.1, 'weight_decay': 0.01},
    {'dropout': 0.2, 'weight_decay': 0.01},
    {'dropout': 0.1, 'weight_decay': 0.1},
    {'dropout': 0.3, 'weight_decay': 0.01},
]

for config in regularization_configs:
    model = create_model(**config)
    val_acc = train_and_evaluate(model)
    print(f"{config} -> Val Acc: {val_acc:.4f}")