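Activation Functions¶

A function that adds nonlinearity to a neural network. Without a nonlinear activation, even a multi-layer network reduces to a single linear (affine) transformation.

Why nonlinearity is needed¶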

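Composition of linear layers: $$f_2(f_1(x)) = W_2(W_1 x + b_1) + b_2 = W_2 W_1 x + W_2 b_1 + b_2$$

This is equivalent to a single linear transformation $Wx + b$ with $W = W_2 W_1$ and $b = W_2 b_1 + b_2$. Placing a nonlinear activation between the layers makes it possible to learn more complex functions.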
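The collapse of two linear layers into one can be checked numerically. The following is a minimal sketch using only the Python standard library; the weights, biases, and input are hand-picked illustrative values, and the helper names (`matvec`, `matmul`, `add`) are hypothetical.

```python
# Two affine layers (2 -> 2 -> 2) with hand-picked, purely illustrative values.
def matvec(W, x):
    return [sum(w * xi for w, xi in zip(row, x)) for row in W]

def matmul(A, B):
    cols = list(zip(*B))
    return [[sum(a * b for a, b in zip(row, col)) for col in cols] for row in A]

def add(u, v):
    return [a + b for a, b in zip(u, v)]

W1, b1 = [[1.0, 2.0], [0.0, -1.0]], [0.5, 1.0]
W2, b2 = [[2.0, 0.0], [1.0, 3.0]], [-1.0, 0.0]
x = [3.0, -2.0]

# Layer-by-layer evaluation: f2(f1(x))
two_layers = add(matvec(W2, add(matvec(W1, x), b1)), b2)

# Collapsed single layer: W = W2 W1, b = W2 b1 + b2
W = matmul(W2, W1)
b = add(matvec(W2, b1), b2)
one_layer = add(matvec(W, x), b)

print(two_layers, one_layer)  # both are [-2.0, 8.5]
```

Both evaluations agree, so stacking linear layers without an activation adds parameters but no expressive power.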

Classical activation functions¶

Sigmoid¶

\[\sigma(x) = \frac{1}{1 + e^{-x}}\]

Output range: (0, 1)

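import numpy as np
import torch
import torch.nn as nn
import torch.nn.functional as F

def sigmoid(x):
    return 1 / (1 + np.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1 - s)  # maximum value 0.25, reached at x = 0

# PyTorch
x = torch.linspace(-5, 5, 11)  # example input tensor
y = torch.sigmoid(x)
y = F.sigmoid(x)  # functional form, same result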

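Properties:
  - Pros: output can be interpreted as a probability, bounded output range
  - Cons: vanishing gradients (in the saturated regions), not zero-centered, computational cost of the exponential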
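The vanishing-gradient drawback follows from the derivative never exceeding 0.25. The sketch below (standard library only) evaluates the slope at its peak and in a saturated region, then bounds the activation part of a backpropagated gradient after several sigmoid layers. The depth of 10 and the ignoring of weight matrices are simplifying assumptions.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def sigmoid_derivative(x):
    s = sigmoid(x)
    return s * (1.0 - s)

peak = sigmoid_derivative(0.0)       # largest possible slope
saturated = sigmoid_derivative(6.0)  # slope deep in the saturated region

# Upper bound on the product of activation slopes across `depth` sigmoid
# layers (weights are ignored here for simplicity).
depth = 10
bound = peak ** depth

print(f"peak={peak}, saturated={saturated:.5f}, bound={bound:.2e}")
```

Even in the best case the activation slopes alone shrink the gradient by a factor of about a million after ten layers.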

Tanh¶

\[\tanh(x) = \frac{e^x - e^{-x}}{e^x + e^{-x}} = 2\sigma(2x) - 1\]

Output range: (-1, 1)

def tanh(x):
    return np.tanh(x)

def tanh_derivative(x):
    return 1 - np.tanh(x) ** 2  # maximum value 1, reached at x = 0

# PyTorch
y = torch.tanh(x)

Properties:
  - Pros: zero-centered, larger gradients than Sigmoid
  - Cons: still suffers from saturation
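The identity $\tanh(x) = 2\sigma(2x) - 1$ and the claim about larger gradients can both be checked in a few lines. This is a small sketch with the standard library; the sample points are arbitrary.

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def tanh_via_sigmoid(x):
    # Right-hand side of the identity: 2 * sigmoid(2x) - 1
    return 2.0 * sigmoid(2.0 * x) - 1.0

xs = [-3.0, -1.0, 0.0, 0.5, 2.0]
max_diff = max(abs(math.tanh(x) - tanh_via_sigmoid(x)) for x in xs)

# Slopes at the origin, where both derivatives peak
tanh_slope = 1.0 - math.tanh(0.0) ** 2
sigmoid_slope = sigmoid(0.0) * (1.0 - sigmoid(0.0))

print(max_diff, tanh_slope, sigmoid_slope)
```

The peak slope of Tanh is four times that of Sigmoid, which is why it passes gradients somewhat better near zero.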

ReLU family¶

ReLU (Rectified Linear Unit)¶

\[f(x) = \max(0, x)\]

def relu(x):
    return np.maximum(0, x)

def relu_derivative(x):
    return (x > 0).astype(float)

# PyTorch
y = torch.relu(x)
y = F.relu(x)
relu_layer = nn.ReLU()

Properties:
  - Pros: computationally cheap, mitigates vanishing gradients, sparse activations
  - Cons: dying ReLU problem, unbounded output

Leaky ReLU¶

\[f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha x & \text{if } x \leq 0 \end{cases}\]

Typically $\alpha = 0.01$.

def leaky_relu(x, alpha=0.01):
    return np.where(x > 0, x, alpha * x)

# PyTorch
y = F.leaky_relu(x, negative_slope=0.01)
leaky_relu_layer = nn.LeakyReLU(negative_slope=0.01)

Properties:
  - Mitigates the dying ReLU problem
  - Keeps a nonzero gradient for negative inputs
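To see why a nonzero negative slope matters, consider a single neuron whose bias has drifted far into the negative range. The sketch below uses a hypothetical neuron (`w = 0.5`, `b = -10.0`) and a handful of made-up inputs; it only compares the local activation gradients.

```python
def relu_grad(z):
    return 1.0 if z > 0 else 0.0

def leaky_relu_grad(z, alpha=0.01):
    return 1.0 if z > 0 else alpha

# Hypothetical neuron whose pre-activation is negative for every input.
w, b = 0.5, -10.0
inputs = [-2.0, -1.0, 0.0, 1.0, 2.0]
pre_activations = [w * x + b for x in inputs]

relu_grads = [relu_grad(z) for z in pre_activations]
leaky_grads = [leaky_relu_grad(z) for z in pre_activations]

print(relu_grads)   # all zeros: no signal reaches w or b, the neuron stays dead
print(leaky_grads)  # all 0.01: a small signal still flows, so recovery is possible
```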

PReLU (Parametric ReLU)¶

$\alpha$ is a learnable parameter.

# PyTorch
prelu_layer = nn.PReLU(num_parameters=1)  # or one parameter per channel

# Learnable alpha
class PReLU(nn.Module):
    def __init__(self, init=0.25):
        super().__init__()
        self.alpha = nn.Parameter(torch.tensor(init))

    def forward(self, x):
        return torch.where(x > 0, x, self.alpha * x)

ELU (Exponential Linear Unit)¶

\[f(x) = \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\]

def elu(x, alpha=1.0):
    return np.where(x > 0, x, alpha * (np.exp(x) - 1))

# PyTorch
y = F.elu(x, alpha=1.0)
elu_layer = nn.ELU(alpha=1.0)

Properties:
  - Negative outputs pull the mean activation closer to 0
  - Smooth curve (the derivative is continuous at 0 when $\alpha = 1$)
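The smoothness at the origin depends on $\alpha$: the right-hand slope is always 1, while the left-hand slope equals $\alpha$. A rough finite-difference sketch (standard library only; the step size `h` and the helper name `one_sided_slopes` are arbitrary choices for illustration):

```python
import math

def elu(x, alpha=1.0):
    return x if x > 0 else alpha * (math.exp(x) - 1.0)

def one_sided_slopes(alpha, h=1e-6):
    # Finite-difference slopes just left and just right of x = 0
    left = (elu(0.0, alpha) - elu(-h, alpha)) / h
    right = (elu(h, alpha) - elu(0.0, alpha)) / h
    return left, right

left_1, right_1 = one_sided_slopes(alpha=1.0)
left_half, right_half = one_sided_slopes(alpha=0.5)

print(f"alpha=1.0: left={left_1:.4f}, right={right_1:.4f}")
print(f"alpha=0.5: left={left_half:.4f}, right={right_half:.4f}")
```

With $\alpha = 1$ the two slopes match; any other value leaves a kink at zero, similar to Leaky ReLU.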

SELU (Scaled ELU)¶

Has a self-normalizing property.

\[f(x) = \lambda \begin{cases} x & \text{if } x > 0 \\ \alpha(e^x - 1) & \text{if } x \leq 0 \end{cases}\]

$\lambda \approx 1.0507$, $\alpha \approx 1.6733$

selu_layer = nn.SELU()
# Use together with AlphaDropout
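The self-normalizing property means that a standard-normal input is mapped to an output that again has mean close to 0 and variance close to 1. The following Monte Carlo sketch checks this with the standard library; the sample size and seed are arbitrary, and the constants are the full-precision versions of the rounded $\lambda$ and $\alpha$ given above.

```python
import math
import random

# Full-precision SELU constants (the text above shows them rounded).
LAMBDA = 1.0507009873554805
ALPHA = 1.6732632423543772

def selu(x):
    return LAMBDA * (x if x > 0 else ALPHA * (math.exp(x) - 1.0))

random.seed(0)  # fixed seed so the estimate is reproducible
samples = [selu(random.gauss(0.0, 1.0)) for _ in range(200_000)]

mean = sum(samples) / len(samples)
var = sum((s - mean) ** 2 for s in samples) / len(samples)

print(f"mean={mean:.3f}, var={var:.3f}")  # expected to land near 0 and 1
```

This only checks the fixed point for a single unit with an ideal input; in a real network the property also relies on suitable weight initialization.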

Modern activation functions¶

GELU (Gaussian Error Linear Unit)¶

Mainly used in Transformers.

\[\text{GELU}(x) = x \cdot \Phi(x) = x \cdot \frac{1}{2}\left[1 + \text{erf}\left(\frac{x}{\sqrt{2}}\right)\right]\]

Approximation: $$\text{GELU}(x) \approx 0.5x\left(1 + \tanh\left[\sqrt{2/\pi}(x + 0.044715x^3)\right]\right)$$

def gelu(x):
    return x * 0.5 * (1 + torch.erf(x / np.sqrt(2)))

def gelu_approx(x):
    return 0.5 * x * (1 + torch.tanh(np.sqrt(2/np.pi) * (x + 0.044715 * x**3)))

# PyTorch
y = F.gelu(x)
gelu_layer = nn.GELU()

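Properties:
  - Probabilistic interpretation: the larger the input, the higher the probability that it passes through
  - Smooth nonlinearity
  - Used in GPT, BERT, and similar models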
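How close the tanh approximation is to the exact erf form can be measured directly. This sketch uses only `math` from the standard library; the grid of 1001 points on [-5, 5] is an arbitrary choice, so the reported error is a grid estimate rather than a proven bound.

```python
import math

def gelu_exact(x):
    return 0.5 * x * (1.0 + math.erf(x / math.sqrt(2.0)))

def gelu_tanh(x):
    inner = math.sqrt(2.0 / math.pi) * (x + 0.044715 * x ** 3)
    return 0.5 * x * (1.0 + math.tanh(inner))

grid = [i / 100.0 for i in range(-500, 501)]  # -5.00 ... 5.00

max_err = max(abs(gelu_exact(x) - gelu_tanh(x)) for x in grid)
min_value = min(gelu_exact(x) for x in grid)

print(f"max |exact - tanh approx| on grid = {max_err:.2e}")
print(f"minimum of GELU on grid = {min_value:.4f}")
```

The second number also shows that GELU is bounded below: it dips slightly under zero for small negative inputs and then returns toward zero.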

Swish / SiLU¶

\[f(x) = x \cdot \sigma(\beta x)\]

When $\beta = 1$, this is SiLU (Sigmoid Linear Unit).

def swish(x, beta=1.0):
    return x * torch.sigmoid(beta * x)

# PyTorch
y = F.silu(x)  # SiLU = Swish-1
silu_layer = nn.SiLU()

Properties:
  - Self-gating
  - Used in EfficientNet and some LLMs
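The role of $\beta$ is easiest to see at its extremes: at $\beta = 0$ the gate is a constant 0.5 and Swish is linear, while for large $\beta$ the gate approaches a step function and Swish approaches ReLU. A small standard-library sketch, with $\beta = 50$ standing in for "large":

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def swish(x, beta=1.0):
    return x * sigmoid(beta * x)

def relu(x):
    return max(0.0, x)

grid = [i / 2.0 for i in range(-10, 11)]  # -5.0 ... 5.0 in steps of 0.5

# beta = 0: the gate is constant 0.5, so Swish is the linear map x / 2
linear_gap = max(abs(swish(x, 0.0) - x / 2.0) for x in grid)

# large beta: the gate is almost a step function, so Swish is close to ReLU
relu_gap = max(abs(swish(x, 50.0) - relu(x)) for x in grid)

print(linear_gap, relu_gap)
```

SiLU ($\beta = 1$) sits between these two regimes.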

Mish¶

\[f(x) = x \cdot \tanh(\text{softplus}(x)) = x \cdot \tanh(\ln(1 + e^x))\]

def mish(x):
    return x * torch.tanh(F.softplus(x))

mish_layer = nn.Mish()

GLU (Gated Linear Unit)¶

A gating mechanism.

\[\text{GLU}(x, W, V) = (xW) \otimes \sigma(xV)\]

class GLU(nn.Module):
    def __init__(self, dim):
        super().__init__()
        self.linear = nn.Linear(dim, dim * 2)

    def forward(self, x):
        x, gate = self.linear(x).chunk(2, dim=-1)
        return x * torch.sigmoid(gate)

# Variant: SwiGLU (used in LLaMA)
class SwiGLU(nn.Module):
    def __init__(self, dim, hidden_dim):
        super().__init__()
        self.w1 = nn.Linear(dim, hidden_dim)
        self.w2 = nn.Linear(hidden_dim, dim)
        self.w3 = nn.Linear(dim, hidden_dim)

    def forward(self, x):
        return self.w2(F.silu(self.w1(x)) * self.w3(x))

Output-layer activations¶

Softmax¶

Used for multi-class classification.

\[\text{softmax}(z_i) = \frac{e^{z_i}}{\sum_{j=1}^{K} e^{z_j}}\]

def softmax(x, dim=-1):
    exp_x = torch.exp(x - x.max(dim=dim, keepdim=True).values)
    return exp_x / exp_x.sum(dim=dim, keepdim=True)

# PyTorch
y = F.softmax(x, dim=-1)
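Subtracting the maximum before exponentiating, as the `softmax` function above does, is what keeps the computation from overflowing. The plain-Python sketch below contrasts a naive version with the shifted one; the logits are chosen to be large on purpose.

```python
import math

def softmax_naive(z):
    exps = [math.exp(v) for v in z]  # overflows for large logits
    total = sum(exps)
    return [e / total for e in exps]

def softmax_stable(z):
    m = max(z)
    exps = [math.exp(v - m) for v in z]  # largest exponent is exactly 0
    total = sum(exps)
    return [e / total for e in exps]

logits = [1000.0, 1001.0, 1002.0]

try:
    softmax_naive(logits)
    naive_ok = True
except OverflowError:
    naive_ok = False

probs = softmax_stable(logits)
shifted = softmax_stable([0.0, 1.0, 2.0])  # same logits minus 1000

print(naive_ok, probs)
```

The two stable results are identical because softmax is invariant to adding a constant to every logit.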

Log Softmax¶

Provides numerical stability when used together with cross-entropy.

y = F.log_softmax(x, dim=-1)
loss = F.nll_loss(y, targets)  # NLL + log_softmax = cross_entropy
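The relationship in the comment above (NLL applied to log-softmax equals cross-entropy) can be verified for a single example without any framework. This is a minimal sketch with made-up logits and target; the log-sum-exp form is used for the log-softmax.

```python
import math

def log_softmax(z):
    m = max(z)
    lse = m + math.log(sum(math.exp(v - m) for v in z))  # log-sum-exp
    return [v - lse for v in z]

def nll(log_probs, target):
    return -log_probs[target]

def cross_entropy(z, target):
    # Direct definition: -log(softmax(z)[target])
    m = max(z)
    exps = [math.exp(v - m) for v in z]
    return -math.log(exps[target] / sum(exps))

logits, target = [2.0, 0.5, -1.0], 0  # illustrative values

loss_a = nll(log_softmax(logits), target)
loss_b = cross_entropy(logits, target)

print(loss_a, loss_b)
```

Working in log space avoids taking the logarithm of a probability that has underflowed to zero.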

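Activation function comparison¶

Function	Range	Pros	Cons	Typical use
Sigmoid	(0, 1)	Probability output	Vanishing gradients	Binary classification output
Tanh	(-1, 1)	Zero-centered	Vanishing gradients	RNN
ReLU	[0, ∞)	Fast, simple	Dead neurons	CNN
Leaky ReLU	(-∞, ∞)	Prevents dead neurons	-	General purpose
GELU	[≈ -0.17, ∞)	Smooth	Computational cost	Transformer
SiLU/Swish	[≈ -0.28, ∞)	Self-gating	Computational cost	EfficientNet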

Selection guide¶

CNN: ReLU, Leaky ReLU
RNN: Tanh (candidate/hidden state), Sigmoid (gates)
Transformer: GELU, SiLU
Output layer:
  - Binary classification: Sigmoid
  - Multi-class classification: Softmax
  - Regression: None (Linear)

Visual comparison¶

import matplotlib.pyplot as plt

x = torch.linspace(-5, 5, 100)

functions = {
    'ReLU': F.relu,
    'Leaky ReLU': lambda x: F.leaky_relu(x, 0.1),
    'GELU': F.gelu,
    'SiLU': F.silu,
    'Tanh': torch.tanh,
    'Sigmoid': torch.sigmoid,
}

fig, axes = plt.subplots(2, 3, figsize=(12, 8))
for ax, (name, fn) in zip(axes.flatten(), functions.items()):
    y = fn(x)
    ax.plot(x.numpy(), y.numpy())
    ax.axhline(y=0, color='k', linestyle='--', linewidth=0.5)
    ax.axvline(x=0, color='k', linestyle='--', linewidth=0.5)
    ax.set_title(name)
    ax.grid(True, alpha=0.3)

plt.tight_layout()

Practical guide¶

Activation function selection flowchart¶

[Figure: activation function selection flowchart]

Performance comparison (empirical)¶

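# Approximate accuracy on CIFAR-10 (same architecture)
# ReLU:       ~92%
# Leaky ReLU: ~92.5%
# GELU:       ~93%
# SiLU:       ~93%
# Mish:       ~93.5%

# The differences are small, but smoother functions tend to do slightly better
# Computational cost should also be taken into account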

Debugging guide¶

Diagnosing dead neurons¶

def check_dead_neurons(model, dataloader, threshold=0.01):
    """Report the share of positive ReLU outputs per layer; a very low share suggests dead neurons."""
    activation_counts = {}

    def count_hook(name):
        def hook(module, input, output):
            if name not in activation_counts:
                activation_counts[name] = {'total': 0, 'active': 0}
            activation_counts[name]['total'] += output.numel()
            activation_counts[name]['active'] += (output > 0).sum().item()
        return hook

    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, nn.ReLU):
            hooks.append(module.register_forward_hook(count_hook(name)))

    model.eval()
    with torch.no_grad():
        for batch in dataloader:
            x = batch[0] if isinstance(batch, (list, tuple)) else batch
            _ = model(x)

    for h in hooks:
        h.remove()

    print("=== Neuron activation statistics ===")
    for name, counts in activation_counts.items():
        active_ratio = counts['active'] / counts['total']
        status = "OK" if active_ratio > threshold else "WARNING"
        print(f"{name}: {100*active_ratio:.1f}% active [{status}]")

        if active_ratio < threshold:
            print("  → Fix: use Leaky ReLU or lower the learning rate")

# Usage
check_dead_neurons(model, train_loader)

Diagnosing Sigmoid/Tanh saturation¶

def check_saturation(model, dataloader):
    """Check whether Sigmoid/Tanh outputs are saturated."""
    saturation_stats = {}

    def saturation_hook(name):
        def hook(module, input, output):
            if name not in saturation_stats:
                saturation_stats[name] = []

            # Saturated region: |output| > 0.9 (Tanh), or output > 0.9 or < 0.1 (Sigmoid)
            if isinstance(module, nn.Tanh):
                saturated = (output.abs() > 0.9).float().mean().item()
            else:  # Sigmoid
                saturated = ((output > 0.9) | (output < 0.1)).float().mean().item()
            saturation_stats[name].append(saturated)
        return hook

    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, (nn.Sigmoid, nn.Tanh)):
            hooks.append(module.register_forward_hook(saturation_hook(name)))

    model.eval()
    with torch.no_grad():
        for batch in dataloader:
            x = batch[0] if isinstance(batch, (list, tuple)) else batch
            _ = model(x)
            break  # first batch only

    for h in hooks:
        h.remove()

    print("=== Saturation statistics ===")
    for name, stats in saturation_stats.items():
        sat_pct = 100 * sum(stats) / len(stats)
        status = "OK" if sat_pct < 30 else "WARNING"
        print(f"{name}: {sat_pct:.1f}% saturated [{status}]")

        if sat_pct > 30:
            print("  → Fix: normalize the inputs and check the weight initialization")

# Usage
check_saturation(model, train_loader)

Visualizing activation distributions¶

def visualize_activations(model, sample_input, save_path=None):
    """Plot the activation distribution of each activation layer."""
    activations = {}

    def hook(name):
        def fn(module, input, output):
            activations[name] = output.detach().cpu()
        return fn

    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, (nn.ReLU, nn.GELU, nn.SiLU, nn.Tanh, nn.Sigmoid)):
            hooks.append(module.register_forward_hook(hook(name)))

    with torch.no_grad():
        _ = model(sample_input)

    for h in hooks:
        h.remove()

    # Plot
    n = len(activations)
    fig, axes = plt.subplots(1, n, figsize=(4*n, 4))
    if n == 1:
        axes = [axes]

    for ax, (name, act) in zip(axes, activations.items()):
        ax.hist(act.flatten().numpy(), bins=50, density=True, alpha=0.7)
        ax.set_title(f'{name}\nmean={act.mean():.3f}, std={act.std():.3f}')
        ax.axvline(x=0, color='r', linestyle='--', alpha=0.5)

    plt.tight_layout()
    if save_path:
        plt.savefig(save_path)
    plt.show()

# Usage
sample = torch.randn(32, 784)
visualize_activations(model, sample)

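Common problems and fixes¶

Problem	Symptom	Fix
Dying ReLU	Many neurons always output 0	Leaky ReLU, PReLU, He initialization
Saturation	Sigmoid/Tanh outputs sit at extreme values	Input normalization, BatchNorm
Vanishing gradients	Gradients in deep layers are very small	ReLU family, residual connections
Unstable outputs	Activation values are too large or too small	Normalization, proper initialization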
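He initialization, listed above as a remedy for dying ReLU, draws weights with standard deviation $\sqrt{2 / \text{fan\_in}}$. The factor 2 compensates for ReLU zeroing half of a zero-mean input. The sketch below is a rough Monte Carlo check with the standard library; the helper name `he_std`, the seed, the sample size, and `fan_in = 512` are all arbitrary choices, and the variance formula assumes independent zero-mean weights.

```python
import math
import random

def he_std(fan_in):
    # He (Kaiming) normal initialization for ReLU: std = sqrt(2 / fan_in)
    return math.sqrt(2.0 / fan_in)

random.seed(0)
n = 200_000

# Second moment of ReLU(z) for z ~ N(0, 1); theory gives exactly 1/2.
second_moment = sum(max(0.0, random.gauss(0.0, 1.0)) ** 2 for _ in range(n)) / n

fan_in = 512
# Variance of the next pre-activation: fan_in * Var(w) * E[ReLU(z)^2]
next_var = fan_in * he_std(fan_in) ** 2 * second_moment

print(f"E[ReLU(z)^2] ~ {second_moment:.3f}, next-layer variance ~ {next_var:.3f}")
```

Keeping the pre-activation variance near 1 from layer to layer makes it less likely that whole layers drift into the region where ReLU outputs only zeros.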

Building a custom activation function¶

class CustomActivation(nn.Module):
    """Example: a Mish variant"""
    def __init__(self, beta=1.0):
        super().__init__()
        self.beta = beta

    def forward(self, x):
        return x * torch.tanh(F.softplus(self.beta * x))

# More efficient implementation using torch.autograd.Function (saves only x and recomputes the rest in backward)
class MishFunction(torch.autograd.Function):
    @staticmethod
    def forward(ctx, x):
        ctx.save_for_backward(x)
        return x * torch.tanh(F.softplus(x))

    @staticmethod
    def backward(ctx, grad_output):
        x, = ctx.saved_tensors
        sp = F.softplus(x)
        grad_x = torch.tanh(sp) + x * (1 - torch.tanh(sp) ** 2) * torch.sigmoid(x)
        return grad_output * grad_x

mish = MishFunction.apply
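A hand-written `backward` is easy to get wrong, so it is worth comparing the analytic gradient against a finite-difference estimate. The sketch below repeats the same derivative formula as `MishFunction.backward` in plain Python and checks it with central differences; the test points and step size are arbitrary, and the helper names are chosen for illustration. With PyTorch available, `torch.autograd.gradcheck` on double-precision inputs serves the same purpose.

```python
import math

def softplus(x):
    # log(1 + e^x), written to avoid overflow for large |x|
    return max(x, 0.0) + math.log1p(math.exp(-abs(x)))

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

def mish(x):
    return x * math.tanh(softplus(x))

def mish_grad(x):
    # Same expression as in MishFunction.backward (before scaling by grad_output)
    t = math.tanh(softplus(x))
    return t + x * (1.0 - t * t) * sigmoid(x)

def central_diff(f, x, h=1e-5):
    return (f(x + h) - f(x - h)) / (2.0 * h)

points = [-3.0, -1.0, 0.0, 0.7, 2.5]
max_err = max(abs(mish_grad(x) - central_diff(mish, x)) for x in points)

print(f"max |analytic - numeric| = {max_err:.2e}")
```

A small discrepancy on the order of the finite-difference error indicates that the analytic formula matches the forward pass.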