Neural Network Basics

A computational model inspired by biological neural networks. It takes an input and produces an output through nonlinear transformations.

Perceptron

The simplest unit of a neural network.

Structure

neural-networks diagram 1

Equations

$$z = \sum_{i=1}^{n} w_i x_i + b = \mathbf{w}^T \mathbf{x} + b$$ $$y = f(z)$$

import numpy as np

class Perceptron:
    def __init__(self, n_inputs):
        self.weights = np.random.randn(n_inputs) * 0.01
        self.bias = 0

    def forward(self, x):
        z = np.dot(self.weights, x) + self.bias
        return self.activation(z)

    def activation(self, z):
        return 1 if z > 0 else 0  # Step function

    def train(self, X, y, learning_rate=0.1, epochs=100):
        for _ in range(epochs):
            for xi, yi in zip(X, y):
                prediction = self.forward(xi)
                error = yi - prediction
                self.weights += learning_rate * error * xi
                self.bias += learning_rate * error

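Limitation: the XOR problem

A single-layer perceptron can solve only linearly separable problems.

AND, OR: linearly separable (solvable with a single layer)
XOR: not linearly separable (requires multiple layers)

XOR truth table: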
x1  x2  y
0   0   0
0   1   1
1   0   1
1   1   0
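
The limitation can be checked empirically. The sketch below reimplements the perceptron learning rule from the class above as a compact function and trains it on the AND, OR, and XOR truth tables. The names `train_perceptron` and `accuracy` are illustrative helpers, and zero initialization with a learning rate of 1.0 is an assumption chosen to keep the arithmetic exact.

```python
import numpy as np

def train_perceptron(X, y, learning_rate=1.0, epochs=100):
    """Perceptron learning rule with a step activation (illustrative helper)."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(epochs):
        for xi, yi in zip(X, y):
            prediction = 1 if np.dot(w, xi) + b > 0 else 0
            error = yi - prediction
            w += learning_rate * error * xi
            b += learning_rate * error
    return w, b

def accuracy(w, b, X, y):
    """Fraction of rows classified correctly by the step-activated perceptron."""
    predictions = (X @ w + b > 0).astype(int)
    return float(np.mean(predictions == y))

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
targets = {
    "AND": np.array([0, 0, 0, 1]),
    "OR":  np.array([0, 1, 1, 1]),
    "XOR": np.array([0, 1, 1, 0]),
}

results = {}
for name, y in targets.items():
    w, b = train_perceptron(X, y)
    results[name] = accuracy(w, b, X, y)
    print(f"{name}: accuracy = {results[name]:.2f}")
```

AND and OR reach an accuracy of 1.00. XOR stays below 1.00 however long training runs, because no single line separates its two classes.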

Multilayer Perceptron (MLP)

Stacks multiple layers to solve nonlinear problems.

Structure

neural-networks diagram 2

Equations

Output of layer $l$, where $\mathbf{a}^{(0)} = \mathbf{x}$ is the input: $$\mathbf{z}^{(l)} = \mathbf{W}^{(l)} \mathbf{a}^{(l-1)} + \mathbf{b}^{(l)}$$ $$\mathbf{a}^{(l)} = f(\mathbf{z}^{(l)})$$
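
As a worked instance of these layer equations, the sketch below evaluates a two-layer network on the XOR inputs. The weights are hand-picked for illustration (an assumption, not learned values): with ReLU as $f$, the hidden layer computes `relu(x1 + x2)` and `relu(x1 + x2 - 1)`, and the output is their combination `h1 - 2 * h2`.

```python
import numpy as np

def relu(z):
    return np.maximum(0.0, z)

# Hand-picked parameters (an assumption for illustration, not learned weights)
W1 = np.array([[1.0, 1.0],
               [1.0, 1.0]])   # W^(1): both hidden units start from x1 + x2
b1 = np.array([0.0, -1.0])    # b^(1): the second unit only fires when both inputs are 1
W2 = np.array([[1.0, -2.0]])  # W^(2): output = h1 - 2 * h2
b2 = np.array([0.0])          # b^(2)

X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)

outputs = []
for x in X:
    a1 = relu(W1 @ x + b1)    # a^(1) = f(W^(1) a^(0) + b^(1))
    a2 = W2 @ a1 + b2         # output layer left linear
    outputs.append(a2.item())

print(outputs)  # [0.0, 1.0, 1.0, 0.0]
```

The hidden layer is what makes this possible: the second hidden unit subtracts out the `(1, 1)` case that a single linear unit cannot handle.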

import torch
import torch.nn as nn

class MLP(nn.Module):
    def __init__(self, input_dim, hidden_dims, output_dim):
        super().__init__()

        layers = []
        prev_dim = input_dim

        for hidden_dim in hidden_dims:
            layers.append(nn.Linear(prev_dim, hidden_dim))
            layers.append(nn.ReLU())
            prev_dim = hidden_dim

        layers.append(nn.Linear(prev_dim, output_dim))

        self.network = nn.Sequential(*layers)

    def forward(self, x):
        return self.network(x)

# Usage
model = MLP(input_dim=784, hidden_dims=[256, 128], output_dim=10)
output = model(torch.randn(32, 784))  # (batch_size, input_dim)

Solving the XOR problem

import torch
import torch.nn as nn

# XOR data
X = torch.tensor([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=torch.float32)
y = torch.tensor([[0], [1], [1], [0]], dtype=torch.float32)

# Two-layer MLP
model = nn.Sequential(
    nn.Linear(2, 4),
    nn.ReLU(),
    nn.Linear(4, 1),
    nn.Sigmoid()
)

criterion = nn.BCELoss()
optimizer = torch.optim.Adam(model.parameters(), lr=0.1)

for epoch in range(1000):
    output = model(X)
    loss = criterion(output, y)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

print(model(X).round())  # usually [[0], [1], [1], [0]]; the result depends on the random initialization

Forward Propagation

Computes layer by layer, from input to output.

Computation steps

import numpy as np

def forward_pass(X, weights, biases, activation_fn):
    """
    X: input, shape (batch_size, input_dim)
    weights: list of weight matrices, one per layer, each (in_dim, out_dim)
    biases: list of bias vectors, one per layer
    Returns the list of activations of every layer, starting with X itself.
    """
    activations = [X]

    for i, (W, b) in enumerate(zip(weights, biases)):
        z = activations[-1] @ W + b  # linear transformation

        if i < len(weights) - 1:
            a = activation_fn(z)  # hidden layer: apply the activation
        else:
            a = z  # output layer: no activation (or Softmax)

        activations.append(a)

    return activations

# Example
X = np.random.randn(32, 784)  # one batch
W1 = np.random.randn(784, 256) * 0.01
b1 = np.zeros(256)
W2 = np.random.randn(256, 10) * 0.01
b2 = np.zeros(10)

activations = forward_pass(X, [W1, W2], [b1, b2], lambda x: np.maximum(0, x))
output = activations[-1]  # (32, 10)

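Vectorized operations

# Assumes X: (batch_size, input_dim), W: (input_dim, output_dim), b: (output_dim,)

# Per-sample processing (slow)
for i in range(X.shape[0]):
    z = X[i] @ W + b

# Vectorized (fast)
Z = X @ W + b  # one matrix multiplication for the whole batch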
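
A runnable version of the comparison, as a minimal sketch: the shapes are assumed to match the forward-pass example (batch of 32, 784 inputs, 256 outputs), and `Z_loop` is an illustrative name for the per-sample result.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((32, 784))          # (batch_size, input_dim)
W = rng.standard_normal((784, 256)) * 0.01  # (input_dim, output_dim)
b = np.zeros(256)

# Per-sample loop: one vector-matrix product per row
Z_loop = np.empty((X.shape[0], W.shape[1]))
for i in range(X.shape[0]):
    Z_loop[i] = X[i] @ W + b

# Vectorized: a single matrix product; b is broadcast across the batch
Z = X @ W + b

print(Z.shape, np.allclose(Z, Z_loop))
```

Both versions compute the same values. The vectorized form hands the whole batch to one optimized matrix multiplication instead of looping in Python.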

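Universal Approximation Theorem

An MLP with a single, sufficiently wide hidden layer can approximate any continuous function on a compact domain to arbitrary accuracy.

Conditions:
1. A nonlinear (non-polynomial) activation function in the hidden layer
2. A sufficient number of hidden neurons

In practice:
- Deep networks are often more efficient
- They can represent complex functions with fewer parameters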
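
To make the width condition concrete, the sketch below builds a one-hidden-layer ReLU network by hand (no training) that interpolates `sin` on `[0, 2π]`, then measures how the worst-case error changes with the number of hidden units. The construction and the name `relu_network` are illustrative assumptions. The theorem itself only states that suitable weights exist, not how to find them.

```python
import numpy as np

def relu_network(f, n_hidden, lo, hi):
    """One-hidden-layer ReLU network that interpolates f at n_hidden + 1 evenly spaced knots."""
    knots = np.linspace(lo, hi, n_hidden + 1)
    values = f(knots)
    slopes = np.diff(values) / np.diff(knots)            # slope of each linear segment
    hidden_bias = -knots[:-1]                             # unit k computes relu(x - knot_k)
    # Output weights: the first slope, then the change in slope at each interior knot
    output_weights = np.concatenate([slopes[:1], np.diff(slopes)])
    output_bias = values[0]

    def net(x):
        hidden = np.maximum(0.0, x[:, None] + hidden_bias[None, :])
        return hidden @ output_weights + output_bias
    return net

x = np.linspace(0.0, 2 * np.pi, 1000)
errors = {}
for n_hidden in (4, 16, 64):
    net = relu_network(np.sin, n_hidden, 0.0, 2 * np.pi)
    errors[n_hidden] = float(np.max(np.abs(net(x) - np.sin(x))))
    print(f"{n_hidden:3d} hidden units: max error = {errors[n_hidden]:.4f}")
```

The maximum error shrinks as the hidden layer widens, which is the behavior the theorem describes for a single hidden layer.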

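Neural network components

Layer types

Layer	Description	Typical use
Linear (Dense)	Fully connected	MLP, output layer
Conv	Convolution	Images, sequences
RNN/LSTM	Recurrent	Sequences
Attention	Attention	Transformer
Embedding	Embedding lookup	Categorical features, text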

PyTorch layer examples

import torch.nn as nn

# Fully connected layer
linear = nn.Linear(in_features=256, out_features=128)

# Convolution layer
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)

# LSTM
lstm = nn.LSTM(input_size=256, hidden_size=512, num_layers=2, batch_first=True)

# Embedding layer
embedding = nn.Embedding(num_embeddings=50000, embedding_dim=768)

# Attention (Transformer)
attention = nn.MultiheadAttention(embed_dim=512, num_heads=8)
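
The layers differ mainly in the tensor shapes they expect and return. The sketch below passes random inputs through each layer type and prints the output shapes. Two choices here are assumptions made for the example: the embedding uses a smaller vocabulary to keep it light, and `batch_first=True` is set on the attention layer so that every tensor is batch-major (the listing above uses the default, which expects `(seq, batch, embed)`).

```python
import torch
import torch.nn as nn

batch, seq_len = 4, 10

linear = nn.Linear(in_features=256, out_features=128)
conv = nn.Conv2d(in_channels=3, out_channels=64, kernel_size=3, padding=1)
lstm = nn.LSTM(input_size=256, hidden_size=512, num_layers=2, batch_first=True)
embedding = nn.Embedding(num_embeddings=1000, embedding_dim=768)  # smaller vocabulary than above
attention = nn.MultiheadAttention(embed_dim=512, num_heads=8, batch_first=True)

linear_out = linear(torch.randn(batch, 256))                      # (4, 128)
conv_out = conv(torch.randn(batch, 3, 32, 32))                    # (4, 64, 32, 32)
lstm_out, (h_n, c_n) = lstm(torch.randn(batch, seq_len, 256))     # (4, 10, 512); h_n: (2, 4, 512)
embed_out = embedding(torch.randint(0, 1000, (batch, seq_len)))   # (4, 10, 768)
tokens = torch.randn(batch, seq_len, 512)
attn_out, attn_weights = attention(tokens, tokens, tokens)        # (4, 10, 512); weights: (4, 10, 10)

for name, t in [("linear", linear_out), ("conv", conv_out), ("lstm", lstm_out),
                ("embedding", embed_out), ("attention", attn_out)]:
    print(f"{name:10s} {tuple(t.shape)}")
```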

Weight initialization

Initialization methods

Method	Distribution	Activation function
Xavier/Glorot	$\mathcal{U}(-\sqrt{6/(n_{in}+n_{out})}, \sqrt{6/(n_{in}+n_{out})})$	Sigmoid, Tanh
He/Kaiming	$\mathcal{N}(0, \sigma^2)$, $\sigma = \sqrt{2/n_{in}}$	ReLU
LeCun	$\mathcal{N}(0, \sigma^2)$, $\sigma = \sqrt{1/n_{in}}$	SELU

# PyTorch initialization
def init_weights(module):
    if isinstance(module, nn.Linear):
        # He initialization (for ReLU)
        nn.init.kaiming_normal_(module.weight, nonlinearity='relu')
        if module.bias is not None:
            nn.init.zeros_(module.bias)

    elif isinstance(module, nn.Embedding):
        nn.init.normal_(module.weight, mean=0, std=0.02)

model.apply(init_weights)  # calls init_weights on every submodule

# Or initialize a single layer directly
layer = nn.Linear(256, 128)
nn.init.xavier_uniform_(layer.weight)
nn.init.zeros_(layer.bias)

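Why initialization matters

Initialization too small:
- Vanishing gradients
- The outputs of every layer shrink toward 0

Initialization too large:
- Exploding gradients
- Unstable training

Appropriate initialization:
- Variance is preserved from layer to layer
- Stable training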
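
The effect of the weight scale can be observed directly. The sketch below pushes a random batch through 20 ReLU layers and reports the standard deviation of the final activations for three weight scales. The width, depth, and the helper name `final_activation_std` are assumptions for illustration. It measures only the forward signal; gradients on the backward pass are affected in an analogous way.

```python
import numpy as np

def final_activation_std(weight_std_fn, width=512, depth=20, batch=256, seed=0):
    """Push a random batch through `depth` ReLU layers and return the std of the last activations."""
    rng = np.random.default_rng(seed)
    a = rng.standard_normal((batch, width))
    for _ in range(depth):
        W = rng.standard_normal((width, width)) * weight_std_fn(width)
        a = np.maximum(0.0, a @ W)   # linear layer (no bias) followed by ReLU
    return float(a.std())

schemes = {
    "small": lambda n: 0.01,               # too small
    "he":    lambda n: np.sqrt(2.0 / n),   # He/Kaiming: std = sqrt(2 / n_in)
    "large": lambda n: 1.0,                # too large
}

stds = {name: final_activation_std(fn) for name, fn in schemes.items()}
for name, value in stds.items():
    print(f"{name:6s} final activation std = {value:.3e}")
```

With the small scale the signal collapses toward zero, with the large scale it blows up by many orders of magnitude, and with the He scale it stays on the order of 1.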

Counting parameters

def count_parameters(model):
    """Count the trainable parameters of a model."""
    return sum(p.numel() for p in model.parameters() if p.requires_grad)

# Linear(784, 256): 784 * 256 + 256 = 200,960
# Linear(256, 10): 256 * 10 + 10 = 2,570
# Total: 203,530

model = MLP(784, [256], 10)
print(f"Parameters: {count_parameters(model):,}")

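Neural network design principles

Depth vs. width

Shallow, wide networks:
- Easy to parallelize
- Risk of overfitting on some problems

Deep, narrow networks:
- Learn hierarchical representations
- Risk of vanishing/exploding gradients
- Mitigated by residual connections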

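Bottleneck structure

import torch.nn as nn

# Bottleneck block (ResNet-style)
class Bottleneck(nn.Module):
    def __init__(self, in_channels, out_channels):
        super().__init__()
        mid = out_channels // 4  # narrower channel count for the 3x3 convolution

        self.conv1 = nn.Conv2d(in_channels, mid, 1)      # 1x1: reduce channels
        self.conv2 = nn.Conv2d(mid, mid, 3, padding=1)   # 3x3: spatial processing
        self.conv3 = nn.Conv2d(mid, out_channels, 1)     # 1x1: restore channels
        self.relu = nn.ReLU()
        # The skip path must have out_channels too; project with a 1x1 conv when it does not
        self.shortcut = (nn.Identity() if in_channels == out_channels
                         else nn.Conv2d(in_channels, out_channels, 1))

    def forward(self, x):
        out = self.relu(self.conv1(x))
        out = self.relu(self.conv2(out))
        out = self.conv3(out)
        return self.relu(out + self.shortcut(x))  # Residual connection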

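CNN (Convolutional Neural Network)

An architecture specialized for data with spatial or temporal structure, such as images and sequences.

Key ideas

1. Local connectivity
   - Each unit connects only to a neighborhood of pixels, not to every pixel
   - Greatly reduces the number of parameters

2. Weight sharing
   - The same filter is applied across the entire image
   - Detects the same feature wherever it appears in the image

3. Hierarchical feature learning
   - Early layers: edges, colors
   - Middle layers: textures, patterns
   - Late layers: object parts, semantic features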
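
The parameter savings from local connectivity and weight sharing can be put in numbers. The sketch below compares a fully connected mapping with a 3x3 convolution that produces the same output size. The image and channel sizes are assumptions for illustration, and the dense count is computed arithmetically instead of instantiating the layer.

```python
import torch.nn as nn

# Input: 3 x 32 x 32 image; output: 64 feature maps of 32 x 32 (sizes assumed for illustration)
in_c, out_c, h, w, k = 3, 64, 32, 32, 3

# Fully connected: every output value connects to every input value, plus one bias per output
dense_params = (in_c * h * w) * (out_c * h * w) + (out_c * h * w)

# Convolution: one k x k filter per (input channel, output channel) pair, shared across positions
conv = nn.Conv2d(in_c, out_c, kernel_size=k, padding=1)
conv_params = sum(p.numel() for p in conv.parameters())

print(f"dense: {dense_params:,}")
print(f"conv:  {conv_params:,}")   # 3 * 64 * 3 * 3 + 64 = 1,792
```

The convolution's parameter count does not depend on the image size at all, only on the kernel size and the channel counts.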

Basic operations

import torch.nn as nn

# Convolution layer
conv = nn.Conv2d(
    in_channels=3,      # input channels (RGB)
    out_channels=64,    # output channels (number of filters)
    kernel_size=3,      # filter size
    stride=1,           # step between filter positions
    padding=1           # padding ("same" output size for a 3x3 kernel with stride 1)
)

# Output size
# H_out = floor((H_in + 2*padding - kernel_size) / stride) + 1

# Pooling layers (downsampling)
maxpool = nn.MaxPool2d(kernel_size=2, stride=2)
avgpool = nn.AvgPool2d(kernel_size=2, stride=2)
adaptive_pool = nn.AdaptiveAvgPool2d((1, 1))  # Global Average Pooling
A simple CNN

class SimpleCNN(nn.Module):
    def __init__(self, num_classes=10):
        super().__init__()

        self.features = nn.Sequential(
            # Conv Block 1
            nn.Conv2d(3, 32, kernel_size=3, padding=1),
            nn.BatchNorm2d(32),
            nn.ReLU(),
            nn.MaxPool2d(2),

            # Conv Block 2
            nn.Conv2d(32, 64, kernel_size=3, padding=1),
            nn.BatchNorm2d(64),
            nn.ReLU(),
            nn.MaxPool2d(2),

            # Conv Block 3
            nn.Conv2d(64, 128, kernel_size=3, padding=1),
            nn.BatchNorm2d(128),
            nn.ReLU(),
            nn.AdaptiveAvgPool2d((1, 1))
        )

        self.classifier = nn.Linear(128, num_classes)

    def forward(self, x):
        x = self.features(x)
        x = x.view(x.size(0), -1)  # Flatten
        x = self.classifier(x)
        return x

# Input: (batch, 3, 32, 32) -> output: (batch, 10)
model = SimpleCNN()
output = model(torch.randn(16, 3, 32, 32))

RNN (Recurrent Neural Network)

An architecture for processing sequence data. Information from previous time steps is passed on to the current time step.

Basic structure

neural-networks diagram 3
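
The recurrence can be written out step by step. The sketch below unrolls a single-layer tanh RNN by hand, using the weights of an `nn.RNN` module, and checks the result against the module's own output. The small sizes are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
rnn = nn.RNN(input_size=8, hidden_size=16, batch_first=True)  # small sizes for illustration
x = torch.randn(4, 5, 8)                                      # (batch, seq_len, input_size)

W_ih, W_hh = rnn.weight_ih_l0, rnn.weight_hh_l0               # (16, 8), (16, 16)
b_ih, b_hh = rnn.bias_ih_l0, rnn.bias_hh_l0

with torch.no_grad():
    h = torch.zeros(4, 16)                                    # h_0 = 0
    manual_steps = []
    for t in range(x.size(1)):
        # h_t = tanh(x_t W_ih^T + b_ih + h_{t-1} W_hh^T + b_hh)
        h = torch.tanh(x[:, t] @ W_ih.T + b_ih + h @ W_hh.T + b_hh)
        manual_steps.append(h)
    manual_output = torch.stack(manual_steps, dim=1)          # (batch, seq_len, hidden_size)

    output, h_n = rnn(x)

matches = torch.allclose(manual_output, output, atol=1e-5)
print(tuple(output.shape), matches)
```

The same weights are reused at every time step, and the hidden state `h` is the only channel through which earlier inputs influence later outputs.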

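Implementation

import torch
import torch.nn as nn

# Basic RNN
rnn = nn.RNN(
    input_size=256,     # input feature dimension
    hidden_size=512,    # hidden state dimension
    num_layers=2,       # number of stacked RNN layers
    batch_first=True,   # tensors are (batch, seq, feature)
    dropout=0.1,        # dropout between layers
    bidirectional=True  # process the sequence in both directions
)

# LSTM (Long Short-Term Memory)
# Learns long-term dependencies and mitigates vanishing gradients
lstm = nn.LSTM(
    input_size=256,
    hidden_size=512,
    num_layers=2,
    batch_first=True,
    bidirectional=True
)

# GRU (Gated Recurrent Unit)
# A simplified variant of LSTM with comparable performance
gru = nn.GRU(
    input_size=256,
    hidden_size=512,
    num_layers=2,
    batch_first=True
)

# Usage
x = torch.randn(32, 100, 256)  # (batch, seq_len, input_size)
output, hidden = rnn(x)        # output: (32, 100, 1024), 2 * hidden_size because bidirectional
                               # hidden: (4, 32, 512), num_layers * 2 directions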

Sequence classification

class TextClassifier(nn.Module):
    def __init__(self, vocab_size, embed_dim, hidden_dim, num_classes):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, embed_dim)
        self.lstm = nn.LSTM(embed_dim, hidden_dim, batch_first=True, bidirectional=True)
        self.classifier = nn.Linear(hidden_dim * 2, num_classes)

    def forward(self, x):
        embedded = self.embedding(x)           # (batch, seq, embed)
        output, (hidden, cell) = self.lstm(embedded)

        # Use the final hidden states: hidden[-2] is the forward direction, hidden[-1] the backward
        hidden = torch.cat([hidden[-2], hidden[-1]], dim=-1)  # combine both directions
        logits = self.classifier(hidden)
        return logits

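Practical guide

Choosing an architecture

Data	Recommended architecture	Reason
Image classification	CNN (ResNet, EfficientNet)	Spatial invariance
Time-series forecasting	LSTM, Transformer	Temporal dependencies
Text classification	Transformer (BERT)	Long-range dependencies
Tabular data	MLP, Gradient Boosting	Simple structure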

Deciding model size

def estimate_model_size(model, input_size=(1, 3, 224, 224)):
    """Estimate model size and compute cost (requires the torchinfo package)."""
    from torchinfo import summary

    info = summary(model, input_size=input_size, verbose=0)

    params_mb = info.total_params * 4 / (1024 ** 2)  # assuming float32 (4 bytes per parameter)

    print(f"Parameters: {info.total_params:,}")
    print(f"Model size: {params_mb:.2f} MB")
    print(f"Estimated MACs: {info.total_mult_adds:,}")

    return info

# Under memory constraints, consider:
# - Smaller hidden dimensions
# - Fewer layers
# - Efficient architectures (MobileNet, EfficientNet)
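
When torchinfo is not available, the parameter memory can be estimated with plain PyTorch. The sketch below is one way to do it; `parameter_memory_mb` is an illustrative name, and the figure covers parameters only, not activations, gradients, or optimizer state.

```python
import torch.nn as nn

def parameter_memory_mb(model):
    """Memory taken by parameters alone, in MB (1 MB = 1024**2 bytes)."""
    total_bytes = sum(p.numel() * p.element_size() for p in model.parameters())
    return total_bytes / (1024 ** 2)

model = nn.Sequential(nn.Linear(784, 256), nn.ReLU(), nn.Linear(256, 10))  # 203,530 parameters
size_fp32 = parameter_memory_mb(model)
size_fp16 = parameter_memory_mb(model.half())   # .half() converts the parameters to float16 in place

print(f"float32: {size_fp32:.3f} MB, float16: {size_fp16:.3f} MB")
```

Using `element_size()` instead of a fixed 4 bytes keeps the estimate correct when the model is stored in a different precision.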

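Debugging guide

Common problems and fixes

Problem	Symptom	Fix
Not learning	Loss does not decrease	Tune the LR, check initialization
Vanishing gradients	Gradients ≈ 0 in the early layers of a deep network	ReLU, ResNet, initialization
Exploding gradients	Loss = NaN or Inf	Gradient clipping, lower LR
Overfitting	Train loss ↓, val loss ↑	Dropout, data augmentation
Underfitting	Train loss does not decrease	More model capacity, higher LR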
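
Gradient clipping, listed above as a fix for exploding gradients, goes between `loss.backward()` and `optimizer.step()`. The sketch below uses `torch.nn.utils.clip_grad_norm_` on a toy model; the model, the inflated inputs, and the threshold of 1.0 are assumptions for illustration.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)
model = nn.Sequential(nn.Linear(10, 32), nn.ReLU(), nn.Linear(32, 1))
optimizer = torch.optim.SGD(model.parameters(), lr=0.01)
criterion = nn.MSELoss()

x = torch.randn(16, 10) * 100.0   # deliberately large inputs to produce large gradients
y = torch.randn(16, 1)

optimizer.zero_grad()
loss = criterion(model(x), y)
loss.backward()

max_norm = 1.0                    # clipping threshold (a hyperparameter; 1.0 is an assumption)
# Rescales all gradients in place so their combined L2 norm is at most max_norm;
# the return value is the norm measured before clipping
norm_before = torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=max_norm)
norm_after = torch.sqrt(sum(p.grad.pow(2).sum() for p in model.parameters()))

optimizer.step()                  # the update uses the clipped gradients
print(f"gradient norm: {float(norm_before):.2f} -> {float(norm_after):.2f}")
```

Clipping by norm keeps the direction of the gradient and only shortens it, so it limits the size of a single update without changing where the update points.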

Debugging checklist

def debug_training(model, train_loader, criterion, optimizer):
    """Training debug checklist."""

    # 1. Overfit a single batch
    print("=== Single-batch overfitting test ===")
    x, y = next(iter(train_loader))

    for i in range(100):
        optimizer.zero_grad()
        output = model(x)
        loss = criterion(output, y)
        loss.backward()
        optimizer.step()

        if i % 20 == 0:
            print(f"Step {i}: Loss = {loss.item():.4f}")

    if loss.item() > 0.1:
        print("[WARNING] Cannot overfit even a single batch - check the model structure or LR")

    # 2. Gradient check (gradients left over from the last step above)
    print("\n=== Gradient statistics ===")
    for name, param in model.named_parameters():
        if param.grad is not None:
            grad = param.grad
            print(f"{name}: grad_mean={grad.mean():.6f}, grad_std={grad.std():.6f}, "
                  f"grad_max={grad.max():.6f}")

    # 3. Activation check
    print("\n=== Activation statistics ===")
    activations = {}

    def hook_fn(name):
        def hook(module, input, output):
            activations[name] = output.detach()
        return hook

    hooks = []
    for name, module in model.named_modules():
        if isinstance(module, (nn.ReLU, nn.GELU)):
            hooks.append(module.register_forward_hook(hook_fn(name)))

    with torch.no_grad():
        _ = model(x)

    for name, act in activations.items():
        # Share of activations that are exactly zero on this batch
        dead_pct = 100 * (act == 0).float().mean().item()
        print(f"{name}: dead_neurons={dead_pct:.1f}%")
        if dead_pct > 50:
            print("  [WARNING] High ratio of dead neurons!")

    for h in hooks:
        h.remove()

# Usage
debug_training(model, train_loader, criterion, optimizer)

When the loss does not decrease

# 1. Check the learning rate
# Too high: oscillation, divergence
# Too low: very slow convergence

# LR range test (requires the torch-lr-finder package)
from torch_lr_finder import LRFinder

lr_finder = LRFinder(model, optimizer, criterion)
lr_finder.range_test(train_loader, end_lr=10, num_iter=100)
lr_finder.plot()   # rule of thumb: use about 1/10 of the LR where the loss falls most steeply
lr_finder.reset()  # restore the model and optimizer to their state before the test

# 2. Check the data
print("Data sample check:")
x, y = next(iter(train_loader))
print(f"X shape: {x.shape}, dtype: {x.dtype}")
print(f"Y shape: {y.shape}, dtype: {y.dtype}")
print(f"X range: [{x.min():.2f}, {x.max():.2f}]")
print(f"Y unique: {torch.unique(y)}")

# 3. Check the model output
with torch.no_grad():
    output = model(x)
print(f"Output shape: {output.shape}")
print(f"Output range: [{output.min():.2f}, {output.max():.2f}]")

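When NaN/Inf appears

# NaN/Inf detection hook
def nan_hook(module, input, output):
    # Some modules (e.g. nn.LSTM) return tuples; only plain tensor outputs are checked here
    if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
        print(f"NaN/Inf detected in {module.__class__.__name__}")
        print(f"Input stats: mean={input[0].mean()}, std={input[0].std()}")
        raise RuntimeError("NaN/Inf detected!")

for module in model.modules():
    module.register_forward_hook(nan_hook)

# Common causes and fixes:
# 1. Learning rate too high → lower the LR
# 2. Zero passed to log → log(x + eps)
# 3. Division problems → add eps to the denominator
# 4. Initialization problems → Xavier/He initialization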
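
The `log(x + eps)` fix from the list above can be sketched as follows. The value of `eps` is an assumption, and the clamped variant is an alternative not mentioned in the list.

```python
import torch

eps = 1e-8                                   # small constant; the exact value is an assumption
probs = torch.tensor([0.0, 0.5, 1.0])

unsafe = torch.log(probs)                    # log(0) = -inf, which then spreads into the loss
safe = torch.log(probs + eps)                # every input is shifted away from zero
clamped = torch.log(probs.clamp(min=eps))    # alternative: values above eps are left unchanged

print(unsafe)
print(safe)
print(clamped)
```

The same idea applies to division: adding `eps` to a denominator that can reach zero keeps the result finite.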
