
Reinforcement Learning Overview

Reinforcement learning (RL) is a machine learning paradigm in which an agent learns a policy that maximizes cumulative reward by interacting with an environment. It is applied to game AI, robot control, recommendation systems, and more.


Core Concepts

MDP (Markov Decision Process)

\[\mathcal{M} = (S, A, P, R, \gamma)\]
| Component | Description |
|-----------|-------------|
| \(S\) | State space |
| \(A\) | Action space |
| \(P(s' \mid s, a)\) | Transition probability |
| \(R(s, a, s')\) | Reward function |
| \(\gamma\) | Discount factor, \(\gamma \in [0, 1]\) |
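
As a concrete illustration, these components can be written out in code. A minimal sketch of a hypothetical two-state, two-action MDP (names and numbers are illustrative):

# Hypothetical 2-state MDP: states {0, 1}, actions {0, 1}
# P[s][a] is a list of (probability, next_state, reward) tuples
P = {
    0: {0: [(1.0, 0, 0.0)],                  # action 0: stay in s=0, no reward
        1: [(0.8, 1, 1.0), (0.2, 0, 0.0)]},  # action 1: reach s=1 with prob 0.8
    1: {0: [(1.0, 0, 0.0)],
        1: [(1.0, 1, 2.0)]},                 # staying in s=1 yields reward 2
}
gamma = 0.9  # discount factor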

Key Equations

Return (cumulative discounted reward):

\[G_t = R_{t+1} + \gamma R_{t+2} + \gamma^2 R_{t+3} + \cdots = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}\]

Value function:

\[V^\pi(s) = \mathbb{E}_\pi[G_t | S_t = s]\]

Action-value function:

\[Q^\pi(s, a) = \mathbb{E}_\pi[G_t | S_t = s, A_t = a]\]

Bellman Equation:

\[V^\pi(s) = \sum_a \pi(a|s) \sum_{s'} P(s'|s,a)[R(s,a,s') + \gamma V^\pi(s')]\]
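
The Bellman equation translates directly into iterative policy evaluation: repeatedly apply its right-hand side until \(V\) stops changing. A minimal sketch, assuming the dictionary-based MDP representation from the example above:

def policy_evaluation(P, policy, gamma=0.9, theta=1e-8):
    # policy[s][a] is pi(a|s); P[s][a] is a list of (prob, next_state, reward)
    V = {s: 0.0 for s in P}
    while True:
        delta = 0.0
        for s in P:
            # Bellman backup: sum over actions and successor states
            v = sum(policy[s][a] * sum(p * (r + gamma * V[s2])
                                       for p, s2, r in P[s][a])
                    for a in P[s])
            delta = max(delta, abs(v - V[s]))
            V[s] = v
        if delta < theta:  # converged
            return V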

Algorithm Taxonomy

Reinforcement Learning
├── Value-based
│   ├── Tabular
│   │   ├── Dynamic Programming (Policy/Value Iteration)
│   │   ├── Monte Carlo
│   │   └── TD Learning (SARSA, Q-Learning)
│   └── Deep
│       ├── DQN (Deep Q-Network)
│       ├── Double DQN
│       ├── Dueling DQN
│       └── Rainbow
├── Policy-based
│   ├── Policy Gradient (REINFORCE)
│   ├── Actor-Critic (A2C, A3C)
│   ├── PPO (Proximal Policy Optimization)
│   └── TRPO (Trust Region Policy Optimization)
├── Model-based
│   ├── Dyna-Q
│   ├── World Models
│   └── MuZero
├── Offline RL
│   ├── BCQ, CQL
│   └── Decision Transformer
├── Multi-agent RL
│   └── MADDPG, QMIX
└── Inverse RL
    └── GAIL, Maximum Entropy IRL

Value-based Methods

Q-Learning

\[Q(s, a) \leftarrow Q(s, a) + \alpha[R + \gamma \max_{a'} Q(s', a') - Q(s, a)]\]

import numpy as np

class QLearning:
    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=0.1):
        self.Q = np.zeros((n_states, n_actions))  # tabular Q-value estimates
        self.lr = lr
        self.gamma = gamma
        self.epsilon = epsilon

    def choose_action(self, state):
        # Epsilon-greedy exploration
        if np.random.random() < self.epsilon:
            return np.random.randint(self.Q.shape[1])
        return np.argmax(self.Q[state])

    def update(self, state, action, reward, next_state, done):
        # TD target: R + gamma * max_a' Q(s', a'); no bootstrap at terminal states
        target = reward + (1 - done) * self.gamma * np.max(self.Q[next_state])
        self.Q[state, action] += self.lr * (target - self.Q[state, action])
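
A minimal training loop for the class above, assuming the gymnasium package and its FrozenLake-v1 environment (any environment with discrete states and actions works the same way):

import gymnasium as gym

env = gym.make("FrozenLake-v1")
agent = QLearning(env.observation_space.n, env.action_space.n)

for episode in range(5000):
    state, _ = env.reset()
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.update(state, action, reward, next_state, done)
        state = next_state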

DQN (Deep Q-Network)

Approximating the Q function with a neural network:

\[L(\theta) = \mathbb{E}[(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta))^2]\]

Key techniques:

  • Experience Replay: reuse past experience
  • Target Network: stabilize training

import torch
import torch.nn as nn
from collections import deque
import random

class DQN(nn.Module):
    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, x):
        return self.network(x)

class ReplayBuffer:
    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
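
A sketch of how the pieces combine into one optimization step; the batching details and hyperparameters are illustrative, and target_net is assumed to be a periodically synchronized copy of the online network:

import numpy as np
import torch
import torch.nn.functional as F

def dqn_update(online_net, target_net, buffer, optimizer,
               batch_size=64, gamma=0.99):
    # Sample a minibatch of transitions from the replay buffer
    batch = buffer.sample(batch_size)
    states, actions, rewards, next_states, dones = map(
        lambda x: torch.as_tensor(np.array(x), dtype=torch.float32),
        zip(*batch))

    # Q(s, a; theta) for the actions actually taken
    q = online_net(states).gather(1, actions.long().unsqueeze(1)).squeeze(1)

    # Bootstrapped target uses the frozen target network (theta^-)
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1 - dones) * max_next_q

    loss = F.mse_loss(q, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()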

Reference:

  • Mnih, V. et al. (2015). "Human-level Control through Deep Reinforcement Learning". Nature.


Policy-based Methods

Policy Gradient (REINFORCE)

\[\nabla_\theta J(\theta) = \mathbb{E}_\pi[\nabla_\theta \log \pi_\theta(a|s) \cdot G_t]\]

class PolicyNetwork(nn.Module):
    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)  # outputs pi(a|s) as a probability distribution
        )

    def forward(self, x):
        return self.network(x)

def reinforce_update(policy, optimizer, rewards, log_probs, gamma=0.99):
    # Compute discounted returns G_t, iterating backward over the episode
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalize returns to reduce gradient variance
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy gradient: minimize -log pi(a|s) * G_t
    loss = 0
    for log_prob, G in zip(log_probs, returns):
        loss -= log_prob * G

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
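
A sketch of how log_probs and rewards might be collected over one episode before calling reinforce_update, assuming a gymnasium-style environment with vector observations (e.g. CartPole-v1):

import torch
from torch.distributions import Categorical

def run_episode(env, policy):
    log_probs, rewards = [], []
    state, _ = env.reset()
    done = False
    while not done:
        probs = policy(torch.as_tensor(state, dtype=torch.float32))
        dist = Categorical(probs)                # pi(a|s) over discrete actions
        action = dist.sample()
        log_probs.append(dist.log_prob(action))  # keep grad for the update
        state, reward, terminated, truncated, _ = env.step(action.item())
        rewards.append(reward)
        done = terminated or truncated
    return log_probs, rewards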

PPO (Proximal Policy Optimization)

Limits the size of each policy update via clipping:

\[L^{CLIP}(\theta) = \mathbb{E}[\min(r_t(\theta)\hat{A}_t, \text{clip}(r_t(\theta), 1-\epsilon, 1+\epsilon)\hat{A}_t)]\]

where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\)
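
A minimal sketch of this clipped objective in PyTorch; the log-probabilities and advantage estimates are assumed to be precomputed tensors:

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    # r_t(theta) = pi_theta(a|s) / pi_theta_old(a|s), computed in log space
    ratio = torch.exp(log_probs - old_log_probs)
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps)
    # Negative sign: the surrogate objective is maximized, the loss minimized
    return -torch.min(ratio * advantages, clipped * advantages).mean()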

Reference:

  • Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms". arXiv.

Actor-Critic

Learns a policy (actor) and a value function (critic) simultaneously:

\[\nabla_\theta J(\theta) = \mathbb{E}[\nabla_\theta \log \pi_\theta(a|s) \cdot A(s, a)]\]

where the advantage \(A(s, a) = Q(s, a) - V(s) \approx r + \gamma V(s') - V(s)\)
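
A sketch of a one-step actor-critic update using the TD error as the advantage estimate; the actor, critic, shared optimizer, and tensor-valued transition are assumptions of this illustration:

import torch
from torch.distributions import Categorical

def actor_critic_update(actor, critic, optimizer, transition, gamma=0.99):
    state, action, reward, next_state, done = transition

    # TD error as advantage: A ~ r + gamma * V(s') - V(s)
    value = critic(state)
    with torch.no_grad():
        target = reward + gamma * (1 - done) * critic(next_state)
    advantage = (target - value).detach()

    critic_loss = (target - value).pow(2).mean()        # fit V(s) to TD target
    dist = Categorical(actor(state))
    actor_loss = -(dist.log_prob(action) * advantage).mean()

    optimizer.zero_grad()
    (actor_loss + critic_loss).backward()
    optimizer.step()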


Model-based RL

World Models

Learn a model of the environment, then train the policy in simulation:

1. Collect data by interacting with the environment
2. Learn a transition model: p(s'|s, a)
3. Train the policy inside the learned model (Dyna; see the sketch below)
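
A minimal tabular Dyna-Q sketch (illustrative; it reuses the QLearning class defined earlier). Real transitions update Q directly and also train a deterministic one-step model, which is then replayed for extra planning updates:

import random

class DynaQ(QLearning):
    def __init__(self, n_states, n_actions, n_planning=10, **kwargs):
        super().__init__(n_states, n_actions, **kwargs)
        self.model = {}  # (state, action) -> (reward, next_state)
        self.n_planning = n_planning

    def update(self, state, action, reward, next_state, done):
        # 1. Direct RL update from the real transition
        super().update(state, action, reward, next_state, done)
        # 2. Model learning: remember the observed transition
        self.model[(state, action)] = (reward, next_state)
        # 3. Planning: replay simulated transitions from the model
        for _ in range(self.n_planning):
            (s, a), (r, s2) = random.choice(list(self.model.items()))
            super().update(s, a, r, s2, False)  # terminal flag simplified away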

MuZero (DeepMind, 2020)

  • Masters games without being given explicit rules
  • Monte Carlo Tree Search with a learned model

Reference:

  • Schrittwieser, J. et al. (2020). "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model". Nature.


Offline RL

Learning from a pre-collected dataset, without further environment interaction:

Challenges:

  • Distribution shift: the policy visits states unlike those in the training data
  • Overestimation: Q-values of actions absent from the data are overestimated

Conservative Q-Learning (CQL):

\[\min_Q \; \alpha \left( \mathbb{E}_{s \sim D}\left[\log \sum_a \exp Q(s,a)\right] - \mathbb{E}_{(s,a) \sim D}[Q(s,a)] \right) + \frac{1}{2} \mathbb{E}[\text{TD error}^2]\]
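
A sketch of the conservative penalty (the first two expectations above) for a discrete-action Q-network; the TD-error term would be added separately, as in the DQN update. Names are illustrative:

import torch

def cql_penalty(q_net, states, actions):
    q_values = q_net(states)                   # shape: (batch, n_actions)
    # log sum_a exp Q(s,a): a soft maximum over all actions
    logsumexp = torch.logsumexp(q_values, dim=1)
    # Q(s,a) for actions actually present in the dataset D
    dataset_q = q_values.gather(1, actions.unsqueeze(1)).squeeze(1)
    return (logsumexp - dataset_q).mean()      # pushes down out-of-distribution Q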

Decision Transformer: reframes RL as a sequence modeling problem

References:

  • Kumar, A. et al. (2020). "Conservative Q-Learning for Offline Reinforcement Learning". NeurIPS.
  • Chen, L. et al. (2021). "Decision Transformer: Reinforcement Learning via Sequence Modeling". NeurIPS.


Libraries in Practice

# Stable Baselines3
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)

# Evaluation
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()

References

Textbooks

  • Sutton, R.S. & Barto, A.G. (2018). "Reinforcement Learning: An Introduction" (2nd ed.). MIT Press. (free online)

Key Papers

  • Mnih, V. et al. (2015). "Human-level Control through Deep Reinforcement Learning". Nature.
  • Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms". arXiv.
  • Silver, D. et al. (2016). "Mastering the Game of Go with Deep Neural Networks and Tree Search". Nature.

Libraries

  • Stable Baselines3: https://stable-baselines3.readthedocs.io/
  • RLlib (Ray): https://docs.ray.io/en/latest/rllib/
  • CleanRL: https://github.com/vwxyzjn/cleanrl