Reinforcement Learning Overview¶
Reinforcement Learning (RL) is a machine-learning paradigm in which an agent learns a policy that maximizes cumulative reward through interaction with an environment. It is used in game AI, robot control, recommendation systems, and more.
Core Concepts¶
MDP (Markov Decision Process)¶
| Component | Description |
|---|---|
| \(S\) | State space |
| \(A\) | Action space |
| \(P(s' \mid s, a)\) | Transition probability function |
| \(R(s, a, s')\) | Reward function |
| \(\gamma\) | Discount factor, \(0 \le \gamma \le 1\) |
Key Equations¶
Expected return (cumulative discounted reward):

$$G_t = \sum_{k=0}^{\infty} \gamma^k R_{t+k+1}$$

Value function:

$$V^\pi(s) = \mathbb{E}_\pi\left[G_t \mid S_t = s\right]$$

Action-value function:

$$Q^\pi(s, a) = \mathbb{E}_\pi\left[G_t \mid S_t = s, A_t = a\right]$$

Bellman Equation:

$$V^\pi(s) = \sum_{a} \pi(a \mid s) \sum_{s'} P(s' \mid s, a) \left[R(s, a, s') + \gamma V^\pi(s')\right]$$
Algorithm Taxonomy¶
Reinforcement Learning
├── Value-based
│ ├── Tabular
│ │ ├── Dynamic Programming (Policy/Value Iteration)
│ │ ├── Monte Carlo
│ │ └── TD Learning (SARSA, Q-Learning)
│ └── Deep
│ ├── DQN (Deep Q-Network)
│ ├── Double DQN
│ ├── Dueling DQN
│ └── Rainbow
├── Policy-based
│ ├── Policy Gradient (REINFORCE)
│ ├── Actor-Critic (A2C, A3C)
│ ├── PPO (Proximal Policy Optimization)
│ └── TRPO (Trust Region Policy Optimization)
├── Model-based
│ ├── Dyna-Q
│ ├── World Models
│ └── MuZero
├── Offline RL
│ ├── BCQ, CQL
│ └── Decision Transformer
├── Multi-agent RL
│ └── MADDPG, QMIX
└── Inverse RL
    └── MaxEnt IRL, GAIL
Value-based Methods¶
Q-Learning¶
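Q-learning updates its action-value estimates toward the greedy TD target with the update rule

$$Q(s, a) \leftarrow Q(s, a) + \alpha\left[r + \gamma \max_{a'} Q(s', a') - Q(s, a)\right]$$

which the class below implements in tabular form.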
import numpy as np

class QLearning:
    """Tabular Q-learning with epsilon-greedy exploration."""

    def __init__(self, n_states, n_actions, lr=0.1, gamma=0.99, epsilon=0.1):
        self.Q = np.zeros((n_states, n_actions))  # Q-table
        self.lr = lr            # learning rate (alpha)
        self.gamma = gamma      # discount factor
        self.epsilon = epsilon  # exploration probability

    def choose_action(self, state):
        # Epsilon-greedy: explore randomly with probability epsilon, else act greedily.
        if np.random.random() < self.epsilon:
            return np.random.randint(self.Q.shape[1])
        return np.argmax(self.Q[state])

    def update(self, state, action, reward, next_state, done):
        # Bootstrap from the best next action; skip the bootstrap at terminal states.
        target = reward + (1 - done) * self.gamma * np.max(self.Q[next_state])
        self.Q[state, action] += self.lr * (target - self.Q[state, action])
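A hypothetical usage sketch, assuming the Gymnasium API and an environment with discrete states and actions (the environment choice and episode count are illustrative):

import gymnasium as gym

env = gym.make("FrozenLake-v1")  # illustrative discrete-state environment
agent = QLearning(env.observation_space.n, env.action_space.n)

for episode in range(1000):
    state, _ = env.reset()
    done = False
    while not done:
        action = agent.choose_action(state)
        next_state, reward, terminated, truncated, _ = env.step(action)
        done = terminated or truncated
        agent.update(state, action, reward, next_state, done)
        state = next_state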
DQN (Deep Q-Network)¶
Approximate the Q-function with a neural network \(Q(s, a; \theta)\), trained to minimize the TD loss:

$$L(\theta) = \mathbb{E}_{(s,a,r,s') \sim \mathcal{D}}\left[\left(r + \gamma \max_{a'} Q(s', a'; \theta^-) - Q(s, a; \theta)\right)^2\right]$$

where \(\theta^-\) denotes the parameters of the target network.
Key techniques (see the training-step sketch after the code below):

- Experience Replay: store past transitions and sample them randomly, breaking the correlation between consecutive samples
- Target Network: a periodically synchronized copy of the Q-network that keeps the TD target stable
import torch
import torch.nn as nn
from collections import deque
import random

class DQN(nn.Module):
    """MLP that maps a state to one Q-value per action."""

    def __init__(self, state_dim, action_dim, hidden_dim=128):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.ReLU(),
            nn.Linear(hidden_dim, action_dim)
        )

    def forward(self, x):
        return self.network(x)

class ReplayBuffer:
    """Fixed-size FIFO buffer of (s, a, r, s', done) transitions."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def push(self, state, action, reward, next_state, done):
        self.buffer.append((state, action, reward, next_state, done))

    def sample(self, batch_size):
        # Uniform random sampling breaks temporal correlation.
        return random.sample(self.buffer, batch_size)

    def __len__(self):
        return len(self.buffer)
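A minimal sketch of a single training step tying the two pieces together, assuming a separate target network kept in sync elsewhere; `dqn_train_step` and its hyperparameters are illustrative, not from the original text:

import numpy as np

def dqn_train_step(q_net, target_net, buffer, optimizer, batch_size=64, gamma=0.99):
    # Illustrative sketch: sample a replay batch and regress onto the TD target.
    if len(buffer) < batch_size:
        return
    states, actions, rewards, next_states, dones = zip(*buffer.sample(batch_size))
    states = torch.as_tensor(np.array(states), dtype=torch.float32)
    actions = torch.as_tensor(actions, dtype=torch.int64)
    rewards = torch.as_tensor(rewards, dtype=torch.float32)
    next_states = torch.as_tensor(np.array(next_states), dtype=torch.float32)
    dones = torch.as_tensor(dones, dtype=torch.float32)

    # Q-values of the actions actually taken (experience replay batch)
    q_values = q_net(states).gather(1, actions.unsqueeze(1)).squeeze(1)

    # TD target computed from the frozen target network
    with torch.no_grad():
        max_next_q = target_net(next_states).max(dim=1).values
        target = rewards + gamma * (1 - dones) * max_next_q

    loss = nn.functional.mse_loss(q_values, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()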
Reference:

- Mnih, V. et al. (2015). "Human-level Control through Deep Reinforcement Learning". Nature.
Policy-based Methods¶
Policy Gradient (REINFORCE)¶
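REINFORCE performs gradient ascent on the expected return using the policy-gradient estimator

$$\nabla_\theta J(\theta) = \mathbb{E}_{\pi_\theta}\left[\sum_t \nabla_\theta \log \pi_\theta(a_t \mid s_t)\, G_t\right]$$

where \(G_t\) is the discounted return from step \(t\), as implemented below.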
class PolicyNetwork(nn.Module):
    """MLP that outputs a probability distribution over discrete actions."""

    def __init__(self, state_dim, action_dim):
        super().__init__()
        self.network = nn.Sequential(
            nn.Linear(state_dim, 128),
            nn.ReLU(),
            nn.Linear(128, action_dim),
            nn.Softmax(dim=-1)
        )

    def forward(self, x):
        return self.network(x)

def reinforce_update(policy, optimizer, rewards, log_probs, gamma=0.99):
    # Compute discounted returns G_t, working backwards through the episode
    returns = []
    G = 0
    for r in reversed(rewards):
        G = r + gamma * G
        returns.insert(0, G)
    returns = torch.tensor(returns)
    # Normalize returns to reduce gradient variance
    returns = (returns - returns.mean()) / (returns.std() + 1e-8)

    # Policy gradient: accumulate -log pi(a_t|s_t) * G_t
    loss = 0
    for log_prob, G in zip(log_probs, returns):
        loss -= log_prob * G

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
PPO (Proximal Policy Optimization)¶
PPO limits the size of each policy update by clipping the probability ratio:

$$L^{\text{CLIP}}(\theta) = \mathbb{E}_t\left[\min\left(r_t(\theta)\hat{A}_t,\ \text{clip}\left(r_t(\theta),\, 1-\epsilon,\, 1+\epsilon\right)\hat{A}_t\right)\right]$$

where \(r_t(\theta) = \frac{\pi_\theta(a_t|s_t)}{\pi_{\theta_{old}}(a_t|s_t)}\) and \(\hat{A}_t\) is the advantage estimate.
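A minimal PyTorch sketch of the clipped surrogate loss (the function name and `eps` value are illustrative assumptions):

import torch

def ppo_clip_loss(log_probs, old_log_probs, advantages, eps=0.2):
    # Illustrative sketch of the PPO clipped objective
    ratio = torch.exp(log_probs - old_log_probs)                 # r_t(theta)
    unclipped = ratio * advantages
    clipped = torch.clamp(ratio, 1 - eps, 1 + eps) * advantages
    # Negative sign: maximizing the surrogate = minimizing its negation
    return -torch.min(unclipped, clipped).mean()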
Reference:

- Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms". arXiv.
Actor-Critic¶
Train a policy (the Actor) and a value function (the Critic) jointly; the actor ascends the gradient

$$\nabla_\theta J(\theta) \approx \mathbb{E}\left[\nabla_\theta \log \pi_\theta(a \mid s)\, A(s, a)\right]$$

where the advantage \(A(s, a) = Q(s, a) - V(s) \approx r + \gamma V(s') - V(s)\).
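A one-step actor-critic update sketch under these definitions; `actor_critic_update` and the single optimizer over both actor and critic parameters are illustrative assumptions, not a specific library API:

def actor_critic_update(critic, optimizer, log_prob, reward,
                        state, next_state, done, gamma=0.99):
    # One-step TD error serves as the advantage estimate A(s, a)
    value = critic(state)
    with torch.no_grad():
        next_value = critic(next_state) * (1 - done)
    advantage = reward + gamma * next_value - value

    actor_loss = -log_prob * advantage.detach()  # policy-gradient term (advantage treated as constant)
    critic_loss = advantage.pow(2)               # squared TD error trains the critic

    optimizer.zero_grad()
    (actor_loss + critic_loss).sum().backward()
    optimizer.step()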
Model-based RL¶
World Models¶
Learn a model of the environment, then train the policy by planning or simulating inside the learned model.
MuZero (DeepMind, 2020)¶
- Masters games without being given explicit rules
- Plans with Monte Carlo Tree Search over a learned model
Reference:

- Schrittwieser, J. et al. (2020). "Mastering Atari, Go, Chess and Shogi by Planning with a Learned Model". Nature.
Offline RL¶
Learn from a pre-collected dataset, without any further interaction with the environment.

Challenges:

- Distribution shift: the learned policy visits states that differ from the training data
- Overestimation: Q-values are overestimated for actions never seen during training
Conservative Q-Learning (CQL) adds a regularizer that pushes down Q-values of out-of-distribution actions:

$$\min_Q\ \alpha\left(\mathbb{E}_{s \sim \mathcal{D},\, a \sim \mu}\left[Q(s,a)\right] - \mathbb{E}_{(s,a) \sim \mathcal{D}}\left[Q(s,a)\right]\right) + \frac{1}{2}\,\mathbb{E}_{(s,a,s') \sim \mathcal{D}}\left[\left(Q(s,a) - \mathcal{B}^{\pi}\hat{Q}(s,a)\right)^{2}\right]$$
Decision Transformer: reframes RL as sequence modeling, autoregressively predicting actions conditioned on desired returns-to-go, past states, and past actions.
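The conditioning signal is the undiscounted return-to-go at each timestep; a minimal sketch of its computation (the helper name is illustrative):

def returns_to_go(rewards):
    # Suffix sums: rtg[t] = sum of rewards from step t to the end of the episode
    rtg, total = [], 0.0
    for r in reversed(rewards):
        total += r
        rtg.insert(0, total)
    return rtg

# e.g. rewards [1, 1, 1] -> returns-to-go [3, 2, 1]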
References:

- Kumar, A. et al. (2020). "Conservative Q-Learning for Offline Reinforcement Learning". NeurIPS.
- Chen, L. et al. (2021). "Decision Transformer: Reinforcement Learning via Sequence Modeling". NeurIPS.
Practical Libraries¶
# Stable Baselines3: train PPO on CartPole (requires the Gymnasium API)
import gymnasium as gym
from stable_baselines3 import PPO

env = gym.make("CartPole-v1")
model = PPO("MlpPolicy", env, verbose=1)
model.learn(total_timesteps=100000)

# Evaluation
obs, _ = env.reset()
for _ in range(1000):
    action, _ = model.predict(obs, deterministic=True)
    obs, reward, terminated, truncated, info = env.step(action)
    if terminated or truncated:
        obs, _ = env.reset()
References¶
Textbooks¶
- Sutton, R.S. & Barto, A.G. (2018). "Reinforcement Learning: An Introduction" (2nd ed.). MIT Press. (free online)
Key Papers¶
- Mnih, V. et al. (2015). "Human-level Control through Deep Reinforcement Learning". Nature.
- Schulman, J. et al. (2017). "Proximal Policy Optimization Algorithms". arXiv.
- Silver, D. et al. (2016). "Mastering the Game of Go with Deep Neural Networks and Tree Search". Nature.
Libraries¶
- Stable Baselines3: https://stable-baselines3.readthedocs.io/
- RLlib (Ray): https://docs.ray.io/en/latest/rllib/
- CleanRL: https://github.com/vwxyzjn/cleanrl