Online Learning Theory¶

개요¶

NeurIPS 2025 Best Paper (Runner-up)으로 선정된 연구. Online Learning의 새로운 이론적 결과로, regret bound의 타이트한 분석과 비정상 환경(non-stationary)에서의 학습 알고리즘을 제시한다.

기초 개념¶

Online Learning 설정¶

라운드 t = 1, 2, 3, ...
┌─────────────────────────────────────────────────────────────────┐
│  1. 학습자가 예측/결정 x_t 선택                                   │
│  2. 적대자(또는 환경)가 손실 함수 f_t 공개                         │
│  3. 학습자가 손실 f_t(x_t) 관측                                   │
│  4. 다음 라운드로                                                │
└─────────────────────────────────────────────────────────────────┘

Regret 정의¶

Static Regret (정적 최적 대비): $$ R_T = \sum_{t=1}^{T} f_t(x_t) - \min_{x \in \mathcal{X}} \sum_{t=1}^{T} f_t(x) $$

Dynamic Regret (변화하는 최적 대비): $$ R_T^{dyn} = \sum_{t=1}^{T} f_t(x_t) - \sum_{t=1}^{T} f_t(x_t^*) $$

여기서 $x_t^* = \arg\min_x f_t(x)$

알고리즘 목표¶

환경 유형	목표	최적 Regret
정적 (Stationary)	Static Regret 최소화	O(√T)
비정적 (Non-stationary)	Dynamic Regret 최소화	O(√(T·V_T))
적대적 (Adversarial)	Worst-case Regret	O(√T)

핵심 알고리즘¶

1. Online Gradient Descent (OGD)¶

import numpy as np

class OnlineGradientDescent:
    """Online Gradient Descent"""

    def __init__(self, dim: int, eta: float = None, diameter: float = 1.0):
        self.dim = dim
        self.diameter = diameter
        self.x = np.zeros(dim)
        self.t = 0
        self.eta = eta  # None이면 자동 설정

    def predict(self) -> np.ndarray:
        return self.x.copy()

    def update(self, gradient: np.ndarray):
        """그래디언트 업데이트"""
        self.t += 1

        # 학습률 설정
        if self.eta is None:
            eta = self.diameter / (np.linalg.norm(gradient) * np.sqrt(self.t))
        else:
            eta = self.eta

        # 그래디언트 스텝
        self.x = self.x - eta * gradient

        # 프로젝션 (feasible set으로)
        self.x = self._project(self.x)

    def _project(self, x: np.ndarray) -> np.ndarray:
        """L2 볼에 프로젝션"""
        norm = np.linalg.norm(x)
        if norm > self.diameter:
            x = x * self.diameter / norm
        return x

Regret Bound: $$ R_T \leq \frac{D^2}{2\eta} + \frac{\eta}{2} \sum_{t=1}^{T} |g_t|^2 $$

최적 $\eta = D / (G\sqrt{T})$ 일 때: $$ R_T \leq DG\sqrt{T} $$

2. Follow The Regularized Leader (FTRL)¶

class FTRL:
    """Follow The Regularized Leader"""

    def __init__(self, dim: int, eta: float = 1.0, regularizer: str = 'l2'):
        self.dim = dim
        self.eta = eta
        self.regularizer = regularizer
        self.cumulative_grad = np.zeros(dim)
        self.t = 0

    def predict(self) -> np.ndarray:
        """정규화된 누적 손실 최소화"""
        if self.t == 0:
            return np.zeros(self.dim)

        # x = argmin { η * <∑g_s, x> + R(x) }
        if self.regularizer == 'l2':
            # R(x) = ||x||^2 / 2
            x = -self.eta * self.cumulative_grad
        elif self.regularizer == 'entropy':
            # R(x) = ∑ x_i log(x_i) (simplex)
            logits = -self.eta * self.cumulative_grad
            x = np.exp(logits) / np.sum(np.exp(logits))

        return x

    def update(self, gradient: np.ndarray):
        self.t += 1
        self.cumulative_grad += gradient

3. AdaGrad (적응적 학습률)¶

class AdaGrad:
    """Adaptive Gradient Method for Online Learning"""

    def __init__(self, dim: int, eta: float = 1.0, epsilon: float = 1e-8):
        self.dim = dim
        self.eta = eta
        self.epsilon = epsilon
        self.x = np.zeros(dim)
        self.sum_squared_grads = np.zeros(dim)

    def predict(self) -> np.ndarray:
        return self.x.copy()

    def update(self, gradient: np.ndarray):
        # 그래디언트 제곱 누적
        self.sum_squared_grads += gradient ** 2

        # 적응적 학습률
        adaptive_eta = self.eta / (np.sqrt(self.sum_squared_grads) + self.epsilon)

        # 업데이트
        self.x = self.x - adaptive_eta * gradient

Regret Bound (data-dependent): $$ R_T \leq \frac{D}{\eta} \sqrt{\sum_{t=1}^{T} |g_t|_\infty^2} $$

→ 그래디언트가 희소하면 O(√T)보다 나음

NeurIPS 2025 핵심 기여¶

1. Optimal Dynamic Regret Bound¶

기존 결과: $$ R_T^{dyn} = O(\sqrt{T(1 + P_T)}) $$

여기서 $P_T = \sum_{t=1}^{T-1} \|x_t^* - x_{t+1}^*\|$ (path length)

새로운 결과: $$ R_T^{dyn} = O(\sqrt{T \cdot V_T}) $$

여기서 $V_T = \sum_{t=1}^{T-1} \sup_x |f_t(x) - f_{t+1}(x)|$ (functional variation)

→ 함수 자체의 변화가 작으면 더 나은 bound

2. Parameter-Free Online Learning¶

class ParameterFreeOGD:
    """학습률 튜닝 불필요한 OGD"""

    def __init__(self, dim: int, lip_const: float = 1.0):
        self.dim = dim
        self.G = lip_const
        self.x = np.zeros(dim)
        self.wealth = 0.0  # "wealth" 추적
        self.betting_fraction = 0.5
        self.cumulative_grad = np.zeros(dim)
        self.t = 0

    def predict(self) -> np.ndarray:
        return self.x.copy()

    def update(self, gradient: np.ndarray, loss: float):
        self.t += 1

        # Coin-betting framework
        # "wealth"는 누적 성능
        self.wealth = self.wealth - loss

        # 적응적 학습률 (wealth 기반)
        if self.t > 1:
            eta = abs(self.wealth) / (self.G * np.sqrt(self.t))
            eta = np.clip(eta, 1e-8, 1.0)
        else:
            eta = 1.0 / self.G

        self.cumulative_grad += gradient
        self.x = -eta * self.cumulative_grad

특징: - 학습률 η를 미리 설정할 필요 없음 - 자동으로 문제 난이도에 적응 - 최적 O(√T) regret 달성

3. Multi-Scale Online Learning¶

class MultiScaleOnlineLearner:
    """다중 스케일 비정상 환경 학습"""

    def __init__(self, dim: int, n_experts: int = 10):
        self.dim = dim
        self.n_experts = n_experts

        # 다양한 학습률의 전문가들
        self.experts = [
            OnlineGradientDescent(dim, eta=2**(-k))
            for k in range(n_experts)
        ]

        # 전문가 가중치 (Hedge)
        self.weights = np.ones(n_experts) / n_experts
        self.eta_hedge = 0.1

    def predict(self) -> np.ndarray:
        """가중 평균 예측"""
        predictions = np.array([e.predict() for e in self.experts])
        return np.average(predictions, axis=0, weights=self.weights)

    def update(self, gradient: np.ndarray, losses: np.ndarray):
        """
        gradient: 전체 그래디언트
        losses: 각 전문가의 손실
        """
        # 각 전문가 업데이트
        for expert in self.experts:
            expert.update(gradient)

        # Hedge로 가중치 업데이트
        self.weights *= np.exp(-self.eta_hedge * losses)
        self.weights /= self.weights.sum()

Regret Bound: $$ R_T = O\left(\sqrt{T \log T} + \sqrt{S_T \log T}\right) $$

여기서 $S_T$는 환경 변화 횟수 (switching number)

이론적 결과¶

Lower Bound¶

어떤 알고리즘도 다음보다 나을 수 없음:

설정	Lower Bound
Convex, Lipschitz	Ω(√T)
Strongly Convex	Ω(log T)
Exp-concave	Ω(d log T)
Non-stationary	Ω(√(T·V_T))

최적성¶

알고리즘	Upper Bound	Lower Bound	최적?
OGD	O(√T)	Ω(√T)	Yes
ONS	O(d log T)	Ω(d log T)	Yes
AdaGrad	O(√(∑\|g\|²))	-	Data-dep. opt.

응용¶

1. Online Portfolio Selection¶

class OnlinePortfolio:
    """온라인 포트폴리오 최적화"""

    def __init__(self, n_assets: int):
        self.n_assets = n_assets
        # Exponentiated Gradient
        self.weights = np.ones(n_assets) / n_assets
        self.eta = 0.1

    def predict(self) -> np.ndarray:
        """포트폴리오 가중치 반환"""
        return self.weights.copy()

    def update(self, returns: np.ndarray):
        """
        returns: 각 자산의 수익률
        """
        # Exponentiated gradient update
        # w_{t+1} ∝ w_t * exp(η * r_t)
        self.weights *= np.exp(self.eta * returns)
        self.weights /= self.weights.sum()  # Simplex 정규화

    def regret_vs_best_stock(self, all_returns: np.ndarray) -> float:
        """최고 단일 주식 대비 regret"""
        cumulative_returns = np.sum(all_returns, axis=0)
        best_stock_return = np.max(cumulative_returns)

        # 우리 전략의 수익
        portfolio_returns = []
        weights = np.ones(self.n_assets) / self.n_assets
        for t in range(len(all_returns)):
            portfolio_return = np.dot(weights, all_returns[t])
            portfolio_returns.append(portfolio_return)
            # 가중치 업데이트
            weights *= np.exp(self.eta * all_returns[t])
            weights /= weights.sum()

        our_return = np.sum(portfolio_returns)

        return best_stock_return - our_return

2. Online Convex Optimization for ML¶

class OnlineSGD:
    """온라인 학습으로서의 SGD"""

    def __init__(self, model, lr: float = 0.01):
        self.model = model
        self.lr = lr
        self.t = 0

    def step(self, x, y):
        """한 샘플로 업데이트"""
        self.t += 1

        # 예측
        pred = self.model(x)
        loss = self.loss_fn(pred, y)

        # 그래디언트
        loss.backward()

        # 적응적 학습률 (1/sqrt(t))
        lr = self.lr / np.sqrt(self.t)

        with torch.no_grad():
            for param in self.model.parameters():
                param -= lr * param.grad
                param.grad.zero_()

        return loss.item()

3. Adversarial Online Learning¶

class AdversarialBandit:
    """적대적 밴딧 문제"""

    def __init__(self, n_arms: int, gamma: float = 0.1):
        self.n_arms = n_arms
        self.gamma = gamma  # 탐색률
        self.weights = np.ones(n_arms)

    def select_arm(self) -> int:
        """Exp3 알고리즘"""
        # 혼합 분포
        probs = (1 - self.gamma) * (self.weights / self.weights.sum())
        probs += self.gamma / self.n_arms

        return np.random.choice(self.n_arms, p=probs)

    def update(self, arm: int, reward: float):
        """중요도 가중 업데이트"""
        prob = (1 - self.gamma) * (self.weights[arm] / self.weights.sum())
        prob += self.gamma / self.n_arms

        # 중요도 가중 보상 추정
        estimated_reward = reward / prob

        # 가중치 업데이트
        self.weights[arm] *= np.exp(self.gamma * estimated_reward / self.n_arms)

연결: Online Learning ↔ Deep Learning¶

Implicit Regularization¶

SGD의 암시적 정규화 효과를 Online Learning 관점에서 분석:

Online Learning 관점:
- 각 미니배치 = 하나의 라운드
- 손실 함수 = 배치 손실
- Regret = 일반화 오류

발견:
- SGD의 암시적 정규화 ≈ FTRL의 명시적 정규화
- 작은 학습률 → 강한 정규화
- 큰 배치 → 더 정확한 그래디언트 → 덜한 정규화

Neural Network의 Online Learning¶

def online_learning_view_of_training(model, data_stream):
    """신경망 학습의 Online Learning 관점"""

    optimizer = torch.optim.SGD(model.parameters(), lr=0.01)

    cumulative_loss = 0
    best_cumulative = float('inf')

    for t, (x, y) in enumerate(data_stream):
        # 예측 (결정 x_t)
        pred = model(x)

        # 손실 관측 (f_t 공개)
        loss = F.cross_entropy(pred, y)
        cumulative_loss += loss.item()

        # 업데이트
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()

        # Regret 추적
        # (실제로는 최적 비교 필요)
        if t % 1000 == 0:
            avg_loss = cumulative_loss / (t + 1)
            print(f"Step {t}: Avg Loss = {avg_loss:.4f}")

요약¶

핵심 포인트¶

Online Learning: 순차적 의사결정의 수학적 프레임워크
Regret: 사후 최선 대비 성능 측정
최적 알고리즘: OGD (O(√T)), AdaGrad (data-dependent)
비정상 환경: Dynamic regret, multi-scale 학습

알고리즘 선택 가이드¶

상황	알고리즘
일반 볼록	OGD
희소 그래디언트	AdaGrad
강볼록	ONS
비정상	Adaptive restart, Multi-scale
학습률 모름	Parameter-free

참고 자료¶

NeurIPS 2025 Best Paper (Runner-up) - Online Learning Theory
Prediction, Learning, and Games - Cesa-Bianchi & Lugosi
Introduction to Online Convex Optimization - Hazan

마지막 업데이트: 2026-03-04