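Score Matching

Score matching is a technique for estimating the parameters of a probabilistic model without computing its normalizing constant. It provides the theoretical foundation for diffusion models and score-based generative models, and it has become a central concept in recent generative modeling research.

1. Overview

1.1 Background and Motivation

Many probabilistic models are defined through an unnormalized density exp(-E(x; theta)), which must be divided by a normalizing constant to give a valid density: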

p(x; theta) = exp(-E(x; theta)) / Z(theta)

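Here Z(theta), the integral of exp(-E(x; theta)) over all x, is the normalizing constant (partition function). For high-dimensional data, computing Z(theta) directly is generally intractable because it requires integrating over the entire input space. Score matching estimates the model parameters while sidestepping this computation.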

1.2 Definition of the Score Function

The score function is defined as the gradient of the log probability density with respect to the input x:

s(x) = nabla_x log p(x)

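Key properties of the score function:
- It does not depend on the normalizing constant Z: since log p(x; theta) = -E(x; theta) - log Z(theta), the log Z(theta) term vanishes when differentiating with respect to x.
- It encodes local geometric information about the data distribution.
- It can be interpreted as a vector field that points toward regions of higher probability.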
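
The first property can be checked numerically. The sketch below is a minimal illustration on a 1-D Gaussian; the function names and the arbitrary `scale` constant (standing in for an unknown 1/Z) are assumptions made for this example, not part of the original notes.

```python
import math


def log_unnormalized(x: float, mu: float = 1.0, sigma: float = 2.0, scale: float = 123.0) -> float:
    """log of scale * exp(-(x - mu)^2 / (2 sigma^2)); `scale` stands in for an unknown 1/Z."""
    return math.log(scale) - (x - mu) ** 2 / (2 * sigma ** 2)


def numerical_score(log_density, x: float, h: float = 1e-5) -> float:
    """Central finite-difference approximation of d/dx log p(x)."""
    return (log_density(x + h) - log_density(x - h)) / (2 * h)


def analytic_score(x: float, mu: float = 1.0, sigma: float = 2.0) -> float:
    """Score of the normalized Gaussian N(mu, sigma^2)."""
    return -(x - mu) / sigma ** 2


x0 = 3.0
# Two different (arbitrary) normalization constants give the same score.
score_a = numerical_score(log_unnormalized, x0)
score_b = numerical_score(lambda x: log_unnormalized(x, scale=0.001), x0)
print(score_a, score_b, analytic_score(x0))
```

All three printed values should agree up to finite-difference error, because the constant only shifts log p(x) and a shift has zero gradient.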

2. Theoretical Foundations

2.1 Explicit Score Matching (Hyvärinen, 2005)

Item	Details
Paper	Estimation of Non-Normalized Statistical Models by Score Matching
Source	JMLR 2005
Author	Aapo Hyvärinen

Hyvärinen's original objective function:

J(theta) = E_p_data [ 0.5 * ||s_model(x; theta)||^2 + tr(nabla_x s_model(x; theta)) ]

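where:
- s_model(x; theta) = nabla_x log p_model(x; theta) is the score function of the model.
- tr(nabla_x s_model) is the trace of the Jacobian of the score function.

Key insight: under mild regularity conditions, minimizing J(theta) is equivalent to minimizing the Fisher divergence, because the two differ only by a constant that does not depend on theta. The unknown data score s_data therefore never has to be evaluated:

D_F(p_data || p_model) = 0.5 * E_p_data [ ||s_data(x) - s_model(x; theta)||^2 ]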
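
As a worked example, J(theta) can be evaluated in closed form for a 1-D Gaussian model N(mu, var), where s_model(x) = -(x - mu) / var and its derivative is -1 / var. The sketch below is an illustration under that assumed toy model (the names `esm_objective`, `mu_hat`, and `var_hat` are hypothetical); it checks that the objective is lowest at the sample mean and variance, without ever using s_data or a normalizing constant.

```python
import random
import statistics
from typing import Sequence


def esm_objective(data: Sequence[float], mu: float, var: float) -> float:
    """Empirical J(theta) for a 1-D Gaussian model N(mu, var).

    Model score: s(x) = -(x - mu) / var, so ds/dx = -1 / var.
    """
    total = 0.0
    for x in data:
        score = -(x - mu) / var
        total += 0.5 * score ** 2 + (-1.0 / var)
    return total / len(data)


rng = random.Random(0)
data = [rng.gauss(1.5, 0.7) for _ in range(5000)]

# For this model the minimizer of J is the sample mean and (population) variance.
mu_hat = statistics.fmean(data)
var_hat = statistics.pvariance(data, mu_hat)

j_best = esm_objective(data, mu_hat, var_hat)
j_shifted = esm_objective(data, mu_hat + 0.5, var_hat)
j_wide = esm_objective(data, mu_hat, 2.0 * var_hat)
print(j_best, j_shifted, j_wide)
```

Setting the derivatives of J with respect to mu and var to zero gives exactly these two estimates, so the perturbed parameter settings should print larger objective values.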

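2.2 Denoising Score Matching (Vincent, 2011)

Item	Details
Paper	A Connection Between Score Matching and Denoising Autoencoders
Source	Neural Computation 2011
Author	Pascal Vincent

Computing the Jacobian trace in explicit score matching is expensive in high dimensions. Denoising score matching (DSM) avoids this computation.

Idea: perturb each clean data point x with noise, x_tilde = x + sigma * epsilon, and train the model to match the score of the conditional noise distribution p(x_tilde | x):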

J_DSM(theta) = E_x ~ p_data E_epsilon ~ N(0,I) [ ||s_model(x_tilde; theta) - nabla_x_tilde log p(x_tilde | x)||^2 ]

For Gaussian noise:

nabla_x_tilde log p(x_tilde | x) = -(x_tilde - x) / sigma^2 = -epsilon / sigma

The objective therefore simplifies to:

J_DSM(theta) = E [ ||s_model(x + sigma * epsilon; theta) + epsilon / sigma||^2 ]
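
To see what this objective converges to, consider a toy case that can be solved by hand: data drawn from N(0, 1) and a linear model s(x) = a * x. The sketch below is an illustration under those assumptions (`fit_linear_dsm` is a hypothetical helper). It computes the least-squares minimizer of the DSM loss and compares it with -1 / (1 + sigma^2), the score slope of the noise-perturbed distribution N(0, 1 + sigma^2).

```python
import random


def fit_linear_dsm(data, sigma, rng):
    """Least-squares minimizer of the DSM loss for the model s(x) = a * x."""
    num = 0.0
    den = 0.0
    for x in data:
        eps = rng.gauss(0.0, 1.0)
        x_tilde = x + sigma * eps
        target = -eps / sigma          # nabla log p(x_tilde | x)
        num += x_tilde * target
        den += x_tilde * x_tilde
    return num / den


rng = random.Random(0)
sigma = 0.5
data = [rng.gauss(0.0, 1.0) for _ in range(20000)]

a_hat = fit_linear_dsm(data, sigma, rng)
a_true = -1.0 / (1.0 + sigma ** 2)     # score slope of N(0, 1 + sigma^2)
print(a_hat, a_true)
```

The fitted slope should land near -0.8 rather than -1, which illustrates that DSM recovers the score of the perturbed distribution; it approaches the clean data score only as sigma becomes small.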

2.3 Sliced Score Matching (Song et al., 2019)

Item	Details
Paper	Sliced Score Matching: A Scalable Approach to Density and Score Estimation
Source	UAI 2019
Authors	Yang Song, Sahaj Garg, Jiaxin Shi, Stefano Ermon

Sliced score matching approximates the Jacobian trace with random projections:

J_SSM(theta) = E_v E_x [ v^T nabla_x s_model(x; theta) v + 0.5 * (v^T s_model(x; theta))^2 ]

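Here v is a random projection vector, for example sampled uniformly from the unit sphere.

Advantages:
- The full Jacobian never has to be computed.
- The term v^T nabla_x s_model v can be obtained with automatic differentiation at roughly the cost of one extra backward pass, by differentiating v^T s_model(x; theta) with respect to x and taking the inner product with v.
- Training uses the original data directly, with no added noise.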
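
The random-projection idea can be illustrated on a case where the Jacobian is known. The sketch below assumes a hypothetical linear score field s(x) = A x + b, whose Jacobian is the matrix A, and uses Rademacher (random sign) projections, one common choice for which the average of v^T A v is an unbiased estimate of tr(A). The matrix values and helper names are made up for the example.

```python
import random


def quad_form(a, v):
    """v^T A v for a square matrix given as a list of rows."""
    n = len(v)
    return sum(v[i] * a[i][j] * v[j] for i in range(n) for j in range(n))


def sliced_trace_estimate(a, n_projections, rng):
    """Average v^T A v over Rademacher projections v; its expectation is tr(A)."""
    n = len(a)
    total = 0.0
    for _ in range(n_projections):
        v = [rng.choice((-1.0, 1.0)) for _ in range(n)]
        total += quad_form(a, v)
    return total / n_projections


# Jacobian of a hypothetical linear score field s(x) = A x + b
jacobian = [
    [-2.0, 0.3, 0.0],
    [0.1, -1.0, 0.5],
    [0.0, -0.4, -0.5],
]
exact_trace = sum(jacobian[i][i] for i in range(3))
estimate = sliced_trace_estimate(jacobian, 20000, random.Random(0))
print(exact_trace, estimate)
```

With a neural score model the matrix A is never formed; only the scalar v^T nabla_x s_model v is computed per projection, which is what keeps the cost independent of the data dimension.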

3. Score-based Generative Models

3.1 NCSN (Noise Conditional Score Networks)

Item	Details
Paper	Generative Modeling by Estimating Gradients of the Data Distribution
Source	NeurIPS 2019 (Oral)
Authors	Yang Song, Stefano Ermon
Code	github.com/ermongroup/ncsn

Core idea:
1. Learn the score function at multiple noise levels {sigma_1, sigma_2, ..., sigma_L}.
2. Generate samples with annealed Langevin dynamics driven by the learned scores.

Langevin dynamics update:

x_{t+1} = x_t + (epsilon / 2) * s_model(x_t) + sqrt(epsilon) * z_t

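where z_t ~ N(0, I) and epsilon is the step size. In the annealed variant, this update is run at each noise level in turn, from the largest sigma to the smallest.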
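
The annealed procedure can be sketched without a neural network by using a toy distribution whose perturbed scores are known exactly. The example below assumes 1-D data from N(2, 0.5^2), so the score at noise level sigma is that of N(2, 0.5^2 + sigma^2). The step size at each level is scaled in proportion to sigma^2, following the schedule used for annealed Langevin dynamics in the NCSN paper; the specific values of `eps`, the noise range, and the number of steps are illustrative assumptions.

```python
import math
import random
import statistics


def perturbed_score(x, sigma, mean=2.0, std=0.5):
    """Exact score of N(mean, std^2) convolved with N(0, sigma^2)."""
    return -(x - mean) / (std ** 2 + sigma ** 2)


def annealed_langevin(x, sigmas, eps, steps_per_level, rng):
    """Run Langevin updates at each noise level, from the largest sigma to the smallest."""
    sigma_last = sigmas[-1]
    for sigma in sigmas:
        alpha = eps * (sigma / sigma_last) ** 2   # step size for this level
        for _ in range(steps_per_level):
            z = rng.gauss(0.0, 1.0)
            x = x + 0.5 * alpha * perturbed_score(x, sigma) + math.sqrt(alpha) * z
    return x


rng = random.Random(0)
n_levels = 10
sigma_max, sigma_min = 3.0, 0.05
ratio = (sigma_min / sigma_max) ** (1.0 / (n_levels - 1))
sigmas = [sigma_max * ratio ** i for i in range(n_levels)]   # geometric, descending

samples = [
    annealed_langevin(rng.uniform(-5.0, 5.0), sigmas, eps=2.5e-4, steps_per_level=100, rng=rng)
    for _ in range(1000)
]
print(statistics.fmean(samples), statistics.pstdev(samples))
```

The printed mean and standard deviation should be close to 2 and 0.5. In a real NCSN, `perturbed_score` would be replaced by the trained noise-conditional network.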

3.2 Score SDE (Song et al., 2021)

Item	Details
Paper	Score-Based Generative Modeling through Stochastic Differential Equations
Source	ICLR 2021 (Outstanding Paper Award)
Authors	Yang Song, Jascha Sohl-Dickstein, Diederik P. Kingma, Abhishek Kumar, Stefano Ermon, Ben Poole

The diffusion process is generalized to a continuous-time SDE:

Forward SDE:

dx = f(x, t) dt + g(t) dw

Reverse SDE:

dx = [f(x, t) - g(t)^2 nabla_x log p_t(x)] dt + g(t) dw_bar

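Here w is a standard Wiener process, w_bar is a Wiener process running backward in time, and p_t is the marginal density of x at time t.

The framework unifies:
- DDPM, as the Variance Preserving (VP) SDE
- SMLD, as the Variance Exploding (VE) SDE
- newly designed SDEs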
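
The reverse SDE can be integrated numerically once the score of p_t is available. The sketch below is a minimal illustration, assuming a VP-type SDE with constant noise rate, f(x, t) = -0.5 * BETA * x and g(t) = sqrt(BETA), and toy 1-D data from N(2, 0.5^2) so that p_t and its score are known in closed form. The constants and function names are assumptions for the example; in practice a trained score network would replace `score`.

```python
import math
import random
import statistics

BETA = 10.0          # constant noise rate (assumption)
DATA_MEAN = 2.0      # toy data distribution N(2, 0.5^2) (assumption)
DATA_STD = 0.5


def marginal(t):
    """Mean and variance of p_t for dx = -0.5*BETA*x dt + sqrt(BETA) dw."""
    decay = math.exp(-BETA * t)
    mean = DATA_MEAN * math.sqrt(decay)
    var = DATA_STD ** 2 * decay + (1.0 - decay)
    return mean, var


def score(x, t):
    """Exact nabla_x log p_t(x); a trained network would replace this."""
    mean, var = marginal(t)
    return -(x - mean) / var


def reverse_sde_sample(rng, n_steps=500, t_end=1.0):
    """Euler-Maruyama integration of the reverse SDE from t = t_end down to t = 0."""
    dt = t_end / n_steps
    x = rng.gauss(0.0, 1.0)              # prior sample; p_T is close to N(0, 1)
    for i in range(n_steps, 0, -1):
        t = i * dt
        drift = -0.5 * BETA * x - BETA * score(x, t)   # f - g^2 * score
        # Stepping backward in time flips the sign of the drift term.
        x = x - drift * dt + math.sqrt(BETA * dt) * rng.gauss(0.0, 1.0)
    return x


rng = random.Random(0)
samples = [reverse_sde_sample(rng) for _ in range(1000)]
print(statistics.fmean(samples), statistics.pstdev(samples))
```

Starting from standard Gaussian noise, the samples should end up with mean near 2 and standard deviation near 0.5, up to discretization and Monte Carlo error.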

4. Recent Research Trends

4.1 Score Matching with Missing Data (ICML 2025)

Item	Details
Paper	Score Matching with Missing Data
Source	ICML 2025 (Outstanding Paper Award)
Authors	Josh Givens, Song Liu, Henry W. J. Reeve
Affiliation	University of Bristol

Contributions:
- Proposes two methods for applying score matching when the data contain missing values.
- Provides theoretical guarantees under the MCAR (Missing Completely At Random) and MAR (Missing At Random) conditions.
- Performs missing-data imputation and density estimation at the same time.

4.2 Related Directions

Research direction	Description
Conditional Score Matching	Score learning for conditional generation
Guided Diffusion	Classifier and classifier-free guidance
Consistency Models	Score distillation for one-step generation
Flow Matching	An ODE-based extension of score matching

5. Python Implementation Examples

5.1 Basic Denoising Score Matching Implementation

import torch
import torch.nn as nn


class ScoreNetwork(nn.Module):
    """Neural network that approximates the score function."""

    def __init__(self, input_dim: int, hidden_dim: int = 128):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(input_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        return self.net(x)


def denoising_score_matching_loss(
    score_net: nn.Module,
    x: torch.Tensor,
    sigma: float = 0.1,
) -> torch.Tensor:
    """
    Denoising score matching loss.

    Args:
        score_net: Network that approximates the score function
        x: Batch of clean data, shape [B, D]
        sigma: Standard deviation of the noise

    Returns:
        Scalar DSM loss
    """
    # Add noise
    noise = torch.randn_like(x)
    x_noisy = x + sigma * noise

    # Predict the score at the noisy points
    score_pred = score_net(x_noisy)

    # Target score of the Gaussian perturbation: -noise / sigma
    score_target = -noise / sigma

    # Squared error summed over dimensions, averaged over the batch
    loss = ((score_pred - score_target) ** 2).sum(dim=-1).mean()
    return loss


# Training example
def train_score_model(
    data: torch.Tensor,
    epochs: int = 1000,
    sigma: float = 0.1,
    lr: float = 1e-3,
) -> ScoreNetwork:
    """Train the score network with full-batch gradient steps."""
    input_dim = data.shape[1]
    model = ScoreNetwork(input_dim)
    optimizer = torch.optim.Adam(model.parameters(), lr=lr)

    for epoch in range(epochs):
        optimizer.zero_grad()
        loss = denoising_score_matching_loss(model, data, sigma)
        loss.backward()
        optimizer.step()

        if (epoch + 1) % 100 == 0:
            print(f"Epoch {epoch + 1}, Loss: {loss.item():.4f}")

    return model

5.2 Langevin Dynamics Sampling

@torch.no_grad()
def langevin_dynamics(
    score_net: nn.Module,
    initial_samples: torch.Tensor,
    sigma: float = 0.1,
    step_size: float = 0.01,
    n_steps: int = 1000,
) -> torch.Tensor:
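    """
    Generate samples with (unadjusted) Langevin dynamics.

    Args:
        score_net: Trained score network
        initial_samples: Initial samples (usually noise), shape [B, D]
        sigma: Noise level used during training. It is not used in the update
            itself; the chain targets the sigma-perturbed data distribution
            that score_net was trained on.
        step_size: Langevin step size
        n_steps: Number of sampling steps

    Returns:
        Generated samples, shape [B, D]
    """
    x = initial_samples.clone()

    for _ in range(n_steps):
        score = score_net(x)
        noise = torch.randn_like(x)
        # x <- x + (step_size / 2) * score + sqrt(step_size) * z
        x = x + (step_size / 2) * score + (step_size ** 0.5) * noise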

    return x


# Usage example
if __name__ == "__main__":
    # Generate data from a 2D Gaussian mixture
    n_samples = 1000
    centers = torch.tensor([[2.0, 2.0], [-2.0, -2.0], [2.0, -2.0], [-2.0, 2.0]])
    labels = torch.randint(0, 4, (n_samples,))
    data = centers[labels] + 0.3 * torch.randn(n_samples, 2)

    # Train the score network
    model = train_score_model(data, epochs=2000, sigma=0.5)

    # Generate samples
    initial = torch.randn(500, 2) * 3
    samples = langevin_dynamics(model, initial, sigma=0.5, n_steps=500)
    print(f"Generated {samples.shape[0]} samples")

5.3 Multi-scale Score Matching (NCSN Style)

class NCSNScoreNetwork(nn.Module):
    """Score network conditioned on the noise level."""

    def __init__(self, input_dim: int, hidden_dim: int = 256, n_sigmas: int = 10):
        super().__init__()
        self.n_sigmas = n_sigmas

        # Sigma embedding
        self.sigma_embed = nn.Embedding(n_sigmas, hidden_dim)

        self.net = nn.Sequential(
            nn.Linear(input_dim + hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, hidden_dim),
            nn.SiLU(),
            nn.Linear(hidden_dim, input_dim),
        )

    def forward(self, x: torch.Tensor, sigma_idx: torch.Tensor) -> torch.Tensor:
        sigma_emb = self.sigma_embed(sigma_idx)
        h = torch.cat([x, sigma_emb], dim=-1)
        return self.net(h)


def ncsn_loss(
    model: NCSNScoreNetwork,
    x: torch.Tensor,
    sigmas: torch.Tensor,
) -> torch.Tensor:
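    """
    NCSN-style multi-scale score matching loss.

    Args:
        model: Noise-conditional score network
        x: Batch of data, shape [B, D]
        sigmas: 1-D tensor of noise levels, shape [n_sigmas], on the same device as x

    Returns:
        Weighted mean loss over the batch
    """
    batch_size = x.shape[0]
    n_sigmas = len(sigmas)

    # Pick a random noise level for each sample
    sigma_idx = torch.randint(0, n_sigmas, (batch_size,), device=x.device)
    sigma = sigmas[sigma_idx].view(-1, 1)

    # Add noise
    noise = torch.randn_like(x)
    x_noisy = x + sigma * noise

    # Predict the score and build the DSM target
    score_pred = model(x_noisy, sigma_idx)
    score_target = -noise / sigma

    # Weight each sample by sigma^2 so that all noise levels contribute at a
    # similar scale. sigma is squeezed to shape [B] to match the per-sample
    # loss; multiplying [B, 1] by [B] would broadcast to [B, B].
    per_sample = ((score_pred - score_target) ** 2).sum(dim=-1)
    loss = (sigma.squeeze(-1) ** 2) * per_sample
    return loss.mean()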

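6. Practical Application Guide

6.1 Hyperparameter Selection

Parameter	Recommended range	Notes
sigma (single scale)	0.1 - 1.0	Adjust to the scale of the data
sigma_min (multi-scale)	0.01	Captures fine structure in the data
sigma_max (multi-scale)	1-2x the data range	Learns global structure
n_sigmas	10 - 100	Spaced on a log scale
Langevin steps	100 - 1000	Trade-off between quality and speed
step_size	sigma^2 / 10	Keep small for stable convergence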
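
As an illustration of the table, the sketch below builds a log-spaced noise schedule and applies the sigma^2 / 10 rule of thumb for the step size. The helper name `geometric_sigmas` and the example values (sigma_max = 5.0, 10 levels) are assumptions chosen for the demonstration.

```python
import math
from typing import List


def geometric_sigmas(sigma_max: float, sigma_min: float, n_sigmas: int) -> List[float]:
    """Noise levels evenly spaced on a log scale, from sigma_max down to sigma_min."""
    log_max, log_min = math.log(sigma_max), math.log(sigma_min)
    return [
        math.exp(log_max + (log_min - log_max) * i / (n_sigmas - 1))
        for i in range(n_sigmas)
    ]


sigmas = geometric_sigmas(sigma_max=5.0, sigma_min=0.01, n_sigmas=10)
step_sizes = [s ** 2 / 10 for s in sigmas]   # sigma^2 / 10 rule of thumb from the table
for s, a in zip(sigmas, step_sizes):
    print(f"sigma={s:.4f}  step_size={a:.6f}")
```

Consecutive levels share a constant ratio, so neighboring noise scales overlap by a similar amount across the whole range. For use with a PyTorch loss such as the one in section 5.3, the list would first be wrapped with torch.tensor(sigmas).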

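6.2 Key Considerations

Training stability:
- If the noise level is too small, score estimation becomes unstable.
- If the noise level is too large, information about the data structure is lost.
- Using an exponential moving average (EMA) of the model weights is recommended.

Sampling quality:
- Annealed Langevin dynamics: decrease sigma gradually from large to small.
- Consider using a Predictor-Corrector sampler.
- Use a sufficient number of sampling steps.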

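7. Comparison of Related Techniques

Technique	Advantages	Disadvantages
Explicit SM	Theoretically exact	Cost of computing the Jacobian
Denoising SM	Efficient, easy to implement	Requires choosing a noise level
Sliced SM	No noise needed, scalable	Variance may increase
Flow Matching	Straight paths, fast sampling	Requires an ODE solver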

8. References

Key Papers

Hyvärinen, A. (2005). Estimation of Non-Normalized Statistical Models by Score Matching. JMLR, 6, 695-709.
Vincent, P. (2011). A Connection Between Score Matching and Denoising Autoencoders. Neural Computation, 23(7), 1661-1674.
Song, Y., & Ermon, S. (2019). Generative Modeling by Estimating Gradients of the Data Distribution. NeurIPS 2019.
Song, Y., Garg, S., Shi, J., & Ermon, S. (2019). Sliced Score Matching: A Scalable Approach to Density and Score Estimation. UAI 2019.
Song, Y., & Ermon, S. (2020). Improved Techniques for Training Score-Based Generative Models. NeurIPS 2020.
Song, Y., Sohl-Dickstein, J., Kingma, D. P., Kumar, A., Ermon, S., & Poole, B. (2021). Score-Based Generative Modeling through Stochastic Differential Equations. ICLR 2021.
Givens, J., Liu, S., & Reeve, H. W. J. (2025). Score Matching with Missing Data. ICML 2025 (Outstanding Paper Award).

Implementations and Tutorials

Yang Song's Blog: yang-song.net/blog/2021/score/
ermongroup/ncsn: github.com/ermongroup/ncsn
Physics-based Deep Learning: physicsbaseddeeplearning.org/probmodels-score.html