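Semi-Supervised Learning Overview¶

Semi-supervised learning is a paradigm that trains on a small labeled set $\mathcal{D}_l = \{(x_i, y_i)\}_{i=1}^l$ together with a large unlabeled set $\mathcal{D}_u = \{x_j\}_{j=1}^u$ (typically $u \gg l$).

Theoretical Foundations¶

Comparison of Learning Paradigms¶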

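Paradigm	Labeled data	Unlabeled data	Learning objective
Supervised learning	All	None	Learn $P(Y \mid X)$
Unsupervised learning	None	All	Learn $P(X)$
Semi-supervised learning	Small amount	Large amount	Learn $P(Y \mid X)$
Self-supervised learning	None	All	Representation learning followed by fine-tuning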

Key Assumptions¶

Semi-supervised learning only helps when the data satisfies certain assumptions:

1. Smoothness Assumption¶

Points that are close to each other are likely to share the same label.

\[\text{If } x_1 \approx x_2 \text{ then } y_1 \approx y_2\]

Used by: Label Propagation, graph-based methods

2. Cluster Assumption¶

Points that belong to the same cluster are likely to share the same label.

\[\text{If } x_1, x_2 \in \text{same cluster} \text{ then } P(y_1 = y_2) \text{ is high}\]

Consequence: the decision boundary should pass through low-density regions (low-density separation).

3. Manifold Assumption¶

High-dimensional data lies on (or near) a low-dimensional manifold.

\[x \in \mathcal{M} \subset \mathbb{R}^D, \quad \dim(\mathcal{M}) \ll D\]

Used by: distance computation along the manifold, learning after dimensionality reduction

References:
- Chapelle, O., Schölkopf, B., & Zien, A. (2006). "Semi-Supervised Learning". MIT Press.
- Zhu, X. (2005). "Semi-Supervised Learning Literature Survey". Technical Report, UW-Madison.

Algorithm Taxonomy¶

Semi-supervised Learning
├── Self-training Methods
│   ├── Pseudo-labeling
│   ├── Noisy Student
│   └── Meta Pseudo Labels
├── Co-training
│   ├── Multi-view Co-training
│   └── Co-regularization
├── Graph-based Methods
│   ├── Label Propagation
│   ├── Label Spreading
│   └── Graph Neural Networks
├── Generative Models
│   ├── Semi-supervised VAE
│   └── Semi-supervised GANs
├── Consistency Regularization
│   ├── Pi-Model
│   ├── Temporal Ensembling
│   ├── Mean Teacher
│   ├── Virtual Adversarial Training (VAT)
│   ├── UDA (Unsupervised Data Augmentation)
│   ├── MixMatch
│   ├── FixMatch
│   └── FlexMatch
├── Entropy Minimization
│   └── Entropy-based methods
└── PU Learning (Positive-Unlabeled)
    ├── Two-step approach
    ├── Cost-sensitive learning
    └── Unbiased estimators

Self-training Methods¶

Pseudo-labeling¶

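The simplest semi-supervised technique: high-confidence predictions on unlabeled data are used as labels.

Algorithm:

1. Train an initial model f_0 on the labeled data
2. for t = 1 to T:
   a. Predict on the unlabeled data: p = f_{t-1}(x_u)
   b. Keep the high-confidence predictions:
      D_pseudo = {(x, argmax(p)) : max(p) > threshold}
   c. Train a new model f_t on D_l ∪ D_pseudo
3. Return f_T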

Loss function:

\[\mathcal{L} = \frac{1}{|D_l|}\sum_{(x,y) \in D_l} \mathcal{L}_{CE}(f(x), y) + \lambda \frac{1}{|D_u|}\sum_{x \in D_u} \mathbf{1}[\max(f(x)) > \tau] \cdot \mathcal{L}_{CE}(f(x), \hat{y})\]

where $\hat{y} = \arg\max f(x)$ is the pseudo-label, $\tau$ is the confidence threshold, and $\lambda$ weights the unlabeled term.
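
The loss above can be evaluated by hand on a few numbers. The following is a minimal sketch in plain Python, assuming the model outputs are already available as probability vectors; the function names and the sample values are hypothetical and serve only as illustration.

```python
import math

def cross_entropy(probs, label):
    """Cross-entropy of one predicted distribution against a hard label."""
    return -math.log(probs[label])

def pseudo_label_loss(labeled, unlabeled_probs, tau=0.95, lam=1.0):
    """labeled: list of (probs, y) pairs; unlabeled_probs: list of prob vectors."""
    sup = sum(cross_entropy(p, y) for p, y in labeled) / len(labeled)
    unsup = 0.0
    for p in unlabeled_probs:
        confidence = max(p)
        if confidence > tau:                # indicator 1[max f(x) > tau]
            y_hat = p.index(confidence)     # pseudo-label: argmax f(x)
            unsup += cross_entropy(p, y_hat)
    unsup /= len(unlabeled_probs)           # normalise by |D_u|, not by the kept count
    return sup + lam * unsup, sup, unsup

# Hypothetical model outputs for 2 labeled and 3 unlabeled samples
labeled = [([0.8, 0.2], 0), ([0.3, 0.7], 1)]
unlabeled = [[0.97, 0.03], [0.60, 0.40], [0.02, 0.98]]
total, sup, unsup = pseudo_label_loss(labeled, unlabeled)
print(f"supervised={sup:.4f} unsupervised={unsup:.4f} total={total:.4f}")
```

Only the first and third unlabeled samples pass the threshold; the second (confidence 0.60) contributes nothing to the unlabeled term but still counts in the $|D_u|$ denominator.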

Limitations:
- Confirmation bias: early mistakes are reinforced and accumulate over rounds
- Can worsen class imbalance

Reference:
- Lee, D.H. (2013). "Pseudo-Label: The Simple and Efficient Semi-Supervised Learning Method for Deep Neural Networks". ICML Workshop.

Noisy Student Training¶

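An extension of self-training that exploits large-scale unlabeled data.

Key ideas:
1. A teacher model generates pseudo-labels for the unlabeled data
2. The student model is equal to or larger than the teacher and is trained with noise (data augmentation, dropout)
3. The trained student becomes the new teacher, and the process repeats

ImageNet result: EfficientNet-L2 trained with Noisy Student reached 88.4% top-1 accuracy, state of the art at the time of publication.

Reference:
- Xie, Q. et al. (2020). "Self-Training With Noisy Student Improves ImageNet Classification". CVPR.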

Co-training¶

Multi-view Co-training¶

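Two classifiers are trained on different views (feature subsets) of the data and supply labels to each other.

Assumptions:
- The two views are conditionally independent given the label: $P(X_1, X_2 | Y) = P(X_1 | Y) P(X_2 | Y)$
- Each view alone carries enough information to predict the label

Algorithm:

1. Initialize: train one classifier per view, f_1 and f_2, on the labeled data
2. Repeat:
   a. Add the samples that f_1 predicts with high confidence to the training data of f_2
   b. Add the samples that f_2 predicts with high confidence to the training data of f_1
   c. Retrain both classifiers
3. Final prediction: ensemble of f_1 and f_2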
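
The steps above can be sketched in a few lines of NumPy. This is an illustrative toy under stated assumptions, not a reference implementation: `CentroidClassifier` is a hypothetical stand-in for any probabilistic classifier, the two views are synthetic 1-D features with independent noise, and the confidence threshold and round count are arbitrary.

```python
import numpy as np

class CentroidClassifier:
    """Tiny stand-in classifier: nearest class centroid, softmax over -distance^2."""
    def fit(self, X, y):
        self.classes_ = np.unique(y)
        self.centroids_ = np.stack([X[y == c].mean(axis=0) for c in self.classes_])
        return self

    def predict_proba(self, X):
        d2 = ((X[:, None, :] - self.centroids_[None, :, :]) ** 2).sum(axis=2)
        logits = -d2 - (-d2).max(axis=1, keepdims=True)
        e = np.exp(logits)
        return e / e.sum(axis=1, keepdims=True)

def co_train(X1, X2, y, labeled_idx, rounds=5, threshold=0.9):
    """X1, X2: the two views. y is only read at labeled_idx."""
    views = [X1, X2]
    # One training pool (sample index -> label) per classifier
    pools = [{int(i): int(y[i]) for i in labeled_idx} for _ in range(2)]
    clfs = [CentroidClassifier(), CentroidClassifier()]

    def refit():
        for k in range(2):
            idx = np.array(sorted(pools[k]))
            labels = np.array([pools[k][i] for i in idx])
            clfs[k].fit(views[k][idx], labels)

    refit()                                       # step 1: train on labeled data
    for _ in range(rounds):                       # step 2: exchange confident labels
        added = 0
        for k in range(2):
            proba = clfs[k].predict_proba(views[k])
            for i in np.flatnonzero(proba.max(axis=1) >= threshold):
                if int(i) not in pools[1 - k]:    # f_k teaches the other classifier
                    pools[1 - k][int(i)] = int(clfs[k].classes_[proba[i].argmax()])
                    added += 1
        if added == 0:
            break
        refit()                                   # step 2c: retrain both classifiers
    # step 3: ensemble by averaging class probabilities
    proba = (clfs[0].predict_proba(X1) + clfs[1].predict_proba(X2)) / 2
    return clfs[0].classes_[proba.argmax(axis=1)]

rng = np.random.default_rng(0)
y_true = np.repeat([0, 1], 50)
X1 = (4.0 * y_true + rng.normal(0, 0.5, 100)).reshape(-1, 1)    # view 1
X2 = (-4.0 * y_true + rng.normal(0, 0.5, 100)).reshape(-1, 1)   # view 2, independent noise
pred = co_train(X1, X2, y_true, labeled_idx=[0, 1, 50, 51])
accuracy = (pred == y_true).mean()
print(f"accuracy with 4 labels: {accuracy:.2f}")
```

Each classifier keeps its own pool, so a sample labeled confidently by one view enlarges only the other view's training set, which is the mechanism that lets the two views correct each other.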

Reference:
- Blum, A. & Mitchell, T. (1998). "Combining Labeled and Unlabeled Data with Co-Training". COLT.

Graph-based Methods¶

Label Propagation¶

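Labels are propagated through a graph built over the data.

Graph construction:
- Nodes: all data points (labeled and unlabeled)
- Edge weights: similarity (e.g., RBF kernel)

\[W_{ij} = \exp\left(-\frac{\|x_i - x_j\|^2}{2\sigma^2}\right)\]

Propagation equation:

\[Y^{(t+1)} = \alpha S Y^{(t)} + (1-\alpha) Y^{(0)}\]

where:
- $S = D^{-1/2} W D^{-1/2}$: symmetrically normalized adjacency matrix, with $D$ the diagonal degree matrix, $D_{ii} = \sum_j W_{ij}$
- $Y^{(0)}$: initial label matrix (one-hot rows for known labels, zero rows otherwise)
- $\alpha \in (0, 1)$: propagation rate

This symmetric normalization with soft clamping is the form used by Zhou et al. (2004). The original formulation of Zhu & Ghahramani (2002) instead iterates with a random-walk transition matrix built from $W$ and resets (clamps) the labeled rows after every step.

Solution at convergence:

\[Y^* = (I - \alpha S)^{-1} (1-\alpha) Y^{(0)}\]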
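
As a quick check that the iteration and the closed-form solution agree, here is a minimal NumPy sketch on six 1-D points. The data, $\sigma$, $\alpha$, and the choice of a zero diagonal in $W$ are assumptions made for illustration.

```python
import numpy as np

# Toy 1-D data: two well-separated groups with one labeled point in each
X = np.array([[0.0], [0.2], [0.4], [3.0], [3.2], [3.4]])
Y0 = np.zeros((6, 2))
Y0[0, 0] = 1.0      # point 0 is labeled class 0
Y0[5, 1] = 1.0      # point 5 is labeled class 1

sigma, alpha = 0.5, 0.9                       # illustrative values

# RBF edge weights W_ij; the zero diagonal (no self-loops) is an assumption here
W = np.exp(-((X - X.T) ** 2) / (2 * sigma ** 2))
np.fill_diagonal(W, 0.0)

# Symmetric normalisation S = D^{-1/2} W D^{-1/2}
d = W.sum(axis=1)
S = W / np.sqrt(np.outer(d, d))

# Iterate Y <- alpha * S Y + (1 - alpha) * Y0
Y = Y0.copy()
for _ in range(500):
    Y = alpha * S @ Y + (1 - alpha) * Y0

# Closed form: Y* = (1 - alpha) (I - alpha S)^{-1} Y0
Y_star = (1 - alpha) * np.linalg.solve(np.eye(6) - alpha * S, Y0)

labels = Y.argmax(axis=1)
print(labels)
```

Using `np.linalg.solve` rather than an explicit inverse is the usual numerically safer choice; for large graphs the iterative form is preferred because it only needs sparse matrix products.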


Label Spreading¶

A regularized variant of Label Propagation that is more robust to label noise. It minimizes:

\[\mathcal{L}(Y) = \frac{1}{2}\sum_{i,j} W_{ij}\left\|\frac{Y_i}{\sqrt{D_{ii}}} - \frac{Y_j}{\sqrt{D_{jj}}}\right\|^2 + \mu \sum_i \|Y_i - Y_i^{(0)}\|^2\]
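
The link between this objective and the propagation iteration can be made explicit. The short derivation below is a sketch that assumes a symmetric $W$ and reuses $S = D^{-1/2} W D^{-1/2}$ with $D_{ii} = \sum_j W_{ij}$ from the Label Propagation section. The smoothness term is a quadratic form in the normalized graph Laplacian:

\[\frac{1}{2}\sum_{i,j} W_{ij}\left\|\frac{Y_i}{\sqrt{D_{ii}}} - \frac{Y_j}{\sqrt{D_{jj}}}\right\|^2 = \operatorname{tr}\left(Y^\top (I - S)\, Y\right)\]

Setting the gradient to zero gives the minimizer:

\[\nabla_Y \mathcal{L} = 2(I - S)Y + 2\mu\,(Y - Y^{(0)}) = 0 \;\Longrightarrow\; Y^* = \mu\left((1+\mu)I - S\right)^{-1} Y^{(0)}\]

Substituting $\alpha = \frac{1}{1+\mu}$, so that $1 - \alpha = \frac{\mu}{1+\mu}$:

\[Y^* = (1-\alpha)(I - \alpha S)^{-1} Y^{(0)}\]

This is the same fixed point as the propagation equation, so a larger $\mu$ (smaller $\alpha$) keeps the solution closer to the initial labels.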

Reference:
- Zhou, D. et al. (2004). "Learning with Local and Global Consistency". NeurIPS.

Consistency Regularization¶

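Core Idea¶

Predictions should stay consistent when the input is perturbed:

\[\mathcal{L}_{consistency} = d(f(x), f(\tilde{x}))\]

where $\tilde{x}$ is a perturbed version of $x$ (augmentation, noise, dropout, etc.) and $d$ is a distance or divergence between predictions, such as squared error or KL divergence.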

Pi-Model¶

\[\mathcal{L} = \mathcal{L}_{supervised} + \lambda \mathbb{E}_{x \sim D_u}\left[\|f(x; \theta) - f(\tilde{x}; \theta)\|^2\right]\]
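
To make the unlabeled term concrete, here is a minimal sketch in plain Python. It assumes a hypothetical linear softmax classifier `f` and Gaussian input noise as the perturbation; the weights, samples, and noise level are made up for illustration.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def f(x, W):
    """Hypothetical linear classifier: softmax(W x)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def pi_consistency(x, W, noise_std, rng):
    """Squared difference between predictions on x and on a noisy copy of x."""
    x_tilde = [xi + rng.gauss(0.0, noise_std) for xi in x]
    p, p_tilde = f(x, W), f(x_tilde, W)
    return sum((a - b) ** 2 for a, b in zip(p, p_tilde))

W = [[1.0, -0.5], [-1.0, 0.5]]       # toy weights, 2 classes x 2 features
unlabeled = [[0.2, 0.1], [1.5, -0.3], [-0.7, 0.9]]
rng = random.Random(0)

# Monte Carlo estimate of E_x[ ||f(x) - f(x~)||^2 ] over the unlabeled batch
loss_u = sum(pi_consistency(x, W, 0.3, rng) for x in unlabeled) / len(unlabeled)
# With no perturbation the two predictions coincide and the term vanishes
loss_zero = sum(pi_consistency(x, W, 0.0, rng) for x in unlabeled) / len(unlabeled)
print(f"consistency loss: {loss_u:.6f} (noise) vs {loss_zero:.6f} (no noise)")
```

No labels appear anywhere in this term, which is why it can be computed on $D_u$ and simply added to the supervised loss with weight $\lambda$.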

Reference:
- Laine, S. & Aila, T. (2017). "Temporal Ensembling for Semi-Supervised Learning". ICLR.

Mean Teacher¶

Uses a student and a teacher model, where the teacher's weights are an exponential moving average (EMA) of the student's weights:

\[\theta_{teacher}^{(t)} = \alpha \theta_{teacher}^{(t-1)} + (1-\alpha) \theta_{student}^{(t)}\]

Consistency loss:

\[\mathcal{L}_{consistency} = \mathbb{E}[\|f_{student}(x) - f_{teacher}(\tilde{x})\|^2]\]
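
The EMA update is easy to trace numerically. The sketch below uses plain Python dictionaries as stand-ins for parameter tensors; the parameter names and the value $\alpha = 0.9$ are assumptions chosen so the numbers are easy to follow (practical values are typically much closer to 1).

```python
def ema_update(teacher, student, alpha=0.99):
    """Return teacher weights after one EMA step towards the student weights."""
    return {name: alpha * teacher[name] + (1 - alpha) * student[name]
            for name in teacher}

teacher = {"w": 0.0, "b": 1.0}
student = {"w": 1.0, "b": 1.0}     # held fixed here; in training it changes every step

for step in range(3):
    teacher = ema_update(teacher, student, alpha=0.9)
    print(step, teacher)
```

With a fixed student value of 1.0 the teacher weight follows $1 - \alpha^t$ (0.1, 0.19, 0.271), which shows how the teacher lags behind and smooths the student's trajectory. The teacher receives no gradient; only the student is updated by the optimizer.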

Reference:
- Tarvainen, A. & Valpola, H. (2017). "Mean Teachers are Better Role Models". NeurIPS.

Virtual Adversarial Training (VAT)¶

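Enforces consistency under the perturbation that changes the prediction the most (the adversarial direction), which is found without using labels: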

\[r_{adv} = \arg\max_{\|r\| \leq \epsilon} D_{KL}(p(y|x) \| p(y|x+r))\]

\[\mathcal{L}_{VAT} = D_{KL}(p(y|x) \| p(y|x+r_{adv}))\]
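
In practice $r_{adv}$ is approximated rather than solved exactly, commonly with a power-iteration step on the KL divergence. The sketch below illustrates one such step under strong simplifying assumptions: a hypothetical linear softmax model, for which the gradient of the KL term with respect to $r$ has the closed form $W^\top (q - p)$, a single iteration, and arbitrary values for `eps` and `xi`. A deep network would obtain the same gradient by automatic differentiation.

```python
import math
import random

def softmax(z):
    m = max(z)
    e = [math.exp(v - m) for v in z]
    s = sum(e)
    return [v / s for v in e]

def predict(W, x):
    """Hypothetical linear softmax model p(y|x) = softmax(W x)."""
    return softmax([sum(w * xi for w, xi in zip(row, x)) for row in W])

def kl(p, q):
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

def normalize(v):
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def vat_perturbation(W, x, eps=0.5, xi=1e-2, seed=0):
    """One power-iteration step towards argmax_{||r|| <= eps} KL(p(.|x) || p(.|x+r))."""
    rng = random.Random(seed)
    p = predict(W, x)
    d = normalize([rng.gauss(0.0, 1.0) for _ in x])          # random unit direction
    q = predict(W, [a + xi * b for a, b in zip(x, d)])       # prediction at x + xi*d
    # Gradient of KL(p || q(r)) w.r.t. r for a linear softmax model: W^T (q - p)
    g = [sum(W[c][j] * (q[c] - p[c]) for c in range(len(W))) for j in range(len(x))]
    return [eps * c for c in normalize(g)]

W = [[1.0, 0.0], [0.0, 1.0]]
x = [0.3, -0.2]
r_adv = vat_perturbation(W, x)
p_clean = predict(W, x)
p_adv = predict(W, [a + b for a, b in zip(x, r_adv)])
loss_vat = kl(p_clean, p_adv)
print(r_adv, loss_vat)
```

For this two-class model the prediction depends on the input only through the difference of the two weight rows, so the returned perturbation lies along that direction with norm `eps`.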

Reference:
- Miyato, T. et al. (2018). "Virtual Adversarial Training". IEEE TPAMI.

MixMatch¶

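A holistic approach that combines several techniques:

1. Augmentation: apply K augmentations to each unlabeled sample
2. Label guessing: average the predictions over the augmented copies
3. Sharpening: sharpen the averaged distribution (temperature scaling)
4. MixUp: apply MixUp across labeled and unlabeled data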

\[\tilde{y} = \text{Sharpen}\left(\frac{1}{K}\sum_{k=1}^{K} f(Aug_k(x))\right)\]
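
A minimal sketch of the label-guessing step in plain Python follows. It assumes the usual temperature form of sharpening, $\text{Sharpen}(p, T)_i = p_i^{1/T} / \sum_j p_j^{1/T}$, and uses made-up predictions for $K = 2$ augmentations.

```python
def sharpen(p, T=0.5):
    """Raise each probability to 1/T and renormalise; T < 1 lowers the entropy."""
    powered = [pi ** (1.0 / T) for pi in p]
    total = sum(powered)
    return [v / total for v in powered]

def guess_label(predictions, T=0.5):
    """Average the predictions over K augmentations, then sharpen the result."""
    K = len(predictions)
    n_classes = len(predictions[0])
    mean = [sum(pred[c] for pred in predictions) / K for c in range(n_classes)]
    return sharpen(mean, T)

# Hypothetical predictions on K = 2 augmented copies of one unlabeled sample
preds = [[0.7, 0.3], [0.5, 0.5]]
y_tilde = guess_label(preds, T=0.5)
print(y_tilde)
```

The averaged prediction here is (0.6, 0.4); with $T = 0.5$ each entry is squared before renormalizing, which pushes the guess towards the dominant class. With $T = 1$ the distribution is returned unchanged.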

Reference:
- Berthelot, D. et al. (2019). "MixMatch: A Holistic Approach to Semi-Supervised Learning". NeurIPS.

FixMatch¶

A simple but strong approach:

1. Generate pseudo-labels from a weakly augmented view
2. Enforce consistency on a strongly augmented view
3. Use only high-confidence pseudo-labels

\[\mathcal{L}_u = \frac{1}{\mu B}\sum_{b=1}^{\mu B} \mathbf{1}[\max(p_b) \geq \tau] H(\hat{q}_b, p_b^s)\]

where:
- $p_b = f(\text{WeakAug}(u_b))$: prediction on the weak augmentation
- $p_b^s = f(\text{StrongAug}(u_b))$: prediction on the strong augmentation
- $\hat{q}_b = \arg\max(p_b)$: one-hot pseudo-label
- $B$: labeled batch size, $\mu$: ratio of unlabeled to labeled samples per batch, $H$: cross-entropy

Reference:
- Sohn, K. et al. (2020). "FixMatch: Simplifying Semi-Supervised Learning with Consistency and Confidence". NeurIPS.

FlexMatch¶

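Applies class-wise adaptive thresholds to FixMatch:

\[\tau_c(t) = \tau \cdot \frac{\sigma_c(t)}{\max_{c'} \sigma_{c'}(t)}\]

where $\sigma_c(t)$ measures the learning status of class $c$ at step $t$: the number of unlabeled samples currently predicted as class $c$ with confidence above $\tau$.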
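
The threshold rule can be sketched in a few lines of plain Python. The sketch covers only the linear scaling shown in the formula above; the fallback for the case where no class has a confident prediction yet is an assumption of this example, and the predictions are made up.

```python
def learning_status(probs, n_classes, tau=0.95):
    """sigma_c: number of unlabeled samples predicted as class c with confidence > tau."""
    sigma = [0] * n_classes
    for p in probs:
        confidence = max(p)
        if confidence > tau:
            sigma[p.index(confidence)] += 1
    return sigma

def class_thresholds(sigma, tau=0.95):
    """tau_c = tau * sigma_c / max_c' sigma_c'."""
    top = max(sigma)
    if top == 0:            # assumption: with no confident predictions yet, keep tau
        return [tau] * len(sigma)
    return [tau * s / top for s in sigma]

# Hypothetical predictions on 10 unlabeled samples, 3 classes
probs = ([[0.97, 0.02, 0.01]] * 4      # confidently class 0
         + [[0.01, 0.98, 0.01]] * 2    # confidently class 1
         + [[0.02, 0.02, 0.96]]        # confidently class 2
         + [[0.50, 0.30, 0.20]] * 3)   # below the threshold, not counted
sigma = learning_status(probs, n_classes=3)
thresholds = class_thresholds(sigma)
print(sigma, thresholds)
```

Classes that the model has learned less well (fewer confident predictions) get a lower threshold, so more of their samples are admitted as pseudo-labels.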

Reference:
- Zhang, B. et al. (2021). "FlexMatch: Boosting Semi-Supervised Learning with Curriculum Pseudo Labeling". NeurIPS.

PU Learning (Positive-Unlabeled Learning)¶

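Problem Definition¶

Only positive samples $\mathcal{P}$ and unlabeled samples $\mathcal{U}$ (a mixture of positives and negatives) are available.

Applications:
- Anomaly detection (only anomalies are labeled)
- Disease diagnosis (only confirmed cases are labeled)
- Recommender systems (only clicks are observed)
- Document classification (only relevant documents are labeled)

Selected Completely At Random (SCAR) Assumption¶

The probability that a positive sample gets labeled does not depend on its features:

\[P(s=1|x, y=1) = P(s=1|y=1) = c\]

Here $s = 1$ means the sample is labeled. Since only positives are ever labeled, $P(s=1|x) = P(s=1|x, y=1)\,P(y=1|x) = c \cdot P(y=1|x)$, and therefore:

\[P(y=1|x) = \frac{P(s=1|x)}{c}\]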
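
A small numeric sketch of how this identity is used: a classifier $g(x) \approx P(s=1|x)$ is trained to separate labeled from unlabeled samples, $c$ is estimated as the average of $g$ over held-out labeled positives (one of the estimators proposed by Elkan & Noto), and the scores are rescaled. The score values below are hypothetical, and the estimate of $c$ additionally assumes that labeled positives have $P(y=1|x) \approx 1$.

```python
def estimate_c(g_on_labeled_positives):
    """Estimate c = P(s=1 | y=1) as the mean score g(x) over held-out labeled positives."""
    return sum(g_on_labeled_positives) / len(g_on_labeled_positives)

def posterior(g_x, c):
    """P(y=1 | x) = P(s=1 | x) / c, clipped to 1 because g and c are only estimates."""
    return min(1.0, g_x / c)

# Hypothetical outputs of g on three held-out labeled positives
g_pos = [0.30, 0.25, 0.35]
c_hat = estimate_c(g_pos)

# A new sample with g(x) = 0.15 is half as likely to be labeled as a sure positive
p_new = posterior(0.15, c_hat)
print(f"c_hat={c_hat:.2f}, P(y=1|x)={p_new:.2f}")
```

The clipping is a practical safeguard: with a noisy $g$ or a slightly underestimated $c$, the ratio can exceed 1.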

Two-step Approach¶

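Step 1: Identify reliable negatives
- Spy technique: mix some positives ("spies") into the unlabeled set and use the scores they receive to set the boundary
- 1-DNF: collect as negative candidates the unlabeled samples that contain none of the features characteristic of the positives
- Cosine similarity: select samples that are far from the positives

Step 2: Train a standard classifier
- Train on positives + reliable negatives

Unbiased PU Learning¶

Unbiased risk estimator: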

\[\hat{R}_{pu}(f) = \pi_p \hat{R}_p^+(f) + \hat{R}_u^-(f) - \pi_p \hat{R}_p^-(f)\]

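where:
- $\pi_p = P(Y=1)$: class prior
- $\hat{R}_p^+(f)$: loss on positive data when they are treated as positive
- $\hat{R}_u^-(f)$: loss on unlabeled data when they are treated as negative
- $\hat{R}_p^-(f)$: loss on positive data when they are treated as negative

non-negative PU (nnPU):

With a flexible model the empirical term $\hat{R}_u^-(f) - \pi_p \hat{R}_p^-(f)$, which estimates the risk on negatives, can become negative and cause overfitting. nnPU clips it at zero:

\[\tilde{R}_{pu}(f) = \pi_p \hat{R}_p^+(f) + \max(0, \hat{R}_u^-(f) - \pi_p \hat{R}_p^-(f))\]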
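
The difference between the two estimators shows up on a small example. The sketch below evaluates both risks in plain Python, assuming the sigmoid loss $\ell(z, y) = 1 / (1 + e^{yz})$ and hypothetical classifier scores; it computes the risk values only and omits the corrective gradient step that the full nnPU training procedure applies when the clipped term goes negative.

```python
import math

def sigmoid_loss(z, y):
    """l(z, y) = 1 / (1 + exp(y * z)): small when the sign of z agrees with y."""
    return 1.0 / (1.0 + math.exp(y * z))

def mean_loss(scores, y):
    return sum(sigmoid_loss(z, y) for z in scores) / len(scores)

def pu_risks(scores_p, scores_u, prior):
    r_p_pos = mean_loss(scores_p, +1)    # positives treated as positive
    r_p_neg = mean_loss(scores_p, -1)    # positives treated as negative
    r_u_neg = mean_loss(scores_u, -1)    # unlabeled treated as negative
    negative_part = r_u_neg - prior * r_p_neg
    unbiased = prior * r_p_pos + negative_part
    non_negative = prior * r_p_pos + max(0.0, negative_part)
    return unbiased, non_negative

# Hypothetical scores from an overfit model: every unlabeled sample is pushed
# far to the negative side even though the prior says half of them are positive
scores_p = [3.0, 2.5]
scores_u = [-3.0, -2.5, -3.5]
r_pu, r_nnpu = pu_risks(scores_p, scores_u, prior=0.5)
print(f"unbiased risk={r_pu:.4f}, non-negative risk={r_nnpu:.4f}")
```

Here the unbiased estimate is negative, which is impossible for a true risk; the non-negative version removes that incentive to keep pushing unlabeled samples to the negative side.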

References:
- Elkan, C. & Noto, K. (2008). "Learning Classifiers from Only Positive and Unlabeled Data". KDD.
- Kiryo, R. et al. (2017). "Positive-Unlabeled Learning with Non-Negative Risk Estimator". NeurIPS.

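Practical Guide¶

When to Use Semi-Supervised Learning?¶

Situation	Recommendation	Reason
Labeling is very expensive	Strongly recommended	Reduces labeling cost
Unlabeled data is abundant	Recommended	Exploits additional information
Classes are clearly separated	Recommended	Cluster assumption holds
Data is very noisy	Caution	Risk of confirmation bias
Class imbalance is severe	Caution	Pseudo-labels become biased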

Pseudo-labeling Implementation¶

import numpy as np
import torch
import torch.nn as nn
import torch.optim as optim
from torch.utils.data import DataLoader, TensorDataset

class PseudoLabelTrainer:
    def __init__(self, model, threshold=0.95, lambda_u=1.0):
        self.model = model
        self.threshold = threshold
        self.lambda_u = lambda_u
        self.criterion = nn.CrossEntropyLoss(reduction='none')

    def train_epoch(self, labeled_loader, unlabeled_loader, optimizer):
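        self.model.train()
        total_loss = 0
        # Use whichever device the model lives on instead of hard-coding CUDA
        device = next(self.model.parameters()).device

        for (x_l, y_l), x_u in zip(labeled_loader, unlabeled_loader):
            x_l, y_l = x_l.to(device), y_l.to(device)
            # Batches from a TensorDataset arrive as a list, not a tuple
            x_u = x_u[0] if isinstance(x_u, (list, tuple)) else x_u
            x_u = x_u.to(device)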

            # Labeled loss
            logits_l = self.model(x_l)
            loss_l = self.criterion(logits_l, y_l).mean()

            # Pseudo-labeling
            with torch.no_grad():
                logits_u = self.model(x_u)
                probs_u = torch.softmax(logits_u, dim=1)
                max_probs, pseudo_labels = probs_u.max(dim=1)
                mask = max_probs > self.threshold

            # Unlabeled loss (only high confidence)
            if mask.sum() > 0:
                logits_u_masked = self.model(x_u[mask])
                loss_u = self.criterion(logits_u_masked, pseudo_labels[mask]).mean()
            else:
                loss_u = 0

            # Total loss
            loss = loss_l + self.lambda_u * loss_u

            optimizer.zero_grad()
            loss.backward()
            optimizer.step()

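            total_loss += loss.item()

        # zip() stops at the shorter loader, so average over that many steps
        n_steps = min(len(labeled_loader), len(unlabeled_loader))
        return total_loss / max(n_steps, 1)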

FixMatch-style Implementation¶

import torch
import torch.nn.functional as F  # F.cross_entropy is used below
import torchvision.transforms as T
from torch.cuda.amp import autocast, GradScaler

class FixMatchTrainer:
    def __init__(self, model, threshold=0.95, lambda_u=1.0, T_sharp=0.5):
        self.model = model
        self.threshold = threshold
        self.lambda_u = lambda_u
        self.T = T_sharp

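        # Augmentations
        # Caveat: when called on a batched tensor, these torchvision transforms draw
        # one set of random parameters for the whole batch, and RandAugment is
        # documented for uint8 image tensors. For per-sample augmentation, apply
        # them inside the Dataset instead.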
        self.weak_aug = T.Compose([
            T.RandomHorizontalFlip(),
            T.RandomCrop(32, padding=4),
        ])

        self.strong_aug = T.Compose([
            T.RandomHorizontalFlip(),
            T.RandomCrop(32, padding=4),
            T.RandAugment(n=2, m=10),  # RandAugment for strong aug
        ])

    def train_step(self, x_l, y_l, x_u, optimizer, scaler):
        self.model.train()

        # Weak and strong augmentation for unlabeled
        x_u_weak = self.weak_aug(x_u)
        x_u_strong = self.strong_aug(x_u)

        with autocast():
            # Labeled loss
            logits_l = self.model(x_l)
            loss_l = F.cross_entropy(logits_l, y_l)

            # Pseudo-labels from weak augmentation
            with torch.no_grad():
                logits_u_weak = self.model(x_u_weak)
                probs_u = torch.softmax(logits_u_weak / self.T, dim=1)
                max_probs, pseudo_labels = probs_u.max(dim=1)
                mask = (max_probs >= self.threshold).float()

            # Consistency loss on strong augmentation, averaged over the whole
            # unlabeled batch (masked-out samples contribute zero), matching the
            # 1 / (mu * B) normalisation of the FixMatch loss
            logits_u_strong = self.model(x_u_strong)
            loss_u = (F.cross_entropy(logits_u_strong, pseudo_labels,
                                      reduction='none') * mask).mean()

            loss = loss_l + self.lambda_u * loss_u

        optimizer.zero_grad()
        scaler.scale(loss).backward()
        scaler.step(optimizer)
        scaler.update()

        return loss_l.item(), loss_u.item(), mask.float().mean().item()

Using Label Propagation¶

from sklearn.semi_supervised import LabelPropagation, LabelSpreading
from sklearn.preprocessing import StandardScaler
import numpy as np

# Prepare the data
X_labeled = ...  # (n_labeled, n_features)
y_labeled = ...  # (n_labeled,)
X_unlabeled = ...  # (n_unlabeled, n_features)

# Combine labeled and unlabeled data
X_all = np.vstack([X_labeled, X_unlabeled])
y_all = np.hstack([y_labeled, np.full(len(X_unlabeled), -1)])  # -1 marks unlabeled samples

# Scaling
scaler = StandardScaler()
X_all_scaled = scaler.fit_transform(X_all)

# Label Propagation
lp = LabelPropagation(
    kernel='rbf',
    gamma=20,  # RBF kernel parameter
    n_neighbors=7,  # only used when kernel='knn'
    max_iter=1000,
    tol=1e-3
)
lp.fit(X_all_scaled, y_all)

# Inspect the propagated labels
propagated_labels = lp.transduction_
label_distributions = lp.label_distributions_

# Check the confidence of the propagated labels
confidence = label_distributions.max(axis=1)
print(f"Average confidence: {confidence.mean():.4f}")
print(f"Min confidence: {confidence.min():.4f}")

# Label Spreading (more robust to noise)
ls = LabelSpreading(
    kernel='rbf',
    gamma=20,
    alpha=0.2,  # clamping factor
    max_iter=30
)
ls.fit(X_all_scaled, y_all)

PU Learning Implementation¶

from sklearn.base import BaseEstimator, ClassifierMixin
import numpy as np

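class PUClassifier(BaseEstimator, ClassifierMixin):
    """
    Positive-Unlabeled learning via iterative sample re-weighting.

    This is a cost-sensitive heuristic: unlabeled samples are treated as
    negatives and re-weighted each round. It does not implement the
    unbiased (or non-negative) PU risk estimator.
    """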
    def __init__(self, base_estimator, prior=None, n_iterations=10):
        self.base_estimator = base_estimator
        self.prior = prior  # P(Y=1)
        self.n_iterations = n_iterations

    def fit(self, X_positive, X_unlabeled, prior=None):
        """
        X_positive: Positive samples
        X_unlabeled: Unlabeled samples (mix of P and N)
        prior: P(Y=1), if None, estimate from data
        """
        n_p, n_u = len(X_positive), len(X_unlabeled)

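        # Resolve the prior: fit() argument first, then the constructor value,
        # and only then fall back to the rough estimate below
        if prior is None:
            prior = self.prior
        if prior is None:
            self.prior_ = self._estimate_prior(X_positive, X_unlabeled)
        else:
            self.prior_ = prior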

        # Combine data
        X = np.vstack([X_positive, X_unlabeled])
        y = np.hstack([np.ones(n_p), np.zeros(n_u)])

        # Iterative training with cost-sensitive learning
        for i in range(self.n_iterations):
            # Predict on unlabeled
            if i > 0:
                probs = self.base_estimator.predict_proba(X_unlabeled)[:, 1]
                # Weight unlabeled samples
                w_u = (1 - self.prior_) / (1 - probs + 1e-8)
                w_u = np.clip(w_u, 0.1, 10)
            else:
                w_u = np.ones(n_u)

            sample_weight = np.hstack([np.ones(n_p), w_u])
            self.base_estimator.fit(X, y, sample_weight=sample_weight)

        return self

    def _estimate_prior(self, X_positive, X_unlabeled):
        """Crude placeholder for the class prior P(Y=1); not a principled estimator."""
        # Heuristic ratio n_p / (n_p + 0.5 * n_u). Prefer a proper estimate
        # (e.g., mixture proportion estimation) whenever one is available.
        return len(X_positive) / (len(X_positive) + len(X_unlabeled) * 0.5)

    def predict_proba(self, X):
        return self.base_estimator.predict_proba(X)

    def predict(self, X):
        return self.base_estimator.predict(X)

Sub-documents¶

Topic	Description	Link
PU Learning	Positive-Unlabeled learning in detail	pu-learning.md

References¶

Textbooks¶

Chapelle, O., Schölkopf, B., & Zien, A. (2006). "Semi-Supervised Learning". MIT Press.
Zhu, X. & Goldberg, A.B. (2009). "Introduction to Semi-Supervised Learning". Morgan & Claypool.

Key Papers¶

Foundational:
- Zhu, X. (2005). "Semi-Supervised Learning Literature Survey". Technical Report, UW-Madison.
- Blum, A. & Mitchell, T. (1998). "Combining Labeled and Unlabeled Data with Co-Training". COLT.

Self-training:
- Lee, D.H. (2013). "Pseudo-Label". ICML Workshop.
- Xie, Q. et al. (2020). "Noisy Student". CVPR.

Consistency Regularization:
- Tarvainen, A. & Valpola, H. (2017). "Mean Teacher". NeurIPS.
- Miyato, T. et al. (2018). "VAT". IEEE TPAMI.
- Berthelot, D. et al. (2019). "MixMatch". NeurIPS.
- Sohn, K. et al. (2020). "FixMatch". NeurIPS.
- Zhang, B. et al. (2021). "FlexMatch". NeurIPS.

PU Learning:
- Elkan, C. & Noto, K. (2008). "Learning Classifiers from Only Positive and Unlabeled Data". KDD.
- Kiryo, R. et al. (2017). "Positive-Unlabeled Learning with Non-Negative Risk Estimator". NeurIPS.

Implementations and Libraries¶

USB (Unified Semi-supervised Learning Benchmark): https://github.com/microsoft/Semi-supervised-learning
TorchSSL: https://github.com/TorchSSL/TorchSSL
scikit-learn Semi-supervised: https://scikit-learn.org/stable/modules/semi_supervised.html