PU Learning (Positive-Unlabeled Learning)¶

논문 정보¶

항목	내용
제목	Learning Classifiers from Only Positive and Unlabeled Data
저자	Charles Elkan, Keith Noto
학회/저널	ACM SIGKDD International Conference on Knowledge Discovery and Data Mining (KDD)
연도	2008
링크	https://cseweb.ucsd.edu/~elkan/posonly.pdf

개요¶

문제 정의¶

Positive와 Unlabeled 데이터만 존재하는 상황에서의 이진 분류 문제:

Positive 샘플: 확실히 양성인 레이블된 데이터
Unlabeled 샘플: 양성 또는 음성일 수 있는 레이블 없는 데이터
Negative 샘플: 존재하지 않음

실제 응용 사례: - 질병 진단: 확진 환자(P)와 미검사 환자(U) - 추천 시스템: 구매/클릭(P)과 미노출 상품(U) - 텍스트 분류: 관련 문서(P)와 미분류 문서(U)

핵심 아이디어¶

Selected Completely At Random (SCAR) 가정 하에서:

레이블된 양성 샘플은 전체 양성 중에서 무작위로 선택됨
분류기가 출력하는 확률 g(x) = P(s=1|x)는 실제 확률 f(x) = P(y=1|x)와 상수 배 관계
이 상수(label frequency)를 추정하면 올바른 확률 추정 가능

수식으로:

f(x) = g(x) / c

where c = P(s=1|y=1) (label frequency)

알고리즘/방법론¶

SCAR 가정 하의 확률 관계¶

레이블 s와 실제 클래스 y의 관계:

P(s=1|x) = P(s=1|y=1, x) * P(y=1|x)
         = P(s=1|y=1) * P(y=1|x)      (SCAR 가정)
         = c * f(x)

따라서:
f(x) = P(s=1|x) / c = g(x) / c

Label Frequency c 추정 방법¶

방법 1: Hold-out 추정

c = (1/|P|) * sum_{x in P} g(x)

where P = labeled positive set

방법 2: e1 추정자

c = E[g(x) | y=1]

레이블된 양성 데이터의 평균 예측 확률.

방법 3: e2, e3 추정자

e2: max(g(x)) for x in P
e3: g(x)가 특정 임계값 이상인 비율

단계별 알고리즘¶

Algorithm: Two-Step PU Learning

Input: Positive set P, Unlabeled set U
Output: Classifier f(x) = P(y=1|x)

Step 1: Train initial classifier
   - Treat P as positive, U as negative
   - Train classifier g(x) = P(s=1|x)

Step 2: Estimate label frequency c
   - Option A (hold-out): c = mean(g(x)) for x in validation P
   - Option B (direct): c = mean(g(x)) for x in training P

Step 3: Calibrate probabilities
   - f(x) = g(x) / c
   - Clip to [0, 1] if necessary

Step 4: (Optional) Retrain with weights
   - For x in P: weight = 1
   - For x in U: weight = (1 - f(x)) / f(x) for negative contribution

가중치 기반 재학습¶

Unlabeled 샘플을 확률적으로 양성/음성으로 분리:

For each unlabeled example x:
   - Weight as positive: f(x)
   - Weight as negative: 1 - f(x)

Loss function:
L = sum_{x in P} loss(x, 1) + sum_{x in U} [f(x)*loss(x,1) + (1-f(x))*loss(x,0)]

주요 결과¶

이론적 결과¶

정리	내용
Theorem 1	SCAR 가정 하에서 g(x) = c * f(x)
Theorem 2	c는 E[g(x)\|y=1]로 추정 가능
Theorem 3	가중치 기반 학습은 완전 레이블 데이터와 동등

실험 결과¶

UCI 데이터셋에서의 성능 (AUC):

데이터셋	Naive (U=N)	Elkan-Noto	완전 레이블
Adult	0.82	0.88	0.90
Mushroom	0.95	0.99	0.99
Spam	0.91	0.96	0.97

Naive 방법 대비 평균 5-10% AUC 향상
완전 레이블 데이터 성능에 근접

실전 적용 가이드¶

언제 사용하나¶

음성 레이블 수집이 어려운 경우
의료: 미발병 확인 어려움
사기 탐지: 미탐지 사기 존재
양성 레이블이 희소한 경우
추천 시스템의 implicit feedback
희귀 이벤트 예측
SCAR 가정이 합리적인 경우
레이블링이 무작위로 이루어진 경우
선택 편향이 작은 경우

하이퍼파라미터¶

파라미터	설명	권장값
base_estimator	기본 분류기	LogisticRegression, RandomForest
hold_out_ratio	c 추정용 검증 비율	0.1 - 0.2
n_iterations	EM 반복 횟수 (반복적 방법 사용시)	10 - 50

주의사항¶

SCAR 가정 검증: 레이블된 양성이 전체 양성의 무작위 샘플인지 확인. 선택 편향 있으면 성능 저하.
클래스 불균형: Unlabeled에 양성 비율이 매우 낮으면 c 추정 불안정. 더 많은 레이블 데이터 필요.
c 추정 오차: c가 과소추정되면 확률이 1을 초과. 클리핑 또는 정규화 필요.
음성 비율 사전 지식: 가능하다면 도메인 지식으로 c 범위 제한.

Python 코드 예시¶

import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import roc_auc_score, classification_report
from sklearn.datasets import make_classification

# 데이터 생성 및 PU 시나리오 시뮬레이션
def create_pu_data(n_samples=1000, label_frequency=0.3, random_state=42):
    """
    PU learning 시나리오 데이터 생성.

    Parameters:
    -----------
    n_samples : int
    label_frequency : float, P(s=1|y=1) - 양성 중 레이블된 비율

    Returns:
    --------
    X, y_true, s (labeled indicator)
    """
    np.random.seed(random_state)

    X, y_true = make_classification(
        n_samples=n_samples,
        n_features=20,
        n_informative=10,
        n_redundant=5,
        n_classes=2,
        weights=[0.7, 0.3],  # 30% positive
        random_state=random_state
    )

    # SCAR 가정: 양성 중 일부만 레이블
    s = np.zeros(n_samples, dtype=int)
    positive_idx = np.where(y_true == 1)[0]
    n_labeled = int(len(positive_idx) * label_frequency)
    labeled_idx = np.random.choice(positive_idx, n_labeled, replace=False)
    s[labeled_idx] = 1

    return X, y_true, s


class PUClassifier:
    """
    Elkan & Noto (2008) PU Learning 구현.
    """

    def __init__(self, base_estimator=None, hold_out_ratio=0.1):
        self.base_estimator = base_estimator or LogisticRegression(max_iter=1000)
        self.hold_out_ratio = hold_out_ratio
        self.c_ = None

    def fit(self, X, s):
        """
        Parameters:
        -----------
        X : feature matrix
        s : label indicator (1=labeled positive, 0=unlabeled)
        """
        # Hold-out set for c estimation
        labeled_idx = np.where(s == 1)[0]
        n_holdout = max(1, int(len(labeled_idx) * self.hold_out_ratio))

        holdout_idx = np.random.choice(labeled_idx, n_holdout, replace=False)
        train_mask = np.ones(len(s), dtype=bool)
        train_mask[holdout_idx] = False

        X_train, s_train = X[train_mask], s[train_mask]
        X_holdout = X[holdout_idx]

        # Step 1: Train g(x) = P(s=1|x)
        self.base_estimator.fit(X_train, s_train)

        # Step 2: Estimate c using hold-out
        g_holdout = self.base_estimator.predict_proba(X_holdout)[:, 1]
        self.c_ = np.mean(g_holdout)

        # Prevent division issues
        self.c_ = np.clip(self.c_, 0.01, 1.0)

        return self

    def predict_proba(self, X):
        """
        Returns calibrated P(y=1|x) = g(x) / c
        """
        g_x = self.base_estimator.predict_proba(X)[:, 1]
        f_x = g_x / self.c_
        f_x = np.clip(f_x, 0, 1)
        return np.column_stack([1 - f_x, f_x])

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X)[:, 1] >= threshold).astype(int)


# 사용 예시
X, y_true, s = create_pu_data(n_samples=2000, label_frequency=0.4)

# Train/test split
X_train, X_test, y_train, y_test, s_train, s_test = train_test_split(
    X, y_true, s, test_size=0.3, random_state=42
)

# Naive 방법: Unlabeled를 Negative로 취급
naive_clf = LogisticRegression(max_iter=1000)
naive_clf.fit(X_train, s_train)
y_naive = naive_clf.predict_proba(X_test)[:, 1]

# PU Learning
pu_clf = PUClassifier(hold_out_ratio=0.2)
pu_clf.fit(X_train, s_train)
y_pu = pu_clf.predict_proba(X_test)[:, 1]

# 평가
print("=== Performance Comparison ===")
print(f"\nEstimated label frequency c: {pu_clf.c_:.3f}")
print(f"\nNaive (U=N) AUC: {roc_auc_score(y_test, y_naive):.4f}")
print(f"PU Learning AUC: {roc_auc_score(y_test, y_pu):.4f}")

print("\n=== Classification Report (PU Learning) ===")
print(classification_report(y_test, pu_clf.predict(X_test)))

고급: EM 기반 반복적 PU Learning¶

class IterativePUClassifier:
    """
    EM-style iterative PU learning.
    """

    def __init__(self, base_estimator=None, n_iterations=10, tol=1e-4):
        self.base_estimator = base_estimator or LogisticRegression(max_iter=1000)
        self.n_iterations = n_iterations
        self.tol = tol
        self.c_history_ = []

    def fit(self, X, s):
        n_samples = len(s)
        labeled_idx = np.where(s == 1)[0]
        unlabeled_idx = np.where(s == 0)[0]

        # Initial weights
        weights = np.ones(n_samples)
        prev_c = 0

        for iteration in range(self.n_iterations):
            # E-step: Estimate P(y=1|x) for unlabeled
            if iteration == 0:
                # First iteration: treat U as N
                y_train = s.copy()
            else:
                y_train = s.copy().astype(float)
                probs = self.base_estimator.predict_proba(X[unlabeled_idx])[:, 1]
                y_train[unlabeled_idx] = probs / self.c_
                y_train = np.clip(y_train, 0, 1)

            # M-step: Retrain classifier
            sample_weight = np.ones(n_samples)
            sample_weight[unlabeled_idx] = np.abs(2 * y_train[unlabeled_idx] - 1)

            self.base_estimator.fit(X, (y_train > 0.5).astype(int), 
                                    sample_weight=sample_weight)

            # Update c
            g_labeled = self.base_estimator.predict_proba(X[labeled_idx])[:, 1]
            self.c_ = np.clip(np.mean(g_labeled), 0.01, 1.0)
            self.c_history_.append(self.c_)

            # Check convergence
            if abs(self.c_ - prev_c) < self.tol:
                print(f"Converged at iteration {iteration + 1}")
                break
            prev_c = self.c_

        return self

    def predict_proba(self, X):
        g_x = self.base_estimator.predict_proba(X)[:, 1]
        f_x = np.clip(g_x / self.c_, 0, 1)
        return np.column_stack([1 - f_x, f_x])

    def predict(self, X, threshold=0.5):
        return (self.predict_proba(X)[:, 1] >= threshold).astype(int)

기법	가정	장점	단점
Elkan-Noto	SCAR	간단, 이론적 보장	선택 편향에 취약
Bagging-SVM	-	이상치에 강건	계산 비용 높음
Cost-sensitive	비용 비율 필요	유연함	비용 설정 어려움
nnPU	비음성 위험	딥러닝 호환	복잡한 최적화