Contrastive Learning

Metadata

Field	Details
Category	Self-Supervised Learning / Representation Learning
Key papers	"A Simple Framework for Contrastive Learning of Visual Representations" (Chen et al., ICML 2020 - SimCLR), "Momentum Contrast for Unsupervised Visual Representation Learning" (He et al., CVPR 2020 - MoCo), "Bootstrap Your Own Latent" (Grill et al., NeurIPS 2020 - BYOL), "Representation Learning with Contrastive Predictive Coding" (van den Oord et al., 2018 - CPC/InfoNCE)
Key authors	Ting Chen, Geoffrey Hinton (SimCLR); Kaiming He (MoCo); Jean-Bastien Grill, Michal Valko (BYOL); Aaron van den Oord (CPC)
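Core concept	A self-supervised framework that learns a representation space in which similar samples (positive pairs) lie close together and dissimilar samples (negative pairs) lie far apart
Related fields	Self-Supervised Learning, Metric Learning, Transfer Learning, Foundation Models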

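Definition

Contrastive learning is a self-supervised learning paradigm that learns representations without labels by exploiting structural similarity in the data. The core idea is to treat different augmented views of the same sample as a positive pair and views of different samples as negative pairs, then pull positives together and push negatives apart in the representation space.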

Core structure of contrastive learning:

Input x
  |
  +-- augmentation --> view_1 --+
  |                             |-- positive pair (pull together)
  +-- augmentation --> view_2 --+
  |
  +-- view of another sample y --+-- negative pair (push apart)

Loss function:
  L = -log [ exp(sim(z_i, z_j)/tau) / sum_{k != i} exp(sim(z_i, z_k)/tau) ]
            ^                        ^
            similarity of the        sum over all candidates k != i
            positive pair (i, j)     (the positive j and every negative)

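Core principles

1. InfoNCE loss

The standard loss for contrastive learning is InfoNCE, which builds on Noise-Contrastive Estimation (NCE); SimCLR's variant is called NT-Xent. For each anchor it is a softmax cross-entropy that classifies the positive among all other samples in the batch.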

# InfoNCE loss (the NT-Xent variant used in SimCLR)
import torch
import torch.nn.functional as F

def info_nce_loss(z_i, z_j, temperature=0.5):
    """
    z_i, z_j: embeddings of two augmented views of the same batch
    shape: (batch_size, embedding_dim)
    """
    batch_size = z_i.shape[0]
    device = z_i.device

    # L2-normalize so that dot products equal cosine similarities
    z_i = F.normalize(z_i, dim=1)
    z_j = F.normalize(z_j, dim=1)

    # Stack both views: [z_i; z_j] -> (2N, d)
    representations = torch.cat([z_i, z_j], dim=0)

    # Cosine similarity matrix, scaled by temperature: (2N, 2N)
    similarity_matrix = representations @ representations.T / temperature

    # Index of each row's positive: (i, i+N) and (i+N, i) are positive pairs
    labels = torch.cat([
        torch.arange(batch_size, 2 * batch_size, device=device),
        torch.arange(batch_size, device=device)
    ])

    # Remove self-similarity (the diagonal) from the softmax
    mask = torch.eye(2 * batch_size, dtype=torch.bool, device=device)
    similarity_matrix = similarity_matrix.masked_fill(mask, float('-inf'))

    # Cross-entropy over each row: classify the positive among 2N-1 candidates
    loss = F.cross_entropy(similarity_matrix, labels)
    return loss
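As a quick sanity check on the loss above, two limiting cases are easy to reason about by hand. The sketch below uses a compact re-implementation (the name `nt_xent` is introduced here only for illustration) and compares perfectly separated embeddings against fully collapsed ones. In the collapsed case every candidate looks identical, so the loss should equal log(2N - 1), the cost of a uniform guess.

```python
import math
import torch
import torch.nn.functional as F

def nt_xent(z_i, z_j, temperature=0.5):
    """Compact NT-Xent for illustration: same objective, one matmul."""
    n = z_i.shape[0]
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)   # (2N, d)
    sim = z @ z.T / temperature                            # scaled cosine similarities
    sim.fill_diagonal_(float("-inf"))                      # drop self-similarity
    targets = torch.cat([torch.arange(n, 2 * n), torch.arange(n)]).to(z.device)
    return F.cross_entropy(sim, targets)

n, d = 8, 16

# Case 1: every sample has its own orthogonal direction and both views agree.
aligned = torch.eye(n, d)
loss_aligned = nt_xent(aligned, aligned, temperature=0.1).item()

# Case 2: total collapse, all embeddings identical.
collapsed = torch.ones(n, d)
loss_collapsed = nt_xent(collapsed, collapsed, temperature=0.1).item()

print(f"aligned:   {loss_aligned:.4f}")
print(f"collapsed: {loss_collapsed:.4f}  (log(2N-1) = {math.log(2 * n - 1):.4f})")
```

A loss that stays near log(2N - 1) during training is therefore a hint that the encoder is not separating samples at all.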

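2. The role of temperature

tau	Effect	Characteristics
Low (0.05-0.1)	Sharp distribution	Focuses on hard negatives; training can become unstable
Medium (0.1-0.5)	Balanced distribution	SimCLR reports values from 0.1 to 0.5 depending on the setup; MoCo v1 uses 0.07, which falls in the low band
High (1.0+)	Near-uniform distribution	Treats all negatives almost equally; weak learning signal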
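The effect of tau in the table can be seen directly by applying a temperature-scaled softmax to a handful of similarity scores. The sketch below is a minimal illustration; the similarity values are made up for the example and the helper name `softmax_with_temperature` is not from any library.

```python
import math

def softmax_with_temperature(similarities, tau):
    """Softmax over similarity scores scaled by 1/tau."""
    scaled = [s / tau for s in similarities]
    m = max(scaled)                                # subtract max for numerical stability
    exps = [math.exp(s - m) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]

# Hypothetical cosine similarities to one anchor:
# index 0 = positive, index 1 = hard negative, indices 2-3 = easy negatives.
sims = [0.9, 0.7, 0.1, -0.2]

for tau in (0.07, 0.5, 5.0):
    probs = softmax_with_temperature(sims, tau)
    print(f"tau={tau}: {[round(p, 3) for p in probs]}")
```

At a low tau almost all probability mass sits on the top-scoring candidates, so gradients are dominated by the hardest negatives; at a high tau the distribution flattens and every negative contributes about equally.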

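3. Data augmentation strategy

Contrastive learning performance depends heavily on the choice of augmentations, because the augmentations define which variations the representation should be invariant to.

Key augmentations in the vision domain:

[Strong impact]
  Random Resized Crop   -- most important; learns spatial invariance
  Color Distortion      -- prevents color-based shortcuts

[Moderate impact]
  Gaussian Blur         -- reduces reliance on texture
  Random Horizontal Flip -- left-right invariance

[Weak impact]
  Rotation              -- rotation invariance (domain-dependent)
  Solarization          -- used in BYOL's augmentation pipeline

Results attributed to Chen et al. (2020):
  Crop only: 63.8% (ImageNet linear eval)
  Crop + Color: 75.5% (+11.7 pp)
  All augmentations: 76.5%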
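Whatever augmentations are chosen, each sample needs two independently augmented views to form a positive pair. One common way to get them is a small wrapper that calls the same stochastic transform twice. The sketch below uses a toy 1-D random crop as a stand-in augmentation; the names `TwoViews` and `random_crop_1d` are hypothetical and exist only for this example.

```python
import random

class TwoViews:
    """Apply the same stochastic transform twice to produce a positive pair."""
    def __init__(self, transform):
        self.transform = transform

    def __call__(self, x):
        # Each call draws fresh random parameters, so the views differ.
        return self.transform(x), self.transform(x)

def random_crop_1d(seq, size=4):
    """Stand-in augmentation: a random contiguous crop of a sequence."""
    start = random.randint(0, len(seq) - size)
    return seq[start:start + size]

random.seed(0)
make_views = TwoViews(random_crop_1d)
view_1, view_2 = make_views(list(range(10)))
print(view_1, view_2)
```

With torchvision, the same wrapper could be passed as a dataset's `transform` around an image pipeline such as the `simclr_transform` in the Vision section, so that each item arrives as a pair of views.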

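Comparison of major frameworks

Contrastive methods (use negative pairs)

Framework	Year	Core mechanism	Source of negatives	Key result
SimCLR	2020	Large batches + projection head	Other samples in the same batch	ImageNet top-1 76.5% (linear eval, ResNet-50 4x)
MoCo v1/v2	2020	Momentum encoder + queue	Memory queue (65,536 keys)	Number of negatives decoupled from batch size; memory-efficient
SimCLR v2	2020	Larger models + semi-supervised fine-tuning	Same batch	ImageNet top-1 80.9% (fine-tuned on 10% of labels)
MoCo v3	2021	ViT backbone, memory queue removed	In-batch	Pioneered self-supervised ViT training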
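The queue mechanism in the MoCo row can be sketched as a fixed-size FIFO buffer of past key embeddings that supplies negatives independent of the batch size. This is a simplified illustration rather than the reference implementation: the class `KeyQueue` and the function `moco_loss` are names made up here, the queue is tiny, and the encoders are replaced by random tensors.

```python
import torch
import torch.nn.functional as F

class KeyQueue:
    """Fixed-size FIFO buffer of past key embeddings (hypothetical helper)."""
    def __init__(self, dim, size):
        self.keys = F.normalize(torch.randn(size, dim), dim=1)  # random initial keys
        self.ptr = 0
        self.size = size

    @torch.no_grad()
    def enqueue(self, new_keys):
        # Overwrite the oldest entries; assumes size is a multiple of the batch size.
        n = new_keys.shape[0]
        self.keys[self.ptr:self.ptr + n] = new_keys
        self.ptr = (self.ptr + n) % self.size

def moco_loss(q, k, queue, temperature=0.07):
    q = F.normalize(q, dim=1)
    k = F.normalize(k, dim=1).detach()                 # no gradient through the keys
    l_pos = (q * k).sum(dim=1, keepdim=True)           # (N, 1) positive logits
    l_neg = q @ queue.keys.T                           # (N, K) negatives from the queue
    logits = torch.cat([l_pos, l_neg], dim=1) / temperature
    targets = torch.zeros(q.shape[0], dtype=torch.long, device=q.device)  # positive = column 0
    return F.cross_entropy(logits, targets)

torch.manual_seed(0)
queue = KeyQueue(dim=32, size=64)
q, k = torch.randn(8, 32), torch.randn(8, 32)   # stand-ins for encoder outputs
loss = moco_loss(q, k, queue)
queue.enqueue(F.normalize(k, dim=1))            # current keys become future negatives
```

Because the negatives come from the buffer, the batch can stay small while the softmax still sees many candidates.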

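Non-contrastive methods (no negative pairs required)

Framework	Year	Core mechanism	Collapse prevention	Key result
BYOL	2020	Momentum target + predictor	Momentum (EMA) target encoder with stop-gradient	Outperforms SimCLR without negatives
SimSiam	2021	Stop-gradient	Stop-gradient alone suffices (no negatives or momentum encoder)	Simplest architecture
Barlow Twins	2021	Cross-correlation matrix	Redundancy reduction	Information-theoretic motivation
VICReg	2022	Variance + invariance + covariance	Three explicit regularization terms	Most interpretable objective
DINO/DINOv2	2021/2023	Self-distillation + centering	Centering + sharpening	Strong self-supervised ViT results; general-purpose features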
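The stop-gradient idea shared by BYOL and SimSiam can be illustrated with the regression loss between a predictor output and a detached target. For L2-normalized vectors the squared distance reduces to 2 - 2 * cosine similarity. The sketch below is a minimal version with random tensors in place of real network outputs; the function name `byol_regression_loss` is chosen here for illustration.

```python
import torch
import torch.nn.functional as F

def byol_regression_loss(p_online, z_target):
    """Mean of 2 - 2*cos(p, stopgrad(z)): MSE between L2-normalized vectors."""
    p = F.normalize(p_online, dim=1)
    z = F.normalize(z_target.detach(), dim=1)    # stop-gradient on the target branch
    return (2 - 2 * (p * z).sum(dim=1)).mean()

torch.manual_seed(0)
p = torch.randn(4, 16, requires_grad=True)   # predictor output (online branch)
z = torch.randn(4, 16, requires_grad=True)   # projection from the target branch

loss = byol_regression_loss(p, z)
loss.backward()

print("online grad is set:", p.grad is not None)
print("target grad is set:", z.grad is not None)
```

Only the online branch receives a gradient; the target branch is treated as a constant for this step, which is the asymmetry these methods rely on to avoid collapse.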

Architecture comparison

SimCLR:
  x --> aug1 --> encoder --> projector --> z_i --+
  x --> aug2 --> encoder --> projector --> z_j --+--> InfoNCE Loss
  (other samples in the same batch serve as negatives)

MoCo:
  x --> aug1 --> encoder  --> projector --> q --+
  x --> aug2 --> momentum_encoder --> k --------+--> InfoNCE Loss
  (momentum encoder: theta_k = m * theta_k + (1-m) * theta_q)
  (queue: stores keys from previous batches to enlarge the negative pool)

BYOL:
  x --> aug1 --> online_encoder  --> predictor --> p --+
  x --> aug2 --> target_encoder  --> z ----------------+--> MSE Loss
  (target = EMA of online, stop-gradient on target)
  (no negative pairs)

VICReg:
  x --> aug1 --> encoder --> projector --> z_i --+
  x --> aug2 --> encoder --> projector --> z_j --+--> L_inv + L_var + L_cov
  L_inv: invariance (minimize the distance between positive pairs)
  L_var: variance (keep each dimension's variance above a threshold; prevents collapse)
  L_cov: covariance (decorrelate dimensions; prevents dimensional collapse)
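The three VICReg terms listed above can be written out in a few lines. The following is a sketch under stated assumptions, not the official implementation: the weights 25/25/1 follow the defaults reported in the VICReg paper, while the function name, the `eps` value, and the toy inputs are choices made for this example.

```python
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, w_inv=25.0, w_var=25.0, w_cov=1.0, eps=1e-4):
    n, d = z_a.shape

    # Invariance: mean squared distance between the two views.
    inv = F.mse_loss(z_a, z_b)

    # Variance: hinge that pushes each dimension's std to at least 1.
    def var_term(z):
        std = torch.sqrt(z.var(dim=0) + eps)
        return F.relu(1.0 - std).mean()

    # Covariance: penalize squared off-diagonal entries of the covariance matrix.
    def cov_term(z):
        z = z - z.mean(dim=0)
        cov = (z.T @ z) / (n - 1)
        off_diag = cov - torch.diag(torch.diag(cov))
        return off_diag.pow(2).sum() / d

    var = var_term(z_a) + var_term(z_b)
    cov = cov_term(z_a) + cov_term(z_b)
    total = w_inv * inv + w_var * var + w_cov * cov
    return total, (inv, var, cov)

torch.manual_seed(0)
healthy = torch.randn(256, 8)
_, (inv_h, var_h, cov_h) = vicreg_loss(healthy, healthy + 0.1 * torch.randn(256, 8))

collapsed = torch.ones(256, 8)   # every sample maps to the same point
_, (inv_c, var_c, cov_c) = vicreg_loss(collapsed, collapsed)

print(f"healthy:   inv={inv_h.item():.3f} var={var_h.item():.3f} cov={cov_h.item():.3f}")
print(f"collapsed: inv={inv_c.item():.3f} var={var_c.item():.3f} cov={cov_c.item():.3f}")
```

The collapsed batch has zero invariance loss, which is exactly why invariance alone is not enough: the variance term is what makes the trivial constant solution expensive.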

The dimensional collapse problem

A key failure mode of contrastive learning is dimensional collapse: the representations fail to use the full embedding space and concentrate in a low-dimensional subspace.

Healthy space:               Dimensional collapse:
  d1                         d1
  ^    . . .                 ^    . . . . .
  |   . . . .                |    . . . . .
  |  . . . . .               |
  | . . . . . .              |
  +-----------> d2           +-----------> d2
  (all dimensions used)      (spread along a single axis)
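One way to detect the situation pictured above is to inspect the singular value spectrum of a batch of embeddings. The sketch below computes an entropy-based effective rank, which is one common summary of that spectrum. The function name `effective_rank` and the synthetic data are assumptions for illustration only.

```python
import torch

def effective_rank(z, eps=1e-12):
    """exp(entropy) of the normalized singular values of centered embeddings."""
    z = z - z.mean(dim=0)
    s = torch.linalg.svdvals(z)             # singular values
    p = s / (s.sum() + eps)                 # normalize into a distribution
    entropy = -(p * torch.log(p + eps)).sum()
    return torch.exp(entropy).item()

torch.manual_seed(0)
healthy = torch.randn(512, 16)                          # variance in all 16 dimensions
collapsed = torch.randn(512, 2) @ torch.randn(2, 16)    # confined to a 2-D subspace

print(f"healthy:   {effective_rank(healthy):.2f}")
print(f"collapsed: {effective_rank(collapsed):.2f}")
```

An effective rank far below the embedding dimension suggests that many dimensions carry little or no variance.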

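Prevention strategy	Framework	Mechanism
Large batch + negatives	SimCLR	Many negatives encourage a uniform distribution
Momentum encoder	MoCo, BYOL	Target representations are updated slowly
Stop-gradient	SimSiam, BYOL	Gradients are blocked on one branch
Variance regularization	VICReg	Explicitly maintains the variance of each dimension
Whitening	W-MSE	Whitens representations to remove correlations
Centering + Sharpening	DINO	Mean removal + entropy minimization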
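The momentum-encoder row corresponds to an exponential moving average (EMA) of the online weights, theta_target = m * theta_target + (1 - m) * theta_online. The sketch below applies it to a tiny `nn.Linear` standing in for an encoder; the helper name `ema_update` and the value m = 0.99 are illustrative assumptions.

```python
import copy
import torch
import torch.nn as nn

@torch.no_grad()
def ema_update(online, target, m=0.99):
    """theta_target <- m * theta_target + (1 - m) * theta_online"""
    for p_o, p_t in zip(online.parameters(), target.parameters()):
        p_t.mul_(m).add_(p_o, alpha=1.0 - m)

torch.manual_seed(0)
online = nn.Linear(4, 2)
target = copy.deepcopy(online)          # target starts as a copy of the online network
for p in target.parameters():
    p.requires_grad_(False)             # target is never updated by gradients

# Pretend an optimizer step shifted every online weight by +1.
with torch.no_grad():
    online.weight.add_(1.0)

before = target.weight.clone()
ema_update(online, target, m=0.99)
print((target.weight - before).mean().item())   # each target weight moves by about 0.01
```

With m close to 1 the target changes slowly, which keeps the regression or contrastive targets stable from step to step.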

Applications by domain

Vision

# Full SimCLR pipeline (PyTorch)
import torch
import torch.nn as nn
import torchvision.transforms as T
from torchvision.models import resnet50

class SimCLR(nn.Module):
    def __init__(self, base_encoder=resnet50, projection_dim=128):
        super().__init__()
        # weights=None trains from scratch (replaces the deprecated pretrained=False)
        self.encoder = base_encoder(weights=None)
        dim = self.encoder.fc.in_features  # 2048 for ResNet-50
        self.encoder.fc = nn.Identity()

        # Projection head (used during training, discarded for evaluation)
        self.projector = nn.Sequential(
            nn.Linear(dim, dim),
            nn.ReLU(),
            nn.Linear(dim, projection_dim)
        )

    def forward(self, x):
        h = self.encoder(x)       # representation (2048-d)
        z = self.projector(h)      # projection (128-d)
        return h, z

# Augmentation pipeline (expects a PIL image, returns one augmented tensor)
simclr_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.2, 1.0)),
    T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
    T.RandomGrayscale(p=0.2),
    T.RandomApply([T.GaussianBlur(kernel_size=23)], p=0.5),
    T.RandomHorizontalFlip(),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406],
                std=[0.229, 0.224, 0.225])
])

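# Training loop
# simclr_transform works on a single PIL image, so the two views must be
# created per sample inside the dataset, not on an already-batched tensor.
# Assumes `dataloader` wraps a dataset whose transform returns both views, e.g.
#   transform=lambda img: (simclr_transform(img), simclr_transform(img))
# so that each batch arrives as ((x_i, x_j), labels).
model = SimCLR()
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-6)

model.train()
for (x_i, x_j), _ in dataloader:  # labels are not used
    _, z_i = model(x_i)
    _, z_j = model(x_j)

    loss = info_nce_loss(z_i, z_j, temperature=0.5)

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()

# Evaluation: linear probing (freeze the encoder, train a linear classifier)
model.encoder.eval()
for p in model.encoder.parameters():
    p.requires_grad = False
# num_classes: number of classes in the downstream task (set by the user)
classifier = nn.Linear(2048, num_classes)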

NLP -- SimCSE

# SimCSE: unsupervised (dropout serves as the augmentation)
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class SimCSE(nn.Module):
    def __init__(self, model_name='bert-base-uncased'):
        super().__init__()
        self.encoder = AutoModel.from_pretrained(model_name)

    def forward(self, input_ids, attention_mask):
        # Two forward passes over the same input: in train mode, independent
        # dropout masks produce two different views
        outputs1 = self.encoder(input_ids, attention_mask=attention_mask)
        outputs2 = self.encoder(input_ids, attention_mask=attention_mask)

        z1 = outputs1.last_hidden_state[:, 0]  # [CLS] token
        z2 = outputs2.last_hidden_state[:, 0]
        return z1, z2

# Training: the two dropout views of a sentence form a positive pair
# Other sentences in the batch serve as negatives
# Result: on STS benchmarks, comparable to earlier supervised baselines
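The training objective described in the comments above can be sketched as a cross-entropy over the in-batch similarity matrix, with the diagonal entries as positives. In this sketch the temperature 0.05 follows the default reported in the SimCSE paper, the function name `simcse_loss` is introduced for illustration, and random tensors with small noise stand in for the two dropout passes.

```python
import torch
import torch.nn.functional as F

def simcse_loss(z1, z2, temperature=0.05):
    """Cross-entropy over in-batch cosine similarities; diagonal = positives."""
    z1, z2 = F.normalize(z1, dim=1), F.normalize(z2, dim=1)
    sim = z1 @ z2.T / temperature                       # (N, N)
    targets = torch.arange(z1.shape[0], device=z1.device)
    return F.cross_entropy(sim, targets)

torch.manual_seed(0)
z1 = torch.randn(8, 32)
z2 = z1 + 0.01 * torch.randn(8, 32)      # stand-in for a second dropout pass

loss_matched = simcse_loss(z1, z2).item()
loss_shuffled = simcse_loss(z1, z2.roll(1, dims=0)).item()   # positives misaligned

print(f"matched:  {loss_matched:.4f}")
print(f"shuffled: {loss_shuffled:.4f}")
```

The loss is low only when row i of the first pass matches row i of the second, which is what ties the two dropout views of a sentence together.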

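Time series -- SoftCLT (ICLR 2024)

# SoftCLT: uses temporal proximity as a soft contrastive weight
# (simplified sketch; the exponential kernel is illustrative and may differ
#  from the paper's exact weighting)
import torch
import torch.nn.functional as F

def soft_contrastive_loss(z_i, z_j, timestamps_i, timestamps_j, tau=0.5, sigma=1.0):
    """
    Standard contrastive: binary positive/negative
    SoftCLT: continuous weights based on temporal distance

    Close timestamps   --> high positive weight
    Distant timestamps --> low positive weight (not a hard negative)
    sigma: bandwidth of the temporal kernel (illustrative default)
    """
    batch_size = z_i.shape[0]

    # Soft labels from time differences
    time_diff = torch.abs(timestamps_i.unsqueeze(1) - timestamps_j.unsqueeze(0)).float()
    soft_labels = torch.exp(-time_diff / sigma)  # weight decays with distance

    # Pairwise cosine similarity, scaled by temperature
    sim = F.cosine_similarity(z_i.unsqueeze(1), z_j.unsqueeze(0), dim=-1) / tau

    # Soft cross-entropy
    loss = -torch.sum(soft_labels * F.log_softmax(sim, dim=1)) / batch_size
    return loss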

Multimodal -- CLIP

CLIP (Contrastive Language-Image Pre-training):

  Image --> Image Encoder (ViT) --> I_1, I_2, ..., I_N
  Text  --> Text Encoder (Transformer) --> T_1, T_2, ..., T_N

  Similarity matrix (N x N):
         T_1    T_2    T_3   ... T_N
  I_1  [match]  neg    neg   ... neg
  I_2   neg   [match]  neg   ... neg
  I_3   neg    neg   [match] ... neg
  ...
  I_N   neg    neg    neg   ...[match]

  Diagonal = positive pairs (matched image-text)
  Off-diagonal = negative pairs

  Loss: image-to-text CE + text-to-image CE (symmetric)
  Training data: 400 million image-text pairs (web-crawled)
  Result: zero-shot ImageNet top-1 76.2% (on par with a supervised ResNet-50)
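The symmetric loss over the N x N matrix above can be sketched as two cross-entropies, one along rows and one along columns. This is a minimal illustration, not CLIP's training code: CLIP learns its logit scale, whereas here it is fixed at 1/0.07 for simplicity, and the function name `clip_loss` and the random embeddings are assumptions.

```python
import torch
import torch.nn.functional as F

def clip_loss(image_emb, text_emb, logit_scale=1 / 0.07):
    image_emb = F.normalize(image_emb, dim=1)
    text_emb = F.normalize(text_emb, dim=1)
    logits = logit_scale * image_emb @ text_emb.T     # (N, N); rows = images, cols = texts
    targets = torch.arange(logits.shape[0], device=logits.device)  # diagonal = matches
    loss_i2t = F.cross_entropy(logits, targets)       # each image picks its text
    loss_t2i = F.cross_entropy(logits.T, targets)     # each text picks its image
    return (loss_i2t + loss_t2i) / 2

torch.manual_seed(0)
img = torch.randn(8, 64)
txt = img + 0.05 * torch.randn(8, 64)     # stand-in for well-aligned text embeddings

loss_matched = clip_loss(img, txt).item()
loss_mismatched = clip_loss(img, txt.roll(1, dims=0)).item()   # captions shifted by one

print(f"matched:    {loss_matched:.4f}")
print(f"mismatched: {loss_mismatched:.4f}")
```

Averaging both directions ensures that images retrieve their captions and captions retrieve their images under the same similarity matrix.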

Key design decision guide

Q: Do you have labels?
  |
  +-- Yes --> Supervised Contrastive Loss (same class = positive)
  |           SupCon (Khosla et al., NeurIPS 2020)
  |
  +-- No --> Self-Supervised Contrastive
              |
              Q: Is there enough GPU memory?
              |
              +-- Yes (large batches feasible) --> SimCLR (simple, effective)
              |
              +-- No (small batches) --> MoCo (memory queue supplies negatives)
              |                      --> BYOL (no negatives needed)
              |
              Q: Is implementation simplicity important?
              |
              +-- Yes --> VICReg (three interpretable loss terms)
              |       --> SimSiam (simplest)
              |
              +-- No --> DINO/DINOv2 (strongest results, ViT-based)
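For the labeled branch of the guide, the idea that "same class = positive" can be sketched as follows. This is a simplified single-view version in the spirit of SupCon, not the authors' implementation: the function name `supcon_loss`, the temperature of 0.1, and the toy 2-D embeddings are assumptions for illustration.

```python
import torch
import torch.nn.functional as F

def supcon_loss(z, labels, temperature=0.1):
    """Supervised contrastive loss: all same-label samples act as positives."""
    z = F.normalize(z, dim=1)
    n = z.shape[0]
    sim = z @ z.T / temperature
    self_mask = torch.eye(n, dtype=torch.bool, device=z.device)
    sim = sim.masked_fill(self_mask, float("-inf"))        # exclude self-pairs
    log_prob = F.log_softmax(sim, dim=1)

    pos_mask = (labels.unsqueeze(0) == labels.unsqueeze(1)) & ~self_mask
    pos_count = pos_mask.sum(dim=1)
    # Sum log-probabilities over each anchor's positives (zeros elsewhere).
    sum_log_prob = log_prob.masked_fill(~pos_mask, 0.0).sum(dim=1)
    valid = pos_count > 0                                   # anchors with >= 1 positive
    return -(sum_log_prob[valid] / pos_count[valid]).mean()

# Two tight clusters in 2-D.
z = torch.tensor([[1.0, 0.0], [1.0, 0.1], [0.0, 1.0], [0.1, 1.0]])

loss_consistent = supcon_loss(z, torch.tensor([0, 0, 1, 1])).item()    # labels match clusters
loss_inconsistent = supcon_loss(z, torch.tensor([0, 1, 0, 1])).item()  # labels cut across clusters

print(f"consistent:   {loss_consistent:.4f}")
print(f"inconsistent: {loss_inconsistent:.4f}")
```

The loss is small when samples sharing a label already sit close together and large when the labels disagree with the geometry.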

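Limitations and recent trends

Known limitations

Limitation	Description	Mitigation
Augmentation dependence	Augmentations must be designed per domain	Learned augmentation policies (e.g., AutoAugment)
Large batch/memory	SimCLR: batch size 4096+; MoCo: queue of 65,536	Non-contrastive methods (BYOL, VICReg)
Feature suppression	Only a subset of task-relevant features is learned	Multi-crop, asymmetric augmentation
Semantic alignment	Semantically similar samples are treated as negatives (false negatives)	Supervised contrastive, nearest-neighbor mining
Dimensional collapse	Representations concentrate in a low-dimensional subspace	Whitening, variance regularization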

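Key trends, 2024-2025

DINOv2 (Meta, 2023-2024): Pretrained on 142M curated images and established as a general-purpose visual feature extractor. Reports strong results across downstream tasks such as ImageNet classification, segmentation, and depth estimation.
Soft Contrastive Learning (ICLR 2024): Replaces binary positive/negative labels with continuous soft labels in the time series domain. Reflecting temporal proximity yields finer-grained representations.
Dimensional collapse theory (NeurIPS 2024): Characterizes the theoretical conditions under which dimensional collapse arises under the InfoNCE loss and proposes a whitening-based remedy.
Integration with foundation models: CLIP-style multimodal contrastive learning has become a standard way to pretrain vision encoders for VLMs (e.g., LLaVA; the vision components of closed models such as GPT-4V are not publicly documented).
Extension to tabular/graph domains: Active expansion into non-traditional domains, e.g., SCARF (self-supervised learning for tabular data) and GraphCL (graph contrastive learning).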

References

Resource	Type	Link
SimCLR original paper	ICML 2020	arxiv.org/abs/2002.05709
MoCo original paper	CVPR 2020	arxiv.org/abs/1911.05722
BYOL original paper	NeurIPS 2020	arxiv.org/abs/2006.07733
InfoNCE / CPC	arXiv 2018	arxiv.org/abs/1807.03748
VICReg	ICLR 2022	arxiv.org/abs/2105.04906
CLIP	ICML 2021	arxiv.org/abs/2103.00020
SimCSE	EMNLP 2021	arxiv.org/abs/2104.08821
SupCon	NeurIPS 2020	arxiv.org/abs/2004.11362
DINOv2	arXiv 2023	arxiv.org/abs/2304.07193
SoftCLT	ICLR 2024	arxiv.org/abs/2312.16424
Lilian Weng's blog	Tutorial	lilianweng.github.io/posts/2021-05-31-contrastive/
Dimensional Collapse in SSL	NeurIPS 2024	proceedings.neurips.cc/paper/2024/ad7922fd