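Computer Vision Overview¶

Computer vision is the field of artificial intelligence concerned with extracting and interpreting meaningful information from digital images and video. It covers tasks such as image classification, object detection, segmentation, and image generation.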

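Core Concepts¶

Image Representations¶

Representation	Shape	Notes
RGB	(H, W, 3)	Most common format
Grayscale	(H, W, 1)	Color information removed
HSV	(H, W, 3)	Hue / saturation / value
Feature Map	(H', W', C)	Intermediate CNN output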
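
The shapes in the table can be checked with a small NumPy sketch. The random image and the BT.601 luma weights (0.299, 0.587, 0.114) are assumptions chosen for illustration; libraries may use a different grayscale conversion.

```python
import numpy as np

# Hypothetical 4x6 RGB image with values in [0, 1], shape (H, W, 3)
rng = np.random.default_rng(0)
rgb = rng.random((4, 6, 3))

# Grayscale as a weighted sum of the channels (ITU-R BT.601 luma weights),
# keeping a trailing channel axis so the shape is (H, W, 1)
weights = np.array([0.299, 0.587, 0.114])
gray = (rgb @ weights)[..., np.newaxis]

print(rgb.shape)   # (4, 6, 3)
print(gray.shape)  # (4, 6, 1)
```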

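Data Augmentation¶

import albumentations as A
from albumentations.pytorch import ToTensorV2

transform = A.Compose([
    # Random crop resized to 224x224; recent albumentations releases take size=(h, w),
    # older releases take height=224, width=224 instead
    A.RandomResizedCrop(size=(224, 224), scale=(0.8, 1.0)),
    A.HorizontalFlip(p=0.5),  # mirror left-right half of the time
    A.ColorJitter(brightness=0.2, contrast=0.2, saturation=0.2, hue=0.1),
    A.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),  # ImageNet statistics
    ToTensorV2(),  # HWC NumPy array -> CHW torch tensor
])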
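
To make two of these steps concrete, horizontal flipping and mean/std normalization can be sketched in plain NumPy. This is an illustrative re-implementation under the assumption of a uint8 (H, W, C) input; it is not how albumentations is implemented internally, and the helper names `hflip` and `normalize` are hypothetical.

```python
import numpy as np

def hflip(image):
    """Flip an (H, W, C) image left to right."""
    return image[:, ::-1, :]

def normalize(image, mean, std):
    """Scale uint8 pixels to [0, 1], then standardize each channel."""
    scaled = image.astype(np.float32) / 255.0
    mean = np.array(mean, dtype=np.float32)
    std = np.array(std, dtype=np.float32)
    return (scaled - mean) / std

# Tiny 2x3 RGB image with distinct pixel values
image = np.arange(2 * 3 * 3, dtype=np.uint8).reshape(2, 3, 3)
flipped = hflip(image)
normalized = normalize(flipped, mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
print(normalized.shape, normalized.dtype)
```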

Algorithm Taxonomy¶

Computer Vision Tasks
├── Image Classification
│   ├── CNN: VGG, ResNet, EfficientNet
│   └── ViT: Vision Transformer, DeiT, Swin
├── Object Detection
│   ├── Two-stage: R-CNN, Fast/Faster R-CNN
│   ├── One-stage: YOLO, SSD, RetinaNet
│   └── Anchor-free: FCOS, CenterNet
├── Semantic Segmentation
│   ├── FCN, U-Net, DeepLab
│   └── Transformer: SegFormer, Mask2Former
├── Instance Segmentation
│   ├── Mask R-CNN
│   └── SOLO, YOLACT
├── Pose Estimation
│   ├── OpenPose, HRNet
│   └── MediaPipe
├── Image Generation
│   ├── GAN: StyleGAN, BigGAN
│   ├── VAE: VQ-VAE
│   └── Diffusion: Stable Diffusion, DALL-E 2
└── Video Understanding
    ├── Action Recognition
    └── Video Object Tracking

Image Classification¶

Evolution of CNN Architectures¶

Model	Year	Top-1 (ImageNet)	Key idea
AlexNet	2012	63.3%	Start of the deep CNN era
VGG	2014	74.5%	3x3 filters, deeper network
GoogLeNet	2014	74.8%	Inception module
ResNet	2015	78.6%	Skip connection
DenseNet	2017	79.2%	Dense connection
EfficientNet	2019	84.4%	Compound scaling
ViT	2020	88.5%*	Transformer
Swin	2021	87.3%	Shifted window
ConvNeXt	2022	87.8%	Modernized CNN design

* Reported with pretraining on the larger JFT-300M dataset.

ResNet (Residual Network)¶

\[\mathbf{y} = \mathcal{F}(\mathbf{x}, \{W_i\}) + \mathbf{x}\]

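The skip connection adds the input x back to the learned residual F(x), which gives gradients a direct path through the network and makes very deep networks trainable.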
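
A minimal NumPy sketch of the residual formula follows. It assumes two fully connected layers with a ReLU in place of the convolutions used in the actual ResNet, and the function name `residual_block` is hypothetical.

```python
import numpy as np

def residual_block(x, w1, w2):
    """Compute y = F(x, {W1, W2}) + x with F(x) = relu(x W1) W2."""
    hidden = np.maximum(0.0, x @ w1)   # first layer followed by ReLU
    residual = hidden @ w2             # second layer: the residual F(x)
    return residual + x                # skip connection adds the input back

rng = np.random.default_rng(0)
x = rng.normal(size=(2, 8))            # batch of 2 feature vectors, dimension 8

# With zero weights F(x) = 0, so the block reduces to the identity mapping
zeros = np.zeros((8, 8))
y_identity = residual_block(x, zeros, zeros)

w1 = rng.normal(scale=0.1, size=(8, 8))
w2 = rng.normal(scale=0.1, size=(8, 8))
y = residual_block(x, w1, w2)
print(np.allclose(y_identity, x))      # True
```

The identity case illustrates why stacking such blocks is safe: a block that learns nothing still passes its input through unchanged.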

Vision Transformer (ViT)¶

ViT splits the image into fixed-size patches and applies a Transformer to the resulting sequence:

Image → Patches → Linear Projection → Transformer Encoder → Classification
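
The first two stages of this pipeline can be sketched in NumPy. The 224x224 input, the 16x16 patch size, and the 768-dimensional embedding are assumed to match a ViT-Base/16 configuration; the random projection matrix stands in for learned weights, and the class token and position embeddings are omitted.

```python
import numpy as np

def patchify(image, patch_size):
    """Split an (H, W, C) image into flattened non-overlapping patches.

    Returns an array of shape (num_patches, patch_size * patch_size * C).
    """
    h, w, c = image.shape
    assert h % patch_size == 0 and w % patch_size == 0
    gh, gw = h // patch_size, w // patch_size
    patches = image.reshape(gh, patch_size, gw, patch_size, c)
    patches = patches.transpose(0, 2, 1, 3, 4)            # (gh, gw, p, p, c)
    return patches.reshape(gh * gw, patch_size * patch_size * c)

rng = np.random.default_rng(0)
image = rng.random((224, 224, 3))
patches = patchify(image, patch_size=16)                  # (196, 768)

# Linear projection to the embedding dimension (random weights as a stand-in)
embed_dim = 768
projection = rng.normal(scale=0.02, size=(patches.shape[1], embed_dim))
tokens = patches @ projection                             # (196, 768)
print(patches.shape, tokens.shape)
```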

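import timm
import torch

# Load a pretrained ViT; num_classes=10 replaces the classification head with a
# freshly initialized 10-class layer, so fine-tune before trusting predictions
model = timm.create_model('vit_base_patch16_224', pretrained=True, num_classes=10)
model.eval()  # disable dropout and other train-only behavior

# Placeholder input: a batch of one normalized 224x224 RGB image
image_tensor = torch.randn(1, 3, 224, 224)

# Inference
with torch.no_grad():
    output = model(image_tensor)   # logits, shape (1, 10)
    pred = output.argmax(dim=1)    # predicted class index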

References:
- He, K. et al. (2016). "Deep Residual Learning for Image Recognition". CVPR.
- Dosovitskiy, A. et al. (2020). "An Image is Worth 16x16 Words". ICLR.
- Liu, Z. et al. (2021). "Swin Transformer". ICCV.

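Object Detection¶

YOLO (You Only Look Once)¶

A single-shot detector designed for real-time detection:

Version	Year	Notes
YOLOv1	2015	First single-shot YOLO detector
YOLOv3	2018	Multi-scale prediction
YOLOv5	2020	PyTorch implementation, practical
YOLOv8	2023	Ultralytics, state of the art at release
YOLO11	2024	Newest version in this list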

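from ultralytics import YOLO

# Load a pretrained detection model
model = YOLO('yolov8n.pt')

# Inference (returns a list with one Results object per image)
results = model('image.jpg')

# Visualize and inspect the results
annotated = results[0].plot()  # annotated image as a NumPy array (BGR)
boxes = results[0].boxes       # bounding boxes
masks = results[0].masks       # segmentation masks; None for detection-only models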

Faster R-CNN¶

Two-stage detector:
1. RPN (Region Proposal Network): generates candidate regions.
2. RoI Pooling + Classification: classifies each region and regresses its bounding box.

References:
- Redmon, J. et al. (2016). "You Only Look Once". CVPR.
- Ren, S. et al. (2015). "Faster R-CNN". NeurIPS.

Segmentation¶

Semantic Segmentation¶

Assigns a class label to every pixel.

U-Net: an encoder-decoder architecture with skip connections between matching encoder and decoder stages.
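
The U-Net skip connection can be illustrated as a shape-level NumPy sketch. The channel counts and resolutions are made-up values, and nearest-neighbor upsampling is assumed here in place of the learned up-convolution of the original architecture.

```python
import numpy as np

def upsample2x(feat):
    """Nearest-neighbor 2x upsampling of a (C, H, W) feature map."""
    return feat.repeat(2, axis=1).repeat(2, axis=2)

rng = np.random.default_rng(0)
encoder_feat = rng.random((64, 32, 32))    # high-resolution encoder output
decoder_feat = rng.random((128, 16, 16))   # low-resolution decoder input

# Upsample the decoder features, then concatenate the encoder features
# along the channel axis (the skip connection)
upsampled = upsample2x(decoder_feat)                          # (128, 32, 32)
merged = np.concatenate([encoder_feat, upsampled], axis=0)    # (192, 32, 32)
print(merged.shape)
```

Concatenation lets the decoder combine coarse semantic features with the fine spatial detail kept by the encoder.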

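import segmentation_models_pytorch as smp

num_classes = 3  # placeholder: set to the number of classes in your dataset

model = smp.Unet(
    encoder_name="resnet34",      # backbone used as the encoder
    encoder_weights="imagenet",   # initialize the encoder with ImageNet weights
    in_channels=3,                # RGB input
    classes=num_classes           # one output channel per class
)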

Instance Segmentation¶

Predicts a separate mask and class label for each object instance.

Mask R-CNN: Faster R-CNN plus a mask prediction branch.

SAM (Segment Anything Model, Meta 2023): a promptable, general-purpose segmentation model.

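import cv2
import numpy as np
from segment_anything import SamPredictor, sam_model_registry

# Load the ViT-H variant from a local checkpoint file
sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h.pth")
predictor = SamPredictor(sam)

# SamPredictor expects an RGB uint8 image of shape (H, W, 3)
image = cv2.cvtColor(cv2.imread("image.jpg"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# Prompt with one point given as (x, y); label 1 = foreground, 0 = background
masks, scores, _ = predictor.predict(
    point_coords=np.array([[500, 375]]),
    point_labels=np.array([1])
)

References:
- Ronneberger, O. et al. (2015). "U-Net". MICCAI.
- He, K. et al. (2017). "Mask R-CNN". ICCV.
- Kirillov, A. et al. (2023). "Segment Anything". ICCV.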

Image Generation¶

Diffusion Models¶

Diffusion models learn to reverse a gradual noising process:

Forward process: $$q(x_t | x_{t-1}) = \mathcal{N}(x_t; \sqrt{1-\beta_t}x_{t-1}, \beta_t I)$$

Reverse process: $$p_\theta(x_{t-1} | x_t) = \mathcal{N}(x_{t-1}; \mu_\theta(x_t, t), \Sigma_\theta(x_t, t))$$
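
The forward process can be simulated directly from its definition. The sketch below assumes a linear schedule of 1000 steps with beta from 1e-4 to 0.02 and a constant toy signal in place of an image; both choices are illustrative, and `forward_step` is a hypothetical helper name.

```python
import numpy as np

def forward_step(x_prev, beta_t, rng):
    """Sample x_t ~ N(sqrt(1 - beta_t) * x_{t-1}, beta_t * I)."""
    noise = rng.standard_normal(x_prev.shape)
    return np.sqrt(1.0 - beta_t) * x_prev + np.sqrt(beta_t) * noise

rng = np.random.default_rng(0)
x = np.ones((10000,))                     # toy "image": a constant signal x_0
betas = np.linspace(1e-4, 0.02, 1000)     # linear noise schedule (an assumption)

for beta_t in betas:
    x = forward_step(x, beta_t, rng)

# After many steps the signal is almost destroyed: mean near 0, variance near 1
print(x.mean(), x.var())
```

The reverse process is what the network is trained for: starting from such near-Gaussian noise, it removes the noise one step at a time.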

import torch
from diffusers import StableDiffusionPipeline

# Load the pipeline in half precision and move it to the GPU (requires CUDA)
pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16
).to("cuda")

# Generate one image from a text prompt; the result is a PIL image
image = pipe("a photo of an astronaut riding a horse").images[0]

References:
- Ho, J. et al. (2020). "Denoising Diffusion Probabilistic Models". NeurIPS.
- Rombach, R. et al. (2022). "High-Resolution Image Synthesis with Latent Diffusion Models". CVPR.

Multimodal (Vision-Language)¶

CLIP (Contrastive Language-Image Pre-training)¶

CLIP embeds images and text into a shared space:

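import clip
import torch
from PIL import Image

model, preprocess = clip.load("ViT-B/32", device="cuda")

image = preprocess(Image.open("image.jpg")).unsqueeze(0).to("cuda")
text = clip.tokenize(["a cat", "a dog", "a bird"]).to("cuda")

with torch.no_grad():
    image_features = model.encode_image(image)
    text_features = model.encode_text(text)

    # Normalize so the dot product is a cosine similarity
    image_features = image_features / image_features.norm(dim=-1, keepdim=True)
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

    # Probability of each caption for the image, shape (1, 3)
    similarity = (100.0 * image_features @ text_features.T).softmax(dim=-1)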
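
The scoring step is a softmax over scaled cosine similarities, which can be sketched in NumPy. The 4-dimensional embeddings below are made up for illustration and are not CLIP outputs; `zero_shot_probs` is a hypothetical helper name.

```python
import numpy as np

def zero_shot_probs(image_features, text_features, scale=100.0):
    """Softmax over scaled cosine similarities between images and texts."""
    img = image_features / np.linalg.norm(image_features, axis=-1, keepdims=True)
    txt = text_features / np.linalg.norm(text_features, axis=-1, keepdims=True)
    logits = scale * img @ txt.T
    logits = logits - logits.max(axis=-1, keepdims=True)   # numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=-1, keepdims=True)

# Toy embeddings: one image and three captions
image_features = np.array([[1.0, 0.2, 0.0, 0.0]])
text_features = np.array([
    [0.9, 0.1, 0.0, 0.0],   # "a cat": close to the image embedding
    [0.0, 1.0, 0.0, 0.0],   # "a dog"
    [0.0, 0.0, 1.0, 0.0],   # "a bird"
])
probs = zero_shot_probs(image_features, text_features)
print(probs.argmax(axis=-1))   # [0]
```

The scale factor sharpens the distribution, so small differences in cosine similarity turn into confident predictions.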

References:
- Radford, A. et al. (2021). "Learning Transferable Visual Models From Natural Language Supervision". ICML.

Evaluation Metrics¶

Task	Metrics
Classification	Accuracy, Top-5 Accuracy
Detection	mAP@0.5, mAP@[0.5:0.95]
Segmentation	mIoU, Pixel Accuracy
Generation	FID, IS, CLIP Score
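
Several of these metrics reduce to a few lines of NumPy. The sketch below assumes boxes in (x1, y1, x2, y2) format and integer label maps; the function names are hypothetical, and mAP, FID, and IS are left out because they need a full evaluation pipeline.

```python
import numpy as np

def box_iou(a, b):
    """IoU of two boxes in (x1, y1, x2, y2) format."""
    ix1, iy1 = max(a[0], b[0]), max(a[1], b[1])
    ix2, iy2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix2 - ix1) * max(0.0, iy2 - iy1)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    return inter / (area_a + area_b - inter)

def mean_iou(pred, target, num_classes):
    """Mean IoU over classes that appear in the prediction or the target."""
    ious = []
    for c in range(num_classes):
        p, t = pred == c, target == c
        union = np.logical_or(p, t).sum()
        if union > 0:
            ious.append(np.logical_and(p, t).sum() / union)
    return float(np.mean(ious))

def top_k_accuracy(logits, labels, k):
    """Fraction of samples whose true label is among the k highest scores."""
    top_k = np.argsort(logits, axis=1)[:, -k:]
    return float(np.mean([label in row for label, row in zip(labels, top_k)]))

iou = box_iou((0, 0, 2, 2), (1, 1, 3, 3))   # overlap 1, union 7
pred = np.array([[0, 0], [1, 1]])
target = np.array([[0, 1], [1, 1]])
miou = mean_iou(pred, target, num_classes=2)
logits = np.array([[0.1, 0.7, 0.2], [0.5, 0.3, 0.2]])
labels = np.array([2, 1])
print(iou, miou, top_k_accuracy(logits, labels, k=2))
```

A detection is usually counted as correct when its IoU with a ground-truth box exceeds a threshold such as 0.5, which is what the IoU values in the mAP names refer to.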

References¶

Textbooks¶

Szeliski, R. (2022). "Computer Vision: Algorithms and Applications" (2nd ed). Springer. (freely available online)

Key Papers¶

He, K. et al. (2016). "Deep Residual Learning for Image Recognition". CVPR.
Dosovitskiy, A. et al. (2020). "An Image is Worth 16x16 Words". ICLR.
Kirillov, A. et al. (2023). "Segment Anything". ICCV.

Libraries¶

PyTorch Image Models (timm): https://github.com/huggingface/pytorch-image-models
Detectron2: https://github.com/facebookresearch/detectron2
Ultralytics: https://github.com/ultralytics/ultralytics
Hugging Face Diffusers: https://github.com/huggingface/diffusers