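Kubernetes for ML

Overview

What is Kubernetes?

Kubernetes (K8s) is an open-source orchestration platform that automates the deployment, scaling, and management of containerized applications. Google released it as open source in 2014, drawing on its experience with the internal Borg system, and it is now maintained under the CNCF (Cloud Native Computing Foundation).

Why use Kubernetes for ML workloads?

ML/AI workloads have the following characteristics, each of which Kubernetes addresses: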

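Characteristic	Description	Kubernetes answer
Resource-intensive	Needs GPUs and large amounts of memory	Dynamic resource allocation, GPU scheduling
Variable workloads	Training (temporary) vs. inference (always on)	Separate Job and Deployment resources
Needs scaling	Must keep up with growth in traffic and data	Autoscaling with HPA and Karpenter
Reproducibility	Same results in the same environment	Containers keep the environment consistent
Cost optimization	Efficient use of expensive GPUs	Spot instances, resource sharing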

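Key Features

Declarative configuration: define the desired state in YAML and Kubernetes reconciles the cluster toward it
Self-healing: failed Pods are restarted or rescheduled automatically
Service discovery: internal DNS lets services find each other
Rolling updates: supports zero-downtime deployments
Secret management: keeps sensitive settings such as API keys and model paths out of images and manifests (Secrets are only base64-encoded unless encryption at rest is enabled)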

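1. Core Concepts

1.1 Kubernetes Resource Hierarchy

[Figure: Kubernetes resource hierarchy]

Levels:

Cluster: a set of nodes, made up of control plane nodes (formerly called master nodes) and worker nodes
Namespace: a logical partition of resources, used to separate teams or environments (dev/staging/prod)
Workload Resources: the resources that actually run containers
  Deployment: long-running services (web servers, APIs, inference services)
  Job/CronJob: one-off or scheduled tasks (training, batch processing)
  StatefulSet: workloads that need stable identity or state (distributed training, databases)
Pod: the smallest unit of execution; one or more containers that share network and storage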

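1.2 ML Workload Types

Resource	Use	Characteristics
Job	Training runs	Terminates on completion
Deployment	Inference services	Always running, autoscaled
CronJob	Periodic training/evaluation	Schedule-based
StatefulSet	Distributed training	Stable network identity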

2. Pod Basics

2.1 ML Inference Pod

apiVersion: v1
kind: Pod
metadata:
  name: ml-inference
  labels:
    app: ml-model
spec:
  containers:
    - name: inference
      image: ml-model:v1.0
      ports:
        - containerPort: 8000
      resources:
        requests:
          memory: "2Gi"
          cpu: "1"
        limits:
          memory: "4Gi"
          cpu: "2"
      env:
        - name: MODEL_PATH
          value: "/models/model.pkl"
        - name: MLFLOW_TRACKING_URI
          valueFrom:
            secretKeyRef:
              name: ml-secrets
              key: mlflow-uri
      volumeMounts:
        - name: model-storage
          mountPath: /models
      livenessProbe:
        httpGet:
          path: /health
          port: 8000
        initialDelaySeconds: 30
        periodSeconds: 10
      readinessProbe:
        httpGet:
          path: /ready
          port: 8000
        initialDelaySeconds: 5
        periodSeconds: 5
  volumes:
    - name: model-storage
      persistentVolumeClaim:
        claimName: model-pvc
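
The Pod above assumes that a Secret named ml-secrets and a PersistentVolumeClaim named model-pvc already exist in the same namespace. A minimal sketch of both follows; the MLflow URL, the 10Gi size, and the ReadWriteOnce access mode are placeholder assumptions rather than values from this guide.

```yaml
apiVersion: v1
kind: Secret
metadata:
  name: ml-secrets
type: Opaque
stringData:
  # Hypothetical tracking server URL; replace with your own
  mlflow-uri: "http://mlflow.example.internal:5000"
---
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: model-pvc
spec:
  accessModes:
    - ReadWriteOnce   # assumption: a single node mounts the volume
  resources:
    requests:
      storage: 10Gi   # assumption: size it to fit your model artifacts
```

If several inference replicas on different nodes must read the same volume, a storage class that supports ReadOnlyMany or ReadWriteMany would be needed instead.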

2.2 GPU Pod

apiVersion: v1
kind: Pod
metadata:
  name: gpu-training
spec:
  containers:
    - name: training
      image: pytorch/pytorch:2.0.1-cuda11.7-cudnn8-runtime
      command: ["python", "train.py"]
      resources:
        limits:
          nvidia.com/gpu: 1  # request 1 GPU
          memory: "32Gi"
          cpu: "8"
      volumeMounts:
        - name: data
          mountPath: /data
        - name: shm  # shared memory for PyTorch DataLoader workers
          mountPath: /dev/shm
  volumes:
    - name: data
      persistentVolumeClaim:
        claimName: training-data-pvc
    - name: shm
      emptyDir:
        medium: Memory
        sizeLimit: "16Gi"
  nodeSelector:
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
  tolerations:
    - key: "nvidia.com/gpu"
      operator: "Exists"
      effect: "NoSchedule"

3. Deployment (Model Serving)

3.1 Basic Deployment

apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model
  labels:
    app: ml-model
spec:
  replicas: 3
  selector:
    matchLabels:
      app: ml-model
  template:
    metadata:
      labels:
        app: ml-model
    spec:
      containers:
        - name: model
          image: ml-model:v1.0
          ports:
            - containerPort: 8000
          resources:
            requests:
              memory: "2Gi"
              cpu: "1"
            limits:
              memory: "4Gi"
              cpu: "2"
          env:
            - name: MODEL_VERSION
              value: "v1.0"
          livenessProbe:
            httpGet:
              path: /health
              port: 8000
            initialDelaySeconds: 30
            periodSeconds: 10
          readinessProbe:
            httpGet:
              path: /ready
              port: 8000
            initialDelaySeconds: 10
            periodSeconds: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1
      maxUnavailable: 0

3.2 Service and Ingress

apiVersion: v1
kind: Service
metadata:
  name: ml-model-service
spec:
  selector:
    app: ml-model
  ports:
    - port: 80
      targetPort: 8000
  type: ClusterIP
---
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: ml-model-ingress
  annotations:
    nginx.ingress.kubernetes.io/proxy-body-size: "50m"  # allow request bodies up to 50 MB
spec:
  ingressClassName: nginx  # replaces the deprecated kubernetes.io/ingress.class annotation
  rules:
    - host: ml.example.com
      http:
        paths:
          - path: /
            pathType: Prefix
            backend:
              service:
                name: ml-model-service
                port:
                  number: 80

3.3 HorizontalPodAutoscaler

apiVersion: autoscaling/v2
kind: HorizontalPodAutoscaler
metadata:
  name: ml-model-hpa
spec:
  scaleTargetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  minReplicas: 2
  maxReplicas: 20
  metrics:
    - type: Resource
      resource:
        name: cpu
        target:
          type: Utilization
          averageUtilization: 70
    - type: Resource
      resource:
        name: memory
        target:
          type: Utilization
          averageUtilization: 80
    # Custom per-Pod metric (needs a custom metrics adapter such as Prometheus Adapter)
    - type: Pods
      pods:
        metric:
          name: requests_per_second
        target:
          type: AverageValue
          averageValue: 100
  behavior:
    scaleDown:
      stabilizationWindowSeconds: 300
      policies:
        - type: Percent
          value: 10
          periodSeconds: 60
    scaleUp:
      stabilizationWindowSeconds: 0
      policies:
        - type: Percent
          value: 100
          periodSeconds: 15
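
For each metric, the HPA controller derives a desired replica count from the ratio of the current value to the target, and then uses the largest result across all metrics. As a worked example with assumed numbers (4 replicas averaging 90% CPU against the 70% target configured above):

```latex
\text{desiredReplicas}
  = \left\lceil \text{currentReplicas} \times
    \frac{\text{currentMetricValue}}{\text{desiredMetricValue}} \right\rceil
  = \left\lceil 4 \times \frac{90}{70} \right\rceil
  = \lceil 5.14 \rceil
  = 6
```

The scaleUp policy above permits up to a 100% increase every 15 seconds, so moving from 4 to 6 replicas fits within a single step.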

4. Job (Training)

4.1 Basic Training Job

apiVersion: batch/v1
kind: Job
metadata:
  name: training-job
spec:
  backoffLimit: 3  # number of retries on failure
  activeDeadlineSeconds: 86400  # 24-hour timeout
  ttlSecondsAfterFinished: 3600  # delete the Job 1 hour after it finishes
  template:
    spec:
      restartPolicy: Never
      containers:
        - name: training
          image: ml-training:v1.0
          command: ["python", "train.py"]
          args:
            - "--epochs=100"
            - "--batch-size=32"
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
              cpu: "8"
          volumeMounts:
            - name: data
              mountPath: /data
            - name: output
              mountPath: /output
          env:
            - name: MLFLOW_TRACKING_URI
              valueFrom:
                secretKeyRef:
                  name: ml-secrets
                  key: mlflow-uri
      volumes:
        - name: data
          persistentVolumeClaim:
            claimName: training-data
        - name: output
          persistentVolumeClaim:
            claimName: training-output
      nodeSelector:
        node-type: gpu

4.2 CronJob (Scheduled Retraining)

apiVersion: batch/v1
kind: CronJob
metadata:
  name: daily-retrain
spec:
  schedule: "0 2 * * *"  # every day at 02:00 (controller time zone unless spec.timeZone is set)
  concurrencyPolicy: Forbid
  jobTemplate:
    spec:
      template:
        spec:
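          restartPolicy: OnFailure
          containers:
            - name: retrain
              image: ml-training:v1.0
              # Kubernetes does not run shell substitutions in env values,
              # so compute DATE in a shell before starting Python.
              command:
                - /bin/sh
                - -c
                - |
                  export DATE="$(date +%Y-%m-%d)"
                  exec python retrain.py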

5. Distributed Training

5.1 PyTorch Distributed (StatefulSet)

apiVersion: apps/v1
kind: StatefulSet
metadata:
  name: pytorch-distributed
spec:
  serviceName: pytorch-distributed
  replicas: 4  # number of workers (must match --nnodes)
  selector:
    matchLabels:
      app: pytorch-distributed
  template:
    metadata:
      labels:
        app: pytorch-distributed
    spec:
      containers:
        - name: worker
          image: pytorch-training:v1.0
          command:
            - "torchrun"
            - "--nnodes=4"
            - "--nproc_per_node=1"
            - "--rdzv_id=job1"
            - "--rdzv_backend=c10d"
            - "--rdzv_endpoint=pytorch-distributed-0.pytorch-distributed:29500"
            - "train.py"
          ports:
            - containerPort: 29500
              name: rdzv
          resources:
            limits:
              nvidia.com/gpu: 1
              memory: "32Gi"
          env:
            - name: MASTER_ADDR
              value: "pytorch-distributed-0.pytorch-distributed"
            - name: MASTER_PORT
              value: "29500"
---
apiVersion: v1
kind: Service
metadata:
  name: pytorch-distributed
spec:
  clusterIP: None  # headless Service: gives each Pod a stable DNS name
  selector:
    app: pytorch-distributed
  ports:
    - port: 29500
      name: rdzv

5.2 Kubeflow Training Operator

apiVersion: kubeflow.org/v1
kind: PyTorchJob
metadata:
  name: pytorch-distributed-job
spec:
  pytorchReplicaSpecs:
    Master:
      replicas: 1
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch-training:v1.0
              command:
                - "python"
                - "train.py"
                - "--backend=nccl"
              resources:
                limits:
                  nvidia.com/gpu: 1
    Worker:
      replicas: 3
      restartPolicy: OnFailure
      template:
        spec:
          containers:
            - name: pytorch
              image: pytorch-training:v1.0
              command:
                - "python"
                - "train.py"
                - "--backend=nccl"
              resources:
                limits:
                  nvidia.com/gpu: 1

6. GPU Management

6.1 NVIDIA Device Plugin

# Install
kubectl create -f https://raw.githubusercontent.com/NVIDIA/k8s-device-plugin/v0.14.0/nvidia-device-plugin.yml

# Verify that nodes report allocatable GPUs
kubectl get nodes -o jsonpath='{.items[*].status.allocatable.nvidia\.com/gpu}'

6.2 GPU Sharing (Time-Slicing)

# ConfigMap for time-slicing
apiVersion: v1
kind: ConfigMap
metadata:
  name: nvidia-device-plugin-config
  namespace: kube-system
data:
  config.yaml: |
    version: v1
    sharing:
      timeSlicing:
        renameByDefault: false
        failRequestsGreaterThanOne: false
        resources:
          - name: nvidia.com/gpu
            replicas: 4  # advertise each physical GPU as 4 schedulable replicas
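
With renameByDefault: false the shared replicas are still advertised as nvidia.com/gpu, so a Pod requests a slice the same way it would request a whole GPU. A minimal sketch, assuming the ml-model:v1.0 image used in earlier sections and a hypothetical Pod name:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: shared-gpu-inference  # hypothetical name
spec:
  containers:
    - name: inference
      image: ml-model:v1.0
      resources:
        limits:
          nvidia.com/gpu: 1  # one time-sliced replica, not a whole physical GPU
```

Time-slicing provides no memory or fault isolation between the Pods sharing a GPU, so it tends to suit light inference or development workloads better than large training runs.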

6.3 Selecting GPU Nodes

# Select a specific GPU model (label published by NVIDIA GPU Feature Discovery)
nodeSelector:
  nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB

# Or use node affinity
affinity:
  nodeAffinity:
    requiredDuringSchedulingIgnoredDuringExecution:
      nodeSelectorTerms:
        - matchExpressions:
            - key: nvidia.com/gpu.memory
              operator: Gt
              values: ["40000"]  # more than 40000 MiB (about 40 GB) of GPU memory

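7. Monitoring

7.1 Prometheus + Grafana

# ServiceMonitor for ML metrics (requires the Prometheus Operator CRDs;
# selects Services labeled app: ml-model that expose a port named "metrics")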
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: ml-model-monitor
spec:
  selector:
    matchLabels:
      app: ml-model
  endpoints:
    - port: metrics
      interval: 15s
      path: /metrics
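
A ServiceMonitor matches Services by label and scrapes a Service port by name, so the ClusterIP Service from section 3.2 (which has neither a label nor a named port) would not be picked up as written. A sketch of a Service that this ServiceMonitor would match, assuming the application serves /metrics on port 8000; the Service name is hypothetical:

```yaml
apiVersion: v1
kind: Service
metadata:
  name: ml-model-metrics  # hypothetical name
  labels:
    app: ml-model         # matched by the ServiceMonitor's selector.matchLabels
spec:
  selector:
    app: ml-model
  ports:
    - name: metrics       # matched by endpoints[].port
      port: 8000          # assumption: metrics share the inference port
      targetPort: 8000
```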

7.2 GPU Metrics (DCGM Exporter)

# Install DCGM Exporter
helm repo add gpu-helm-charts https://nvidia.github.io/dcgm-exporter/helm-charts
helm install dcgm-exporter gpu-helm-charts/dcgm-exporter

# Key metrics
# DCGM_FI_DEV_GPU_UTIL: GPU utilization
# DCGM_FI_DEV_MEM_COPY_UTIL: memory bandwidth utilization
# DCGM_FI_DEV_FB_USED: framebuffer memory in use
# DCGM_FI_DEV_POWER_USAGE: power draw

8. Karpenter (Node Autoscaling)

8.1 GPU NodePool

apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: gpu-pool
spec:
  template:
    spec:
      requirements:
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["g5.xlarge", "g5.2xlarge", "g5.4xlarge"]
        - key: karpenter.sh/capacity-type
          operator: In
          values: ["spot", "on-demand"]
        - key: kubernetes.io/arch
          operator: In
          values: ["amd64"]
      nodeClassRef:
        name: gpu-node-class
      taints:
        - key: nvidia.com/gpu
          value: "true"
          effect: NoSchedule
  limits:
    cpu: 1000
    memory: 4000Gi
    nvidia.com/gpu: 100
  disruption:
    consolidationPolicy: WhenEmpty
    consolidateAfter: 5m
---
apiVersion: karpenter.k8s.aws/v1beta1
kind: EC2NodeClass
metadata:
  name: gpu-node-class
spec:
  amiFamily: AL2
  role: KarpenterNodeRole
  subnetSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  securityGroupSelectorTerms:
    - tags:
        karpenter.sh/discovery: my-cluster
  blockDeviceMappings:
    - deviceName: /dev/xvda
      ebs:
        volumeSize: 200Gi
        volumeType: gp3
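
Because the NodePool taints its nodes, only Pods that tolerate nvidia.com/gpu can be scheduled onto them. A sketch of the matching Pod-spec fragment, assuming the karpenter.sh/nodepool node label that Karpenter v1beta1 applies to nodes it provisions; the nodeSelector is optional and only pins the Pod to this pool:

```yaml
# Pod spec fragment (goes under spec: of a Pod or Pod template)
nodeSelector:
  karpenter.sh/nodepool: gpu-pool
tolerations:
  - key: nvidia.com/gpu
    operator: Exists
    effect: NoSchedule
containers:
  - name: training
    image: ml-training:v1.0
    resources:
      limits:
        nvidia.com/gpu: 1  # the pending GPU request is what triggers provisioning
```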

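9. Cost Optimization

9.1 Node Cost Structure

Cluster component	Cost (approximate; varies by region)	How to optimize
EKS Control Plane	$0.10/hour (~$72/month)	Not possible (fixed)
Worker nodes	EC2 cost	Spot, RI, Karpenter
NAT Gateway	$0.045/hour + data processing	Use VPC endpoints
Load Balancer	from about $0.025/hour	Consolidate, route through Ingress
EBS Volumes	$0.08/GB-month (gp3)	Right-size volumes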

9.2 Using Spot Instances

# Prefer Spot capacity with Karpenter
apiVersion: karpenter.sh/v1beta1
kind: NodePool
metadata:
  name: spot-pool
spec:
  template:
    spec:
      requirements:
        - key: karpenter.sh/capacity-type
          operator: In
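          values: ["spot", "on-demand"]  # Karpenter tries Spot first when both are allowed
        - key: node.kubernetes.io/instance-type
          operator: In
          values: ["m5.xlarge", "m5.2xlarge", "m6i.xlarge"]  # several types improve Spot availability
      nodeClassRef:
        name: default  # placeholder: name of an existing EC2NodeClass
  limits:
    cpu: 1000
  disruption:
    # consolidateAfter cannot be combined with WhenUnderutilized in v1beta1
    consolidationPolicy: WhenUnderutilized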
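
Spot nodes can be reclaimed on short notice (about two minutes on AWS), so workloads that cannot tolerate interruption are often pinned to on-demand capacity while batch jobs stay on Spot. A sketch of the Pod-spec fragment, using the karpenter.sh/capacity-type label that the NodePool requirement above already references:

```yaml
# Pod spec fragment: keep this workload off Spot nodes
nodeSelector:
  karpenter.sh/capacity-type: on-demand
```

Training jobs left on Spot generally need to checkpoint regularly so that a retry can resume instead of starting over.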

9.3 Resource Optimization

# Set appropriate requests/limits
resources:
  requests:
    memory: "2Gi"   # based on observed usage
    cpu: "500m"
  limits:
    memory: "4Gi"   # roughly 1.5-2x the request
    cpu: "1"

# Use the VPA (Vertical Pod Autoscaler)
apiVersion: autoscaling.k8s.io/v1
kind: VerticalPodAutoscaler
metadata:
  name: ml-model-vpa
spec:
  targetRef:
    apiVersion: apps/v1
    kind: Deployment
    name: ml-model
  updatePolicy:
    # "Auto" applies recommendations by evicting Pods; avoid combining it with
    # an HPA that scales the same workload on CPU or memory
    updateMode: "Auto"

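9.4 Cost Monitoring

# Install Kubecost
helm repo add kubecost https://kubecost.github.io/cost-analyzer/
helm install kubecost kubecost/cost-analyzer \
  --namespace kubecost \
  --create-namespace

# Cost per namespace (requires the kubectl-cost plugin)
kubectl cost namespace --show-all-resources

# Cost per Pod
kubectl cost pod -n ml-workloads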

10. Practical Tips

10.1 GPU Workload Checklist

Before deployment:
[ ] NVIDIA device plugin DaemonSet (nvidia-device-plugin) is running and GPU drivers are installed on the nodes
[ ] Nodes report allocatable GPUs (nvidia.com/gpu)
[ ] nodeSelector/tolerations are set appropriately
[ ] /dev/shm volume is mounted (for PyTorch DataLoader)

After deployment:
[ ] Monitor GPU utilization (DCGM Exporter)
[ ] Reduce the batch size if OOM errors occur
[ ] Verify NCCL communication for distributed training

10.2 Troubleshooting

# Pod stuck in Pending
kubectl describe pod <pod-name>
# Check Events for the cause: insufficient resources, nodeSelector mismatch, etc.

# GPU not detected
kubectl logs -n kube-system -l name=nvidia-device-plugin-ds
# Check the driver and CUDA versions

# OOMKilled
kubectl describe pod <pod-name> | grep -A5 "Last State"
# Raise limits.memory or reduce the batch size

# Image pull failure
kubectl describe pod <pod-name> | grep -A10 "Events"
# Check ECR authentication, the image tag, and network access

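10.3 Security Best Practices

# 1. Run as a non-root user (fsGroup is valid only in the Pod-level spec.securityContext)
securityContext:
  runAsNonRoot: true
  runAsUser: 1000
  fsGroup: 1000

# 2. Read-only root filesystem (container-level securityContext; mount an emptyDir for writable paths)
securityContext:
  readOnlyRootFilesystem: true
volumeMounts:
  - name: tmp
    mountPath: /tmp
volumes:
  - name: tmp
    emptyDir: {}

# 3. Network policy (only Pods labeled role: api-gateway may reach port 8000)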
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: ml-model-policy
spec:
  podSelector:
    matchLabels:
      app: ml-model
  ingress:
    - from:
        - podSelector:
            matchLabels:
              role: api-gateway
      ports:
        - port: 8000
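
An allow rule like the one above is often paired with a namespace-wide default-deny policy, so that Pods not covered by any allow rule receive no ingress traffic at all. A minimal sketch with a hypothetical policy name; enforcement requires a CNI plugin that implements NetworkPolicy:

```yaml
apiVersion: networking.k8s.io/v1
kind: NetworkPolicy
metadata:
  name: default-deny-ingress  # hypothetical name
spec:
  podSelector: {}             # empty selector: applies to every Pod in the namespace
  policyTypes:
    - Ingress                 # no ingress rules listed, so all ingress is denied
```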

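10.4 Production Deployment Checklist

[ ] Liveness/readiness probes configured
[ ] Resource requests/limits set
[ ] PodDisruptionBudget configured
[ ] HorizontalPodAutoscaler configured
[ ] Log collection (Fluentd/Fluent Bit)
[ ] Metrics collection (Prometheus)
[ ] Alerts configured (error rate, latency)
[ ] Rolling update strategy configured
[ ] Secrets managed externally (External Secrets Operator) and encrypted at rest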
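
For the PodDisruptionBudget item, a minimal sketch that targets the ml-model Deployment from section 3.1. The resource name and the maxUnavailable: 1 value are assumptions; the value was chosen so that node drains can still proceed when the HPA has scaled down to its minimum of 2 replicas.

```yaml
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: ml-model-pdb  # hypothetical name
spec:
  maxUnavailable: 1   # at most one Pod may be evicted voluntarily at a time
  selector:
    matchLabels:
      app: ml-model
```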
