
CI/CD for ML

How to implement continuous integration (CI) and continuous delivery (CD) for code, data, and models in ML projects.


1. Characteristics of ML CI/CD

1.1 Traditional CI/CD vs ML CI/CD

| Aspect | Traditional CI/CD | ML CI/CD |
| --- | --- | --- |
| Test target | Code | Code + data + model |
| Build artifact | Binary/container | Model artifact |
| Test duration | Minutes | Hours to days |
| Pass criteria | Tests pass | Tests pass + performance metrics |
| Rollback | Deploy previous version | Deploy previous model |

1.2 ML CI/CD Pipeline Structure

[Code Change]     [Data Change]     [Schedule]
      |                 |               |
      v                 v               v
+--------------------------------------------------+
|                  CI Pipeline                      |
+--------------------------------------------------+
| 1. Code Lint & Format Check                      |
| 2. Unit Tests                                    |
| 3. Data Validation                               |
| 4. Feature Engineering Tests                     |
| 5. Model Training (Subset)                       |
| 6. Model Evaluation                              |
+--------------------------------------------------+
                      |
                      v
+--------------------------------------------------+
|                  CD Pipeline                      |
+--------------------------------------------------+
| 1. Full Model Training                           |
| 2. Model Registry Registration                   |
| 3. Staging Deployment                            |
| 4. Integration Tests                             |
| 5. Performance Tests                             |
| 6. Production Deployment                         |
+--------------------------------------------------+
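The gating between the CI stages above can be sketched as a short driver that runs each stage in order and stops at the first failure. This is an illustrative sketch, not the workflow engine itself; the stage names and the `run_ci` helper are assumptions.

```python
# Minimal sketch of stage gating: each CI stage is a function returning
# True/False, and the pipeline stops at the first failing stage.
from typing import Callable, List, Tuple

def run_ci(stages: List[Tuple[str, Callable[[], bool]]]) -> Tuple[bool, List[str]]:
    """Run stages in order; return (overall_ok, names of stages that passed)."""
    passed = []
    for name, stage in stages:
        if not stage():
            return False, passed
        passed.append(name)
    return True, passed

if __name__ == "__main__":
    ok, passed = run_ci([
        ("lint", lambda: True),
        ("unit-tests", lambda: True),
        ("data-validation", lambda: False),  # simulate a failing stage
        ("train-subset", lambda: True),      # never reached
    ])
    print(ok, passed)  # False ['lint', 'unit-tests']
```

In a real workflow the same effect comes from `needs:` dependencies between jobs, as in the GitHub Actions examples below.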

2. Implementing the CI Pipeline

2.1 GitHub Actions: Basic Structure

# .github/workflows/ml-ci.yml
name: ML CI Pipeline

on:
  push:
    branches: [main, develop]
  pull_request:
    branches: [main]
  schedule:
    - cron: '0 2 * * *'  # daily at 2 AM

env:
  PYTHON_VERSION: '3.10'
  MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}

jobs:
  lint-and-format:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install dependencies
        run: |
          pip install ruff black isort mypy

      - name: Run linters
        run: |
          ruff check src/
          black --check src/
          isort --check-only src/
          mypy src/ --ignore-missing-imports

  unit-tests:
    runs-on: ubuntu-latest
    needs: lint-and-format
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install dependencies
        run: pip install -r requirements.txt -r requirements-test.txt

      - name: Run unit tests
        run: pytest tests/unit/ -v --cov=src --cov-report=xml

      - name: Upload coverage
        uses: codecov/codecov-action@v3
        with:
          files: ./coverage.xml

  data-validation:
    runs-on: ubuntu-latest
    needs: lint-and-format
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install dependencies
        run: pip install -r requirements.txt great_expectations

      - name: Pull data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull data/sample/

      - name: Run data validation
        run: python scripts/validate_data.py

  model-training-test:
    runs-on: ubuntu-latest
    needs: [unit-tests, data-validation]
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: ${{ env.PYTHON_VERSION }}

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Pull sample data
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull data/sample/

      - name: Train model (subset)
        run: |
          python src/train.py \
            --data-path data/sample/ \
            --max-epochs 2 \
            --experiment-name ci-test

      - name: Evaluate model
        run: python src/evaluate.py --model-path models/model.pkl

2.2 Data Validation Script

# scripts/validate_data.py
import great_expectations as gx
import pandas as pd
import sys

def validate_training_data():
    """Validate the training data."""
    # Create a context
    context = gx.get_context()

    # Load the data
    df = pd.read_parquet("data/sample/train.parquet")

    # Create an Expectation Suite
    suite = context.add_expectation_suite("training_data_suite")

    # Define the expectations
    expectations = [
        # Row count
        gx.expectations.ExpectTableRowCountToBeBetween(min_value=1000, max_value=1000000),

        # Required columns
        gx.expectations.ExpectTableColumnsToMatchSet(
            column_set=["feature1", "feature2", "feature3", "target"]
        ),

        # No missing target values
        gx.expectations.ExpectColumnValuesToNotBeNull(column="target"),

        # Value range
        gx.expectations.ExpectColumnValuesToBeBetween(
            column="feature1", min_value=0, max_value=100
        ),

        # Allowed category values
        gx.expectations.ExpectColumnDistinctValuesToBeInSet(
            column="category", value_set=["A", "B", "C", "D"]
        ),
    ]

    # Attach the expectations to the suite
    for expectation in expectations:
        suite.add_expectation(expectation)

    # Create a validator and run the checks
    validator = context.get_validator(
        batch_request=...,  # project-specific batch request (elided)
        expectation_suite=suite,
    )

    results = validator.validate()

    if not results.success:
        print("Data validation failed!")
        for result in results.results:
            if not result.success:
                print(f"  - {result.expectation_config.expectation_type}: FAILED")
        sys.exit(1)

    print("Data validation passed!")

if __name__ == "__main__":
    validate_training_data()
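The expectations above boil down to a handful of predicates. A dependency-free sketch of the same checks (the column names and bounds are the same illustrative ones used above; the helper names are assumptions):

```python
# Plain-Python equivalents of the expectations above, for illustration.
# Each check takes the data as a list of row dicts and returns (name, passed).
def check_rows(rows, min_n=1000, max_n=1_000_000):
    return "row_count", min_n <= len(rows) <= max_n

def check_columns(rows, expected=frozenset({"feature1", "feature2", "feature3", "target"})):
    return "columns", all(set(r) == set(expected) for r in rows)

def check_not_null(rows, column="target"):
    return "not_null", all(r[column] is not None for r in rows)

def check_range(rows, column="feature1", lo=0, hi=100):
    return "range", all(lo <= r[column] <= hi for r in rows)

def validate(rows):
    """Run every check and return a {check_name: passed} map."""
    results = [f(rows) for f in (check_rows, check_columns, check_not_null, check_range)]
    return {name: ok for name, ok in results}
```

The point of Great Expectations over a hand-rolled version like this is the reporting, suite management, and data-source integration, not the predicates themselves.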

2.3 Model Tests

# tests/unit/test_model.py
import pytest
import numpy as np
import pandas as pd
from src.model import ModelTrainer, ModelPredictor

@pytest.fixture
def sample_data():
    """Module-level fixture so both test classes below can use it."""
    np.random.seed(42)
    X = pd.DataFrame({
        'feature1': np.random.randn(100),
        'feature2': np.random.randn(100),
    })
    y = (X['feature1'] + X['feature2'] > 0).astype(int)
    return X, y

class TestModelTrainer:

    def test_trainer_initialization(self):
        trainer = ModelTrainer(n_estimators=10)
        assert trainer.model is not None

    def test_training_basic(self, sample_data):
        X, y = sample_data
        trainer = ModelTrainer(n_estimators=10)
        trainer.fit(X, y)
        assert trainer.is_fitted

    def test_prediction_shape(self, sample_data):
        X, y = sample_data
        trainer = ModelTrainer(n_estimators=10)
        trainer.fit(X, y)
        predictions = trainer.predict(X)
        assert len(predictions) == len(X)

    def test_prediction_values(self, sample_data):
        X, y = sample_data
        trainer = ModelTrainer(n_estimators=10)
        trainer.fit(X, y)
        predictions = trainer.predict(X)
        assert all(p in [0, 1] for p in predictions)

class TestModelPredictor:
    def test_load_model(self, tmp_path):
        # Save a model, then test loading it
        model_path = tmp_path / "model.pkl"
        trainer = ModelTrainer(n_estimators=10)
        trainer.save(model_path)

        predictor = ModelPredictor(model_path)
        assert predictor.model is not None

    def test_prediction_latency(self, sample_data):
        import time
        X, y = sample_data
        trainer = ModelTrainer(n_estimators=10)
        trainer.fit(X, y)

        start = time.time()
        for _ in range(100):
            trainer.predict(X.iloc[[0]])
        elapsed = time.time() - start

        avg_latency = elapsed / 100
        assert avg_latency < 0.1  # 100ms 이하


# tests/integration/test_pipeline.py
def test_end_to_end_pipeline():
    """End-to-end pipeline test."""
    from src.pipeline import run_pipeline

    result = run_pipeline(
        data_path="data/sample/",
        config_path="configs/test.yaml",
    )

    assert result["status"] == "success"
    assert result["metrics"]["accuracy"] > 0.5  # minimum acceptance bar
    assert "model_path" in result

3. Implementing the CD Pipeline

3.1 Production Deployment Workflow

# .github/workflows/ml-cd.yml
name: ML CD Pipeline

on:
  workflow_run:
    workflows: ["ML CI Pipeline"]
    types: [completed]
    branches: [main]

jobs:
  full-training:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: [self-hosted, gpu]
    outputs:
      run_id: ${{ steps.training.outputs.RUN_ID }}
    steps:
      - uses: actions/checkout@v4

      - name: Set up Python
        uses: actions/setup-python@v5
        with:
          python-version: '3.10'

      - name: Install dependencies
        run: pip install -r requirements.txt

      - name: Pull full dataset
        env:
          AWS_ACCESS_KEY_ID: ${{ secrets.AWS_ACCESS_KEY_ID }}
          AWS_SECRET_ACCESS_KEY: ${{ secrets.AWS_SECRET_ACCESS_KEY }}
        run: dvc pull

      - name: Train full model
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          python src/train.py \
            --data-path data/train/ \
            --experiment-name production \
            --register-model

      - name: Save run ID
        run: echo "RUN_ID=$(cat run_id.txt)" >> $GITHUB_OUTPUT
        id: training

  staging-deployment:
    needs: full-training
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Deploy to staging
        env:
          KUBECONFIG: ${{ secrets.KUBECONFIG_STAGING }}
        run: |
          export MODEL_VERSION=${{ needs.full-training.outputs.run_id }}
          envsubst < k8s/staging/deployment.yaml | kubectl apply -f -

      - name: Wait for deployment
        run: kubectl rollout status deployment/ml-model -n staging --timeout=300s

      - name: Run integration tests
        run: |
          pip install pytest requests
          pytest tests/integration/ -v --base-url=https://staging.api.example.com

  performance-tests:
    needs: staging-deployment
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Install k6
        run: |
          sudo gpg -k
          sudo gpg --no-default-keyring --keyring /usr/share/keyrings/k6-archive-keyring.gpg --keyserver hkp://keyserver.ubuntu.com:80 --recv-keys C5AD17C747E3415A3642D57D77C6C491D6AC1D69
          echo "deb [signed-by=/usr/share/keyrings/k6-archive-keyring.gpg] https://dl.k6.io/deb stable main" | sudo tee /etc/apt/sources.list.d/k6.list
          sudo apt-get update
          sudo apt-get install k6

      - name: Run load tests
        run: k6 run tests/performance/load_test.js

      - name: Check latency requirements
        run: python scripts/check_latency.py --threshold-p95=500

  production-deployment:
    needs: performance-tests
    runs-on: ubuntu-latest
    environment:
      name: production
      url: https://api.example.com
    steps:
      - uses: actions/checkout@v4

      - name: Promote model to production
        env:
          MLFLOW_TRACKING_URI: ${{ secrets.MLFLOW_TRACKING_URI }}
        run: |
          python scripts/promote_model.py \
            --model-name production-model \
            --from-stage Staging \
            --to-stage Production

      - name: Deploy to production
        env:
          KUBECONFIG: ${{ secrets.KUBECONFIG_PRODUCTION }}
        run: |
          kubectl apply -f k8s/production/deployment.yaml
          kubectl rollout status deployment/ml-model -n production --timeout=600s

      - name: Canary verification
        run: |
          # Monitor the error rate for 10 minutes
          python scripts/canary_check.py --duration=600 --error-threshold=0.01

      - name: Full rollout
        env:
          KUBECONFIG: ${{ secrets.KUBECONFIG_PRODUCTION }}
        run: |
          kubectl scale deployment/ml-model -n production --replicas=10
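The `scripts/check_latency.py` gate referenced in the performance-tests job is not shown; the P95 check it implies could look roughly like this (the nearest-rank percentile method and sample handling are assumptions):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile, e.g. p=95 for P95."""
    if not samples:
        raise ValueError("no samples")
    ordered = sorted(samples)
    rank = math.ceil(p / 100 * len(ordered))  # 1-based nearest rank
    return ordered[rank - 1]

def latency_gate(samples_ms, threshold_p95=500):
    """True when the P95 latency is under the threshold (in ms)."""
    return percentile(samples_ms, 95) < threshold_p95
```

In practice the samples would come from the load-test results or a metrics backend; the gate itself is just this comparison.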

3.2 Model Promotion Script

# scripts/promote_model.py
import mlflow
from mlflow.tracking import MlflowClient
import argparse

def promote_model(model_name: str, from_stage: str, to_stage: str):
    """Promote the model to the next stage."""
    client = MlflowClient()

    # Find the model version currently in from_stage
    versions = client.get_latest_versions(model_name, stages=[from_stage])

    if not versions:
        raise ValueError(f"No model found in stage: {from_stage}")

    latest_version = versions[0]

    # Performance gate
    run = client.get_run(latest_version.run_id)
    accuracy = run.data.metrics.get("accuracy", 0)

    MINIMUM_ACCURACY = 0.85
    if accuracy < MINIMUM_ACCURACY:
        raise ValueError(f"Model accuracy {accuracy} below threshold {MINIMUM_ACCURACY}")

    # Transition the stage
    client.transition_model_version_stage(
        name=model_name,
        version=latest_version.version,
        stage=to_stage,
        archive_existing_versions=True,
    )

    print(f"Promoted {model_name} v{latest_version.version} from {from_stage} to {to_stage}")
    print(f"Accuracy: {accuracy}")

if __name__ == "__main__":
    parser = argparse.ArgumentParser()
    parser.add_argument("--model-name", required=True)
    parser.add_argument("--from-stage", required=True)
    parser.add_argument("--to-stage", required=True)
    args = parser.parse_args()

    promote_model(args.model_name, args.from_stage, args.to_stage)

3.3 Performance Testing (k6)

// tests/performance/load_test.js
import http from 'k6/http';
import { check, sleep } from 'k6';
import { Rate, Trend } from 'k6/metrics';

const errorRate = new Rate('errors');
const latencyTrend = new Trend('latency');

export const options = {
  stages: [
    { duration: '1m', target: 10 },   // Ramp up
    { duration: '3m', target: 50 },   // Stay at 50
    { duration: '1m', target: 100 },  // Spike
    { duration: '2m', target: 50 },   // Back to normal
    { duration: '1m', target: 0 },    // Ramp down
  ],
  thresholds: {
    http_req_duration: ['p(95)<500'],  // P95 < 500ms
    errors: ['rate<0.01'],              // Error rate < 1%
  },
};

const BASE_URL = __ENV.BASE_URL || 'https://staging.api.example.com';

export default function () {
  const payload = JSON.stringify({
    features: [0.5, 0.3, 0.8, 0.2, 0.6],
  });

  const params = {
    headers: {
      'Content-Type': 'application/json',
      'Authorization': `Bearer ${__ENV.API_TOKEN}`,
    },
  };

  const res = http.post(`${BASE_URL}/predict`, payload, params);

  const success = check(res, {
    'status is 200': (r) => r.status === 200,
    'response has prediction': (r) => JSON.parse(r.body).prediction !== undefined,
    'latency < 500ms': (r) => r.timings.duration < 500,
  });

  errorRate.add(!success);
  latencyTrend.add(res.timings.duration);

  sleep(1);
}

4. Deployment Strategies

4.1 Blue-Green Deployment

# k8s/production/blue-green.yaml
apiVersion: v1
kind: Service
metadata:
  name: ml-model
spec:
  selector:
    app: ml-model
    version: blue  # switch to green to cut over
  ports:
    - port: 80
      targetPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-blue
spec:
  replicas: 5
  selector:
    matchLabels:
      app: ml-model
      version: blue
  template:
    metadata:
      labels:
        app: ml-model
        version: blue
    spec:
      containers:
        - name: ml-model
          image: ml-model:v1.0
          ports:
            - containerPort: 8000
---
apiVersion: apps/v1
kind: Deployment
metadata:
  name: ml-model-green
spec:
  replicas: 5
  selector:
    matchLabels:
      app: ml-model
      version: green
  template:
    metadata:
      labels:
        app: ml-model
        version: green
    spec:
      containers:
        - name: ml-model
          image: ml-model:v2.0
          ports:
            - containerPort: 8000
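Cutting over is just re-pointing the Service selector at the other color. A sketch that builds the JSON merge patch and the corresponding `kubectl` command (the helper names are illustrative; the command is shown, not executed):

```python
import json

COLORS = ("blue", "green")

def selector_patch(target: str) -> dict:
    """JSON merge patch that points the ml-model Service at one color."""
    if target not in COLORS:
        raise ValueError(f"unknown color: {target}")
    return {"spec": {"selector": {"app": "ml-model", "version": target}}}

def kubectl_patch_command(target: str) -> str:
    """The kubectl invocation that applies the patch (for illustration)."""
    patch = json.dumps(selector_patch(target))
    return f"kubectl patch service ml-model -p '{patch}'"
```

Because both Deployments keep running, rolling back is the same operation with the previous color.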

4.2 Canary Deployment

# Canary deployment with Istio
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: ml-model
spec:
  hosts:
    - ml-model
  http:
    - match:
        - headers:
            canary:
              exact: "true"
      route:
        - destination:
            host: ml-model
            subset: canary
    - route:
        - destination:
            host: ml-model
            subset: stable
          weight: 95
        - destination:
            host: ml-model
            subset: canary
          weight: 5
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: ml-model
spec:
  host: ml-model
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
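A progressive rollout typically walks the canary weight up in steps before full cutover. A sketch of the weight schedule that would be patched into the VirtualService above (the 5 → 25 → 50 → 100 schedule is an assumption, not part of the manifest):

```python
STEPS = [5, 25, 50, 100]  # canary traffic percentages per rollout step (illustrative)

def canary_route(step_index: int) -> list:
    """Route weights for a given rollout step, mirroring the
    weighted route block of the VirtualService above."""
    canary = STEPS[step_index]
    return [
        {"destination": {"host": "ml-model", "subset": "stable"}, "weight": 100 - canary},
        {"destination": {"host": "ml-model", "subset": "canary"}, "weight": canary},
    ]
```

Each step would be applied only after the canary health checks (error rate, latency) pass for the previous weight.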

5. Best Practices

5.1 Branching Strategy

[CI/CD branching strategy diagram]

5.2 Checklists

CI checklist:
[ ] Code lint passes
[ ] Unit tests pass (coverage > 80%)
[ ] Data validation passes
[ ] Model training succeeds (sample data)
[ ] Model evaluation criteria met

CD checklist:
[ ] Full-data training completed
[ ] Model registered in the model registry
[ ] Staging deployment succeeded
[ ] Integration tests pass
[ ] Performance tests pass (P95 < 500ms)
[ ] Canary verification passes (error rate < 1%)
[ ] Production deployment approved

6. Troubleshooting Guide

6.1 Common CI/CD Failure Modes

| Failure mode | Cause | Fix |
| --- | --- | --- |
| Data pull fails | Expired DVC credentials | Refresh secrets, rotate tokens |
| Training timeout | Dataset growth | Sampling, parallelization, longer timeout |
| Out of memory | Model/data size | Upgrade runner specs, process in batches |
| Flaky tests | Unpinned random seeds | Pin PYTHONHASHSEED and numpy seeds |
| Deployment fails | Image pull error | Check registry auth, verify tags |

6.2 Data-Related Failures

# Handling data validation failures
- name: Validate and pull data
  id: data_validation
  continue-on-error: true
  run: |
    dvc pull data/train/ || {
      echo "DVC pull failed, attempting fallback"
      # Fall back to cached data
      if [ -d ".dvc/cache" ]; then
        dvc checkout
      else
        echo "No cached data available"
        exit 1
      fi
    }

    # Verify data integrity
    python scripts/validate_data.py || {
      echo "Data validation failed"
      python scripts/generate_data_report.py > data_report.md
      exit 1
    }

- name: Handle data validation failure
  if: steps.data_validation.outcome == 'failure'
  run: |
    # Send a Slack notification
    curl -X POST $SLACK_WEBHOOK -d '{
      "text": "Data validation failed in CI pipeline",
      "attachments": [{"text": "Check data report for details"}]
    }'

6.3 Debugging Training Failures

# scripts/debug_training.py
import sys
import traceback
import json
from datetime import datetime

def training_with_diagnostics():
    """Collect diagnostics when training fails."""
    diagnostics = {
        "timestamp": datetime.now().isoformat(),
        "python_version": sys.version,
        "environment": {},
        "data_stats": {},
        "error": None,
    }

    try:
        # Collect environment info
        import torch
        import numpy as np
        diagnostics["environment"] = {
            "torch_version": torch.__version__,
            "cuda_available": torch.cuda.is_available(),
            "cuda_version": torch.version.cuda if torch.cuda.is_available() else None,
            "gpu_count": torch.cuda.device_count() if torch.cuda.is_available() else 0,
        }

        # Data statistics
        import pandas as pd
        df = pd.read_parquet("data/train.parquet")
        diagnostics["data_stats"] = {
            "rows": len(df),
            "columns": list(df.columns),
            "null_counts": df.isnull().sum().to_dict(),
            "dtypes": {str(k): str(v) for k, v in df.dtypes.items()},
        }

        # Run the actual training
        from src.train import train_model
        train_model()

    except Exception as e:
        diagnostics["error"] = {
            "type": type(e).__name__,
            "message": str(e),
            "traceback": traceback.format_exc(),
        }

        # Save the diagnostics
        with open("training_diagnostics.json", "w") as f:
            json.dump(diagnostics, f, indent=2)

        raise

    return diagnostics

if __name__ == "__main__":
    training_with_diagnostics()

# Using the diagnostics in CI
- name: Train with diagnostics
  id: training
  continue-on-error: true
  run: python scripts/debug_training.py

- name: Upload diagnostics on failure
  if: steps.training.outcome == 'failure'
  uses: actions/upload-artifact@v3
  with:
    name: training-diagnostics
    path: training_diagnostics.json

- name: Comment PR with failure details
  if: steps.training.outcome == 'failure' && github.event_name == 'pull_request'
  uses: actions/github-script@v6
  with:
    script: |
      const fs = require('fs');
      const diagnostics = JSON.parse(fs.readFileSync('training_diagnostics.json'));

      github.rest.issues.createComment({
        owner: context.repo.owner,
        repo: context.repo.repo,
        issue_number: context.issue.number,
        body: `## Training Failed

        **Error:** ${diagnostics.error.type}: ${diagnostics.error.message}

        **Data Stats:**
        - Rows: ${diagnostics.data_stats.rows}
        - Columns: ${diagnostics.data_stats.columns.length}

        <details>
        <summary>Full Traceback</summary>

        \`\`\`
        ${diagnostics.error.traceback}
        \`\`\`
        </details>`
      });

6.4 Automated Deployment Rollback

# .github/workflows/auto-rollback.yml
name: Auto Rollback on Failure

on:
  workflow_run:
    workflows: ["ML CD Pipeline"]
    types: [completed]

jobs:
  check-and-rollback:
    if: ${{ github.event.workflow_run.conclusion == 'success' }}
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4

      - name: Wait for deployment stabilization
        run: sleep 300  # wait 5 minutes

      - name: Check production health
        id: health_check
        run: |
          # Check the error rate
          ERROR_RATE=$(curl -s "http://prometheus:9090/api/v1/query?query=rate(model_errors_total[5m])" | jq '.data.result[0].value[1]')

          # Check P95 latency
          LATENCY=$(curl -s "http://prometheus:9090/api/v1/query?query=histogram_quantile(0.95,rate(model_latency_bucket[5m]))" | jq '.data.result[0].value[1]')

          echo "error_rate=$ERROR_RATE" >> $GITHUB_OUTPUT
          echo "latency=$LATENCY" >> $GITHUB_OUTPUT

          # Compare against thresholds
          if (( $(echo "$ERROR_RATE > 0.05" | bc -l) )); then
            echo "needs_rollback=true" >> $GITHUB_OUTPUT
          elif (( $(echo "$LATENCY > 2.0" | bc -l) )); then
            echo "needs_rollback=true" >> $GITHUB_OUTPUT
          else
            echo "needs_rollback=false" >> $GITHUB_OUTPUT
          fi

      - name: Rollback deployment
        if: steps.health_check.outputs.needs_rollback == 'true'
        env:
          KUBECONFIG: ${{ secrets.KUBECONFIG }}
        run: |
          # Roll back to the previous revision
          kubectl rollout undo deployment/ml-model -n production

          # Wait for the rollback to finish
          kubectl rollout status deployment/ml-model -n production

      - name: Notify rollback
        if: steps.health_check.outputs.needs_rollback == 'true'
        run: |
          curl -X POST ${{ secrets.SLACK_WEBHOOK }} -d '{
            "text": ":warning: Production rollback executed",
            "attachments": [{
              "color": "danger",
              "fields": [
                {"title": "Error Rate", "value": "${{ steps.health_check.outputs.error_rate }}", "short": true},
                {"title": "P95 Latency", "value": "${{ steps.health_check.outputs.latency }}s", "short": true}
              ]
            }]
          }'
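The shell threshold checks in the workflow above are easier to unit-test as a small function. A mirror of the same decision (the thresholds match the ones in the workflow; the function name is illustrative):

```python
def needs_rollback(error_rate: float, p95_latency_s: float,
                   max_error_rate: float = 0.05, max_latency_s: float = 2.0) -> bool:
    """Mirror of the health-check step above: roll back when either the
    error rate or the P95 latency exceeds its threshold."""
    return error_rate > max_error_rate or p95_latency_s > max_latency_s
```

Keeping the decision in a script rather than inline `bc` arithmetic also makes the thresholds configurable per environment.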

6.5 Handling Flaky Tests

# Retry strategy for flaky tests
- name: Run tests with retry
  uses: nick-invision/retry@v2
  with:
    timeout_minutes: 30
    max_attempts: 3
    retry_on: error
    command: |
      # Pin the random seed
      export PYTHONHASHSEED=42

      # Run the tests
      pytest tests/ -v \
        --randomly-seed=42 \
        --tb=short \
        -x  # stop at the first failure

# Or use pytest-retry
- name: Run tests with pytest-retry
  run: |
    pip install pytest-retry
    pytest tests/ -v --retries 2 --retry-delay 5
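Pinning every source of randomness up front removes most flakiness at the root. A helper along these lines (the name `set_all_seeds` is an assumption; NumPy and torch imports are guarded because a CI runner may not have both installed):

```python
import os
import random

def set_all_seeds(seed: int = 42) -> None:
    """Pin Python, hash, NumPy and (if available) torch random seeds."""
    # Note: PYTHONHASHSEED set here only affects subprocesses, not the
    # already-running interpreter; set it in the CI env for full effect.
    os.environ["PYTHONHASHSEED"] = str(seed)
    random.seed(seed)
    try:
        import numpy as np
        np.random.seed(seed)
    except ImportError:
        pass
    try:
        import torch
        torch.manual_seed(seed)
    except ImportError:
        pass
```

Calling this once in a session-scoped pytest fixture keeps every test run on the same random sequence.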

6.6 Resource Management

# Using GPU resources efficiently
jobs:
  train:
    runs-on: [self-hosted, gpu]
    timeout-minutes: 360  # 6-hour limit

    steps:
      - name: Check GPU availability
        run: |
          nvidia-smi
          # Clear GPU memory
          python -c "import torch; torch.cuda.empty_cache()"

      - name: Train with memory management
        run: |
          # Memory-efficient training
          python src/train.py \
            --gradient-checkpointing \
            --fp16 \
            --batch-size 16 \
            --gradient-accumulation-steps 4
        env:
          PYTORCH_CUDA_ALLOC_CONF: max_split_size_mb:512

      - name: Cleanup GPU memory
        if: always()
        run: |
          python -c "import torch; torch.cuda.empty_cache()"
          nvidia-smi
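The `--gradient-accumulation-steps` flag above trades memory for steps: gradients from several micro-batches are averaged before a single optimizer update, so batch size 16 with 4 accumulation steps behaves like batch size 64. A framework-free sketch of that loop (scalar "gradients" stand in for tensors; the helper name is illustrative):

```python
def train_with_accumulation(micro_batch_grads, accum_steps=4):
    """Average gradients over accum_steps micro-batches, then apply one
    'optimizer step'. Returns the effective gradient of each update."""
    updates, buffer = [], []
    for g in micro_batch_grads:
        buffer.append(g)                      # backward() without optimizer.step()
        if len(buffer) == accum_steps:
            updates.append(sum(buffer) / len(buffer))  # one optimizer step
            buffer = []                       # zero_grad()
    return updates
```

In PyTorch the same structure is `loss.backward()` on every micro-batch and `optimizer.step()` / `optimizer.zero_grad()` only every `accum_steps` iterations.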

7. Practical Best Practices

7.1 Optimizing the CI/CD Pipeline

# Caching strategy
- name: Cache dependencies
  uses: actions/cache@v3
  with:
    path: |
      ~/.cache/pip
      ~/.cache/huggingface
      .dvc/cache
    key: ${{ runner.os }}-pip-${{ hashFiles('**/requirements.txt') }}
    restore-keys: |
      ${{ runner.os }}-pip-

# Parallel execution
jobs:
  lint:
    runs-on: ubuntu-latest
    # fast, lightweight job

  test:
    runs-on: ubuntu-latest
    needs: lint
    strategy:
      matrix:
        test-group: [unit, integration, model]
    # parallelized test groups

  train:
    runs-on: [self-hosted, gpu]
    needs: test
    # use the GPU only after lint and tests pass

7.2 Automating the PR Checklist

# .github/workflows/pr-checks.yml
name: PR Quality Checks

on:
  pull_request:
    types: [opened, synchronize]

jobs:
  checklist:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
        with:
          fetch-depth: 0

      - name: Check model changes
        id: model_changes
        run: |
          if git diff --name-only origin/main...HEAD | grep -E "^(src/model|models/)" > /dev/null; then
            echo "model_changed=true" >> $GITHUB_OUTPUT
          fi

      - name: Check data changes
        id: data_changes
        run: |
          if git diff --name-only origin/main...HEAD | grep -E "\.dvc$" > /dev/null; then
            echo "data_changed=true" >> $GITHUB_OUTPUT
          fi

      - name: Comment checklist
        uses: actions/github-script@v6
        with:
          script: |
            const modelChanged = '${{ steps.model_changes.outputs.model_changed }}' === 'true';
            const dataChanged = '${{ steps.data_changes.outputs.data_changed }}' === 'true';

            let checklist = '## PR Checklist\n\n';

            if (modelChanged) {
              checklist += '### Model Changes Detected\n';
              checklist += '- [ ] Model performance validated\n';
              checklist += '- [ ] Backward compatibility checked\n';
              checklist += '- [ ] Model size acceptable\n';
            }

            if (dataChanged) {
              checklist += '### Data Changes Detected\n';
              checklist += '- [ ] Data quality validated\n';
              checklist += '- [ ] DVC push completed\n';
              checklist += '- [ ] Data documentation updated\n';
            }

            checklist += '\n### General\n';
            checklist += '- [ ] Tests pass locally\n';
            checklist += '- [ ] Code review completed\n';

            github.rest.issues.createComment({
              owner: context.repo.owner,
              repo: context.repo.repo,
              issue_number: context.issue.number,
              body: checklist
            });

7.3 Managing Per-Environment Configuration

# configs/environments/
# production.yaml
model:
  batch_size: 64
  precision: fp16

deployment:
  replicas: 4
  resources:
    gpu: 2
    memory: 32Gi

monitoring:
  alert_threshold_latency: 500ms
  alert_threshold_error_rate: 0.01

# staging.yaml
model:
  batch_size: 32
  precision: fp32

deployment:
  replicas: 1
  resources:
    gpu: 1
    memory: 16Gi

# Loading the environment config in CI
- name: Load environment config
  run: |
    ENV=${{ github.ref == 'refs/heads/main' && 'production' || 'staging' }}
    cp configs/environments/${ENV}.yaml config.yaml

    # Export as environment variables
    yq eval -o=shell config.yaml > env_vars.sh
    source env_vars.sh
