LLM Observability Architecture¶

A design guide for observability in production LLM systems. It integrates tracing, metrics, logging, and evaluation to manage the quality and cost of LLM applications systematically.

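Overview¶

Unlike traditional software, LLM systems produce non-deterministic output. The same input can yield different responses, and a model update or prompt change can shift quality in ways that are hard to predict. For this reason, LLM systems need a dedicated observability stack that goes beyond conventional APM.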

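Why LLM Observability Is Needed¶

Problem	Description	How observability addresses it
Hallucination	Generates content that contradicts the facts	Track a faithfulness metric
Quality degradation	Performance drops after a model or prompt change	A/B comparison, regression detection
Cost spikes	Unnecessary token usage, inefficient calls	Real-time token and cost monitoring
Slow responses	Degraded TTFT and TPS	Track latency distributions
RAG failures	Poor retrieval quality, irrelevant context	Measure retrieval quality
Hard to debug	Root cause of failures in complex chains or agents is hard to locate	Per-span tracing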

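Core Components¶

LLM Observability rests on four pillars: tracing, metrics, logging, and evaluation. Alerting sits alongside evaluation and consumes the signals the first three produce.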

┌─────────────────────────────────────────────────────────────┐
│                LLM Observability Pillars                     │
│                                                             │
│  ┌──────────────┐  ┌──────────────┐  ┌──────────────┐      │
│  │   Tracing    │  │   Metrics    │  │   Logging    │      │
│  │              │  │              │  │              │      │
│  │ - Span-based │  │ - Latency    │  │ - Input/     │      │
│  │ - Call chain │  │ - Token      │  │   Output     │      │
│  │ - Span timing│  │ - Cost       │  │ - Error      │      │
│  │ - Dep. map   │  │ - Quality    │  │ - Metadata   │      │
│  └──────┬───────┘  └──────┬───────┘  └──────┬───────┘      │
│         │                 │                 │               │
│         └────────────┬────┴────────────┬────┘               │
│                      ▼                 ▼                    │
│              ┌──────────────┐  ┌──────────────┐            │
│              │  Evaluation  │  │  Alerting    │            │
│              │              │  │              │            │
│              │ - Online     │  │ - Drift      │            │
│              │ - Offline    │  │ - Threshold  │            │
│              │ - Human      │  │ - Anomaly    │            │
│              └──────────────┘  └──────────────┘            │
└─────────────────────────────────────────────────────────────┘

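Tracing Architecture¶

OpenTelemetry-Based LLM Trace Design¶

LLM calls are structured using OpenTelemetry's trace/span model. A single user request passes through several LLM calls, retrieval steps, and post-processing, and the trace records that path hierarchically.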

Trace: user_query_12345
│
├── Span: intent_classification (23ms)
│   ├── attr: model=gpt-4o-mini
│   ├── attr: input_tokens=150
│   └── attr: output_tokens=5
│
├── Span: rag_retrieval (180ms)
│   ├── Span: embedding (45ms)
│   │   └── attr: model=text-embedding-3-small
│   ├── Span: vector_search (120ms)
│   │   └── attr: top_k=20, results=20
│   └── Span: reranking (15ms)
│       └── attr: model=cohere-rerank-v3, top_n=5
│
├── Span: llm_generation (1200ms)
│   ├── attr: model=gpt-4o
│   ├── attr: input_tokens=2500
│   ├── attr: output_tokens=350
│   ├── attr: ttft=280ms
│   └── attr: temperature=0.1
│
└── Span: post_processing (15ms)
    └── attr: format=markdown

Span Attribute Design¶

Attribute	Type	Description
`llm.model`	string	Model name
`llm.provider`	string	Provider (OpenAI, Anthropic, etc.)
`llm.input_tokens`	int	Number of input tokens
`llm.output_tokens`	int	Number of output tokens
`llm.total_tokens`	int	Total number of tokens
`llm.cost_usd`	float	Cost of the call (USD)
`llm.ttft_ms`	float	Time to first token (ms)
`llm.tps`	float	Output tokens per second
`llm.temperature`	float	Sampling temperature
`llm.prompt_template`	string	ID of the prompt template used
`rag.query`	string	Retrieval query
`rag.num_results`	int	Number of retrieved results
`rag.relevance_score`	float	Mean relevance score
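Tool Comparison¶

Observability Platform Comparison¶

Tool	Type	Strengths	Weaknesses	Pricing model	Best fit
LangSmith	SaaS	LangChain-native, integrated evaluation	Tied to LangChain	Free tier + paid	Teams using LangChain
Langfuse	OSS/SaaS	Open source, self-hostable	Smaller ecosystem	Free (self-hosted) / paid (cloud)	Teams needing customization
Phoenix (Arize)	OSS	Notebook integration, local analysis	Limited production scale	Free	Experimentation/research teams
OpenLLMetry	OSS	OpenTelemetry-native	No UI (needs a backend)	Free	Teams with existing OTel infrastructure
Helicone	SaaS	Proxy-based, minimal code changes	Added proxy latency	Free tier + paid	Teams needing fast adoption
W&B Weave	SaaS	Integrated experiment tracking, ML workflows	Limited LLM-specific features	Free tier + paid	ML teams already on W&B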

Tool Selection Decision Flow¶

Self-hosting required?
├── Yes → Langfuse (Docker/K8s)
└── No
    ├── Using LangChain? → LangSmith
    ├── Existing OTel infrastructure? → OpenLLMetry + Grafana
    ├── Minimize code changes? → Helicone (proxy)
    └── Experimentation/research? → Phoenix (Arize)
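
The decision flow can be written as a small function, which is handy when the choice has to be documented or reviewed. This is only a sketch: the function name and boolean parameters are illustrative, and the final `"undecided"` fallback is an addition, since the flow above does not say what to do when no branch matches.

```python
def choose_observability_tool(
    self_hosting_required: bool,
    uses_langchain: bool = False,
    has_otel_infra: bool = False,
    minimal_code_changes: bool = False,
    research_only: bool = False,
) -> str:
    """Walk the decision flow top to bottom; the first matching branch wins."""
    if self_hosting_required:
        return "Langfuse"
    if uses_langchain:
        return "LangSmith"
    if has_otel_infra:
        return "OpenLLMetry + Grafana"
    if minimal_code_changes:
        return "Helicone"
    if research_only:
        return "Phoenix (Arize)"
    # Not covered by the flow; added so the function always returns a value.
    return "undecided"


choice = choose_observability_tool(self_hosting_required=False, has_otel_infra=True)
```

Because branches are checked in order, a team that both uses LangChain and runs OTel infrastructure lands on LangSmith, mirroring the order of the flow.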

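Metric Design¶

Latency Metrics¶

Metric	Definition	How measured	Target
TTFT (Time to First Token)	Time from request to first token	Timestamp of the first streamed chunk	< 500ms
TPS (Tokens per Second)	Tokens generated per second	output_tokens / generation_time	> 30 TPS
E2E Latency	Total request-to-response time	Request start to response complete	< 3s (including RAG)
Retrieval Latency	Time spent on retrieval	embedding + search + rerank	< 500ms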
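
As a rough illustration of how TTFT, TPS, and E2E latency can be captured from a streaming response, the sketch below wraps any iterable of text chunks. `measure_stream` is a hypothetical helper; the whitespace token counter is a placeholder for a real tokenizer, and `generation_time` is assumed to start at the first chunk. The clock is injectable so the example is deterministic.

```python
import time
from typing import Callable, Iterable


def measure_stream(
    chunks: Iterable[str],
    count_tokens: Callable[[str], int] = lambda s: len(s.split()),
    clock: Callable[[], float] = time.perf_counter,
) -> dict:
    """Consume a stream and derive TTFT, TPS, and E2E latency."""
    start = clock()
    first_chunk_at = None
    output_tokens = 0
    for chunk in chunks:
        if first_chunk_at is None:
            first_chunk_at = clock()  # timestamp of the first streamed chunk
        output_tokens += count_tokens(chunk)
    end = clock()

    if first_chunk_at is None:  # the stream produced nothing
        return {"ttft_ms": None, "tps": 0.0,
                "e2e_ms": (end - start) * 1000, "output_tokens": 0}

    generation_s = end - first_chunk_at
    return {
        "ttft_ms": (first_chunk_at - start) * 1000,
        "tps": output_tokens / generation_s if generation_s > 0 else 0.0,
        "e2e_ms": (end - start) * 1000,
        "output_tokens": output_tokens,
    }


# Deterministic demo: a fake clock returning start, first chunk, and end times.
ticks = iter([0.0, 0.25, 1.25])
result = measure_stream(["a b", "c d e"], clock=lambda: next(ticks))
```

In production the default `time.perf_counter` clock would be used and `chunks` would be the provider's streaming iterator.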

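Token and Cost Metrics¶

Metric	Unit	Aggregation	Purpose
input_tokens	count	By model, feature, time	Prompt optimization
output_tokens	count	By model, feature, time	Tuning output limits
cost_usd	USD	By model, team, feature	Budget management
cache_hit_rate	ratio	By time	Cache efficiency
error_rate	ratio	By model, error type	Reliability monitoring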

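Quality Metrics¶

Metric	Range	How measured	Description
Faithfulness	0-1	NLI-based automatic evaluation	Factual consistency with the retrieved context
Relevance	0-1	LLM-as-Judge	Relevance of the answer to the question
Completeness	0-1	LLM-as-Judge	Completeness of the answer
Toxicity	0-1	Classifier model	Presence of harmful content
User Satisfaction	1-5	User feedback (1-5 rating; thumbs up/down where only binary feedback is collected)	Actual user satisfaction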

Evaluation Pipeline¶

Three-Stage Evaluation¶

┌─────────────────────────────────────────────────────────┐
│                 Evaluation Pipeline                      │
│                                                         │
│  ┌──────────────┐                                       │
│  │ Online Eval  │  Real time, every request             │
│  │              │  - Latency check                      │
│  │              │  - Token count                        │
│  │              │  - Basic checks (length, format)      │
│  │              │  - Toxicity filter                    │
│  └──────┬───────┘                                       │
│         ▼                                               │
│  ┌──────────────┐                                       │
│  │ Offline Eval │  Batch, sampled (daily)               │
│  │              │  - LLM-as-Judge                       │
│  │              │  - RAG quality (RAGAS)                │
│  │              │  - Golden dataset comparison          │
│  │              │  - Regression tests                   │
│  └──────┬───────┘                                       │
│         ▼                                               │
│  ┌──────────────┐                                       │
│  │  Human Eval  │  Periodic, key cases (weekly)         │
│  │              │  - Expert review                      │
│  │              │  - Edge case validation               │
│  │              │  - New feature QA                     │
│  └──────────────┘                                       │
└─────────────────────────────────────────────────────────┘
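
The online stage has to be cheap enough to run on every request. A minimal sketch of such a gate follows, assuming hypothetical thresholds and a keyword blocklist standing in for a real toxicity classifier; `run_online_checks` and `OnlineCheckResult` are illustrative names, not an existing API.

```python
import json
from dataclasses import dataclass, field


@dataclass
class OnlineCheckResult:
    passed: bool
    failures: list = field(default_factory=list)


def run_online_checks(
    answer: str,
    latency_s: float,
    output_tokens: int,
    *,
    max_latency_s: float = 3.0,      # E2E target from the latency table
    max_output_tokens: int = 1024,   # hypothetical token budget
    expect_json: bool = False,       # format check applies only when set
    blocked_terms: tuple = (),       # stand-in for a toxicity classifier
) -> OnlineCheckResult:
    """Cheap per-request checks: latency, token count, length, format, toxicity."""
    failures = []
    if latency_s > max_latency_s:
        failures.append("latency")
    if output_tokens > max_output_tokens:
        failures.append("token_count")
    if not answer.strip():
        failures.append("empty")
    if expect_json:
        try:
            json.loads(answer)
        except json.JSONDecodeError:
            failures.append("format")
    lowered = answer.lower()
    if any(term in lowered for term in blocked_terms):
        failures.append("toxicity")
    return OnlineCheckResult(passed=not failures, failures=failures)


ok = run_online_checks('{"answer": "42"}', latency_s=1.2,
                       output_tokens=12, expect_json=True)
bad = run_online_checks("not json", latency_s=4.0,
                        output_tokens=12, expect_json=True)
```

Failures recorded here can be emitted as span attributes or counters, while the heavier LLM-as-Judge scoring stays in the offline batch.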

Offline Evaluation Implementation¶

from datetime import datetime

from datasets import Dataset
from ragas import evaluate
from ragas.metrics import (
    faithfulness,
    answer_relevancy,
    context_precision,
    context_recall,
)

def run_offline_eval(dataset: list[dict]) -> dict:
    """Run the daily offline evaluation.

    Assumes a ragas 0.1-style API; newer releases use different
    dataset column names, so check the version you have installed.

    Args:
        dataset: [{"question": ..., "answer": ...,
                   "contexts": [...], "ground_truth": ...}]
    """
    # evaluate() expects a Hugging Face Dataset, not a plain list of dicts.
    results = evaluate(
        dataset=Dataset.from_list(dataset),
        metrics=[
            faithfulness,
            answer_relevancy,
            context_precision,
            context_recall,
        ],
    )

    return {
        "faithfulness": results["faithfulness"],
        "relevancy": results["answer_relevancy"],
        "context_precision": results["context_precision"],
        "context_recall": results["context_recall"],
        "timestamp": datetime.now().isoformat(),
    }

RAG Observability¶

Measure quality at each stage of the RAG pipeline.

RAG Metric Framework¶

Query → [Retrieval] → [Context] → [Generation] → Answer
         │               │              │
         ▼               ▼              ▼
    - Recall@k      - Relevance    - Faithfulness
    - Precision@k   - Coverage     - Completeness
    - MRR           - Noise ratio  - Hallucination rate
    - nDCG          - Redundancy   - Citation accuracy
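
The retrieval-stage metrics in the first column can be computed from a ranked list of document IDs and a set of relevant IDs. The sketch below assumes binary relevance and no duplicate IDs in the ranking; the function names are illustrative. MRR is the mean of `reciprocal_rank` across queries.

```python
import math


def recall_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Share of relevant documents that appear in the top k."""
    rel = set(relevant)
    return len(set(retrieved[:k]) & rel) / len(rel) if rel else 0.0


def precision_at_k(retrieved: list, relevant: list, k: int) -> float:
    """Share of the top k that is relevant."""
    rel = set(relevant)
    return sum(1 for d in retrieved[:k] if d in rel) / k if k > 0 else 0.0


def reciprocal_rank(retrieved: list, relevant: list) -> float:
    """1 / rank of the first relevant document (0.0 if none is retrieved)."""
    rel = set(relevant)
    for rank, doc_id in enumerate(retrieved, start=1):
        if doc_id in rel:
            return 1.0 / rank
    return 0.0


def ndcg_at_k(retrieved: list, relevant: list, k: int) -> float:
    """nDCG with binary gains: DCG of the ranking over DCG of an ideal ranking."""
    rel = set(relevant)
    dcg = sum(1.0 / math.log2(i + 1)
              for i, d in enumerate(retrieved[:k], start=1) if d in rel)
    ideal_hits = min(len(rel), k)
    idcg = sum(1.0 / math.log2(i + 1) for i in range(1, ideal_hits + 1))
    return dcg / idcg if idcg > 0 else 0.0


retrieved = ["d3", "d1", "d7", "d2"]   # ranked retrieval output
relevant = ["d1", "d2", "d9"]          # labeled ground truth
scores = {
    "recall@4": recall_at_k(retrieved, relevant, 4),
    "precision@4": precision_at_k(retrieved, relevant, 4),
    "rr": reciprocal_rank(retrieved, relevant),
    "ndcg@4": ndcg_at_k(retrieved, relevant, 4),
}
```

These metrics need labeled relevance judgments, so in practice they are computed in the offline stage against a golden dataset rather than on live traffic.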

Per-Stage Measurement Code¶

import numpy as np


class RAGObserver:
    """Observes each stage of a RAG pipeline."""

    def __init__(self, metrics_client, count_tokens=len):
        # metrics_client: any object exposing record(dict).
        # count_tokens: tokenizer-specific counter; the len default counts
        # characters and is only a placeholder.
        self.metrics = metrics_client
        self.count_tokens = count_tokens

    def observe_retrieval(self, query: str, results: list,
                          ground_truth: list | None = None):
        """Observe the retrieval stage."""
        scores = [r.score for r in results]
        self.metrics.record({
            "rag.retrieval.num_results": len(results),
            # Guard against an empty result set.
            "rag.retrieval.avg_score": float(np.mean(scores)) if scores else 0.0,
            "rag.retrieval.max_score": max(scores, default=0.0),
            "rag.retrieval.min_score": min(scores, default=0.0),
        })

        if ground_truth:
            retrieved_ids = {r.doc_id for r in results}
            relevant_ids = set(ground_truth)
            hits = len(retrieved_ids & relevant_ids)
            self.metrics.record({
                "rag.retrieval.recall": hits / len(relevant_ids),
                "rag.retrieval.precision":
                    hits / len(retrieved_ids) if retrieved_ids else 0.0,
            })

    def observe_context(self, query: str, contexts: list[str]):
        """Observe the context-assembly stage."""
        total_tokens = sum(self.count_tokens(c) for c in contexts)
        self.metrics.record({
            "rag.context.num_chunks": len(contexts),
            "rag.context.total_tokens": total_tokens,
            "rag.context.avg_chunk_tokens":
                total_tokens / len(contexts) if contexts else 0.0,
        })

    def observe_generation(self, query: str, answer: str,
                           contexts: list[str]):
        """Observe the generation stage."""
        # Faithfulness: is the answer grounded in the context?
        faith_score = self._check_faithfulness(answer, contexts)
        self.metrics.record({
            "rag.generation.faithfulness": faith_score,
            "rag.generation.answer_length": len(answer),
        })

    def _check_faithfulness(self, answer: str, contexts: list[str]) -> float:
        """NLI-based faithfulness: fraction of claims the context supports."""
        claims = self._extract_claims(answer)
        if not claims:
            return 1.0
        context_text = " ".join(contexts)
        supported = sum(
            1 for claim in claims if self._nli_check(claim, context_text)
        )
        return supported / len(claims)

    def _extract_claims(self, answer: str) -> list[str]:
        """Split the answer into atomic claims (supply your own extractor)."""
        raise NotImplementedError

    def _nli_check(self, claim: str, context: str) -> bool:
        """Return True if the context entails the claim (supply an NLI model)."""
        raise NotImplementedError

Real-Time Dashboard Design¶

Grafana + Prometheus Setup¶

# prometheus.yml
scrape_configs:
  - job_name: 'llm-app'
    scrape_interval: 15s
    static_configs:
      - targets: ['llm-app:8000']

# Custom metric definitions
# metrics.py
from prometheus_client import Counter, Histogram, Gauge

llm_request_total = Counter(
    'llm_request_total',
    'Total LLM requests',
    ['model', 'endpoint', 'status']
)

llm_latency = Histogram(
    'llm_latency_seconds',
    'LLM request latency',
    ['model', 'endpoint'],
    buckets=[0.1, 0.25, 0.5, 1.0, 2.0, 5.0, 10.0]
)

llm_tokens = Counter(
    'llm_tokens_total',
    'Total tokens used',
    ['model', 'direction']  # direction: input/output
)

llm_cost = Counter(
    'llm_cost_usd_total',
    'Total cost in USD',
    ['model', 'team', 'feature']
)

rag_faithfulness = Gauge(
    'rag_faithfulness_score',
    'RAG faithfulness score (rolling avg)',
    ['pipeline']
)

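Dashboard Panel Layout¶

Panel	Metric	Visualization	Alert condition
Request volume	`llm_request_total`	Time-series graph	Sudden spike or drop
Response time	`llm_latency_seconds`	Histogram	P95 > 5s
Token usage	`llm_tokens_total`	Stacked bar chart	80% of daily limit
Cost tracking	`llm_cost_usd_total`	Pie chart + time series	Daily budget exceeded
Error rate	`llm_request_total{status="error"}`	Ratio gauge	> 5%
RAG quality	`rag_faithfulness_score`	Gauge + trend	< 0.8
Model comparison	Composite	Table	-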

Alerting and Anomaly Detection¶

Alert Rules¶

# alerting_rules.yml
groups:
  - name: llm_alerts
    rules:
      - alert: HighLatency
        # histogram_quantile needs the per-bucket rate, aggregated by le.
        expr: >
          histogram_quantile(0.95,
            sum by (le) (rate(llm_latency_seconds_bucket[5m]))) > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "LLM P95 latency > 5s"

      - alert: HighErrorRate
        # Aggregate both sides so the status label does not break matching.
        expr: >
          sum(rate(llm_request_total{status="error"}[5m]))
          / sum(rate(llm_request_total[5m])) > 0.05
        for: 3m
        labels:
          severity: critical
        annotations:
          summary: "LLM error rate > 5%"

      - alert: CostBudgetExceeded
        expr: >
          sum(increase(llm_cost_usd_total[24h])) > 100
        labels:
          severity: warning
        annotations:
          summary: "Daily LLM cost exceeded $100"

      - alert: QualityDegradation
        expr: rag_faithfulness_score < 0.8
        for: 15m
        labels:
          severity: warning
        annotations:
          summary: "RAG faithfulness dropped below 0.8"

Drift Detection¶

A statistical approach to detecting quality drift:

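from collections import deque

import numpy as np
from scipy import stats

class DriftDetector:
    """Detects drift in a quality metric."""

    def __init__(self, window_size: int = 100, threshold: float = 0.05):
        self.window_size = window_size
        self.threshold = threshold  # p-value below which drift is flagged
        self.baseline: list[float] = []
        # Bounded buffer: keeps only the most recent window_size scores.
        self.current: deque[float] = deque(maxlen=window_size)

    def set_baseline(self, scores: list[float]):
        """Set the reference distribution (right after a deploy or update)."""
        self.baseline = list(scores)

    def add_score(self, score: float) -> dict | None:
        """Add a score and test the current window for drift."""
        self.current.append(score)

        # A baseline and a full window are both required before testing.
        if not self.baseline or len(self.current) < self.window_size:
            return None

        window = list(self.current)
        # Two-sample Kolmogorov-Smirnov test: baseline vs. recent window.
        stat, p_value = stats.ks_2samp(self.baseline, window)

        if p_value < self.threshold:
            baseline_mean = float(np.mean(self.baseline))
            current_mean = float(np.mean(window))
            return {
                "drift_detected": True,
                "p_value": float(p_value),
                "baseline_mean": baseline_mean,
                "current_mean": current_mean,
                "delta": current_mean - baseline_mean,
            }
        return None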
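
For intuition about what the detector tests, the statistic behind `scipy.stats.ks_2samp` is the largest vertical gap between the two empirical CDFs. The standard-library sketch below computes only that statistic, not the p-value; `ks_statistic` is an illustrative name and is not meant to replace the SciPy call.

```python
from bisect import bisect_right


def ks_statistic(baseline: list, window: list) -> float:
    """Largest absolute gap between the empirical CDFs of two samples."""
    if not baseline or not window:
        raise ValueError("both samples must be non-empty")
    a, b = sorted(baseline), sorted(window)
    gap = 0.0
    # The maximum gap is always reached at one of the observed values.
    for x in a + b:
        cdf_a = bisect_right(a, x) / len(a)   # share of baseline <= x
        cdf_b = bisect_right(b, x) / len(b)   # share of window <= x
        gap = max(gap, abs(cdf_a - cdf_b))
    return gap


same = ks_statistic([0.9, 0.92, 0.95], [0.9, 0.92, 0.95])
shifted = ks_statistic([0.9, 0.92, 0.95], [0.5, 0.6, 0.7])
partial = ks_statistic([1, 2, 3, 4], [3, 4, 5, 6])
```

A statistic near 0 means the recent window looks like the baseline, and a value near 1 means the two score distributions barely overlap.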

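Implementation: Python Decorator Pattern¶

LLM Call Tracing Decorator¶

import time
import functools
import uuid
from opentelemetry import trace
from opentelemetry.trace import StatusCode

# Prometheus metrics defined in metrics.py (see the dashboard section)
from metrics import llm_cost, llm_latency, llm_request_total

tracer = trace.get_tracer("llm-app")

# Per-model pricing (USD per 1K tokens). Snapshot values; check the
# provider's current price list before relying on them.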
PRICING = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
    "claude-3.5-sonnet": {"input": 0.003, "output": 0.015},
}

def trace_llm_call(model: str | None = None, feature: str = "default"):
    """Decorator that traces an LLM call.

    Usage:
        @trace_llm_call(model="gpt-4o", feature="rag")
        async def generate_answer(query, context):
            ...
    """
    def decorator(func):
        @functools.wraps(func)
        async def wrapper(*args, **kwargs):
            span_name = f"llm.{func.__name__}"

            with tracer.start_as_current_span(span_name) as span:
                span.set_attribute("llm.model", model or "unknown")
                span.set_attribute("llm.feature", feature)
                span.set_attribute("llm.call_id", str(uuid.uuid4()))

                start_time = time.perf_counter()

                try:
                    result = await func(*args, **kwargs)
                    elapsed = time.perf_counter() - start_time

                    # Extract metrics from the response (OpenAI-style usage object)
                    if hasattr(result, 'usage'):
                        input_tokens = result.usage.prompt_tokens
                        output_tokens = result.usage.completion_tokens

                        span.set_attribute("llm.input_tokens", input_tokens)
                        span.set_attribute("llm.output_tokens", output_tokens)
                        span.set_attribute("llm.total_tokens", 
                                         input_tokens + output_tokens)

                        # Compute the cost of this call
                        if model in PRICING:
                            cost = (
                                input_tokens / 1000 * PRICING[model]["input"]
                                + output_tokens / 1000 * PRICING[model]["output"]
                            )
                            span.set_attribute("llm.cost_usd", cost)

                            # Prometheus metrics
                            llm_cost.labels(
                                model=model, team="default", feature=feature
                            ).inc(cost)

                    span.set_attribute("llm.latency_ms", elapsed * 1000)
                    llm_latency.labels(
                        model=model, endpoint=func.__name__
                    ).observe(elapsed)

                    llm_request_total.labels(
                        model=model, endpoint=func.__name__, status="success"
                    ).inc()

                    span.set_status(StatusCode.OK)
                    return result

                except Exception as e:
                    span.set_status(StatusCode.ERROR, str(e))
                    span.record_exception(e)
                    llm_request_total.labels(
                        model=model, endpoint=func.__name__, status="error"
                    ).inc()
                    raise

        return wrapper
    return decorator


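# Usage example
from openai import AsyncOpenAI

openai_client = AsyncOpenAI()  # reads OPENAI_API_KEY from the environment
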
@trace_llm_call(model="gpt-4o", feature="rag")
async def generate_rag_answer(query: str, context: str):
    response = await openai_client.chat.completions.create(
        model="gpt-4o",
        messages=[
            {"role": "system", "content": f"Context: {context}"},
            {"role": "user", "content": query},
        ],
        temperature=0.1,
    )
    return response
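
As a worked example of the cost formula used in the decorator, the sketch below prices the two chat-model calls from the example trace (intent classification on gpt-4o-mini and generation on gpt-4o). `estimate_cost_usd` is a hypothetical helper, the prices are copied from the `PRICING` table and may be out of date, and embedding and reranking costs are left out.

```python
# USD per 1K tokens, copied from the PRICING table (snapshot values).
PRICING_PER_1K = {
    "gpt-4o": {"input": 0.0025, "output": 0.01},
    "gpt-4o-mini": {"input": 0.00015, "output": 0.0006},
}


def estimate_cost_usd(model: str, input_tokens: int, output_tokens: int):
    """Return the call cost in USD, or None when the model has no price entry."""
    price = PRICING_PER_1K.get(model)
    if price is None:
        return None
    return (input_tokens / 1000 * price["input"]
            + output_tokens / 1000 * price["output"])


# (model, input_tokens, output_tokens) from the example trace.
trace_calls = [
    ("gpt-4o-mini", 150, 5),    # intent_classification
    ("gpt-4o", 2500, 350),      # llm_generation
]
per_call = [estimate_cost_usd(*call) for call in trace_calls]
total_cost = sum(per_call)
```

Under these prices the generation call dominates: about $0.00975 of a total of roughly $0.0098 per request, which is why prompt size and model choice are the first levers to look at.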

Cost Monitoring¶

Cost Tracking Dimensions¶

Cost aggregation dimensions:
├── By model: gpt-4o, claude-3.5-sonnet, ...
├── By team: data-team, product-team, ...
├── By feature: rag, sql-gen, classification, ...
├── By environment: production, staging, development
└── By time: hour, day, week, month

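Generating Cost Reports¶

import numpy as np


class CostReporter:
    """Generates LLM cost reports."""

    def __init__(self, db):
        # db: assumed to expose execute(sql, params) and return rows with
        # attribute access; adapt this to your driver or ORM.
        self.db = db

    def daily_report(self, date: str) -> dict:
        """Daily cost report."""
        query = """
        SELECT 
            model,
            feature,
            team,
            COUNT(*) as request_count,
            SUM(input_tokens) as total_input_tokens,
            SUM(output_tokens) as total_output_tokens,
            SUM(cost_usd) as total_cost
        FROM llm_traces
        WHERE DATE(created_at) = :date
        GROUP BY model, feature, team
        ORDER BY total_cost DESC
        """
        # Materialize once: the rows are iterated several times below.
        rows = list(self.db.execute(query, {"date": date}))

        # _group_by and _top_calls are helper methods not shown here.
        return {
            "date": date,
            "total_cost": sum(r.total_cost for r in rows),
            "by_model": self._group_by(rows, "model"),
            "by_feature": self._group_by(rows, "feature"),
            "by_team": self._group_by(rows, "team"),
            "top_expensive_calls": self._top_calls(date, limit=10),
        }

    def cost_forecast(self, days: int = 30) -> dict:
        """Project cost over `days` from the last 7 days of daily costs."""
        # _get_recent_daily_costs is a helper method not shown here.
        recent = self._get_recent_daily_costs(7)
        if not recent:
            return {"avg_daily_cost": 0.0, "projected_monthly": 0.0,
                    "trend": "unknown"}
        avg_daily = float(np.mean(recent))

        return {
            "avg_daily_cost": avg_daily,
            "projected_monthly": avg_daily * days,
            # Coarse signal: compares only the first and last day.
            "trend": "increasing" if recent[-1] > recent[0] else "stable",
        }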
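
To try the daily report query without a production database, the sketch below runs the same SQL against an in-memory SQLite table. The `llm_traces` schema and the sample rows are assumptions made for illustration; only the column names used by the query come from the report code.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.row_factory = sqlite3.Row  # allows access by column name
conn.execute("""
    CREATE TABLE llm_traces (
        model TEXT, feature TEXT, team TEXT,
        input_tokens INTEGER, output_tokens INTEGER,
        cost_usd REAL, created_at TEXT
    )
""")
# Hypothetical sample traces; the last row falls on a different day.
conn.executemany(
    "INSERT INTO llm_traces VALUES (?, ?, ?, ?, ?, ?, ?)",
    [
        ("gpt-4o", "rag", "data-team", 2500, 350, 0.00975, "2026-03-24 10:00:00"),
        ("gpt-4o", "rag", "data-team", 2000, 300, 0.008, "2026-03-24 11:30:00"),
        ("gpt-4o-mini", "classification", "product-team", 150, 5, 0.0000255,
         "2026-03-24 12:00:00"),
        ("gpt-4o", "rag", "data-team", 1000, 100, 0.0035, "2026-03-25 09:00:00"),
    ],
)


def daily_cost_rows(conn: sqlite3.Connection, date: str) -> list:
    """Aggregate one day's traces by model, feature, and team."""
    return conn.execute(
        """
        SELECT model, feature, team,
               COUNT(*) AS request_count,
               SUM(input_tokens) AS total_input_tokens,
               SUM(output_tokens) AS total_output_tokens,
               SUM(cost_usd) AS total_cost
        FROM llm_traces
        WHERE DATE(created_at) = :date
        GROUP BY model, feature, team
        ORDER BY total_cost DESC
        """,
        {"date": date},
    ).fetchall()


rows = daily_cost_rows(conn, "2026-03-24")

# Roll the grouped rows up by model, as a by_model breakdown would.
by_model = {}
for row in rows:
    by_model[row["model"]] = by_model.get(row["model"], 0.0) + row["total_cost"]
```

Note that `DATE(created_at)` prevents index use on `created_at` in most databases, so a range predicate on the timestamp may be preferable for large trace tables.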

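Cost Optimization Checklist¶

Item	Method	Estimated savings
Model downgrade	Use a mini model for simple tasks	50-90%
Prompt compression	Remove unnecessary instructions	10-30%
Caching	Cache identical queries	20-60%
Batch processing	Use the Batch API	50%
Token limits	Set max_tokens	10-20%
Routing	Route to models by query complexity	30-50%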

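Production Checklist¶

Items to verify before rollout:

Level	Item	Done
Basic	Tracing applied to every LLM call
Basic	Input/output logging (with PII masking)
Basic	Latency, token, and error metrics collected
Intermediate	Cost monitoring + budget alerts
Intermediate	RAG quality metrics (faithfulness, relevance)
Intermediate	Dashboards configured
Advanced	Automated evaluation pipeline (daily)
Advanced	Drift detection
Advanced	A/B testing infrastructure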

References¶

Resource	Link
OpenTelemetry Semantic Conventions for GenAI	https://opentelemetry.io/docs/specs/semconv/gen-ai/
RAGAS Documentation	https://docs.ragas.io/
Langfuse Self-hosting Guide	https://langfuse.com/docs/deployment/self-host
OpenLLMetry	https://github.com/traceloop/openllmetry

Last updated: 2026-03-25