Survival Analysis Overview¶

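Survival analysis is a family of statistical methods for analyzing the time until an event occurs. Its defining challenge is handling censored observations correctly, that is, subjects whose event time is only partially known. It is used in medicine, customer churn analysis, equipment failure prediction, and similar settings.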

Core Concepts¶

Survival Function¶

\[S(t) = P(T > t) = 1 - F(t)\]

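The probability of surviving beyond time \(t\), where \(F(t) = P(T \leq t)\) is the cumulative distribution function of the event time \(T\).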

Hazard Function¶

\[h(t) = \lim_{\Delta t \to 0} \frac{P(t \leq T < t + \Delta t | T \geq t)}{\Delta t} = \frac{f(t)}{S(t)}\]

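The instantaneous event rate at time \(t\), conditional on having survived up to \(t\). Here \(f(t)\) is the density of \(T\).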

Cumulative Hazard Function¶

\[H(t) = \int_0^t h(u) du = -\ln S(t)\]
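
The identity \(H(t) = -\ln S(t)\) follows from the definitions above in a few steps. A short derivation, assuming \(T\) is continuous with density \(f(t) = F'(t)\) and \(S(0) = 1\):

\[\begin{aligned} h(t) &= \frac{f(t)}{S(t)} = \frac{-S'(t)}{S(t)} = -\frac{d}{dt}\ln S(t), \\ H(t) &= \int_0^t h(u)\,du = -\ln S(t) + \ln S(0) = -\ln S(t), \\ S(t) &= \exp\bigl(-H(t)\bigr). \end{aligned}\]

Any one of \(S(t)\), \(h(t)\), and \(H(t)\) therefore determines the other two.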

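Censoring¶

Type	Description	Example
Right Censoring	The event has not occurred by the end of observation	Still alive when the study ends
Left Censoring	The event occurred before observation began, so only an upper bound on its time is known	Unknown time of infection
Interval Censoring	The event is known to fall within an interval, but the exact time is unknown	Periodic checkups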
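
Right censoring is the case the snippets in this document rely on, through a `duration` column and an `event` column. As a minimal sketch of how such columns might be derived, the example below assumes hypothetical `start_date` and `event_date` columns and an assumed study cutoff date; none of these names come from a real dataset.

```python
import pandas as pd

# Hypothetical enrollment records; 'event_date' is NaT when no event was seen.
records = pd.DataFrame({
    "start_date": pd.to_datetime(["2023-01-01", "2023-02-01", "2023-03-01"]),
    "event_date": pd.to_datetime(["2023-04-01", None, "2023-03-31"]),
})
study_end = pd.Timestamp("2023-06-30")  # assumed administrative cutoff

# event = 1 if the event was observed by the cutoff, 0 if right-censored
observed = records["event_date"].notna() & (records["event_date"] <= study_end)
records["event"] = observed.astype(int)

# Follow-up ends at the event date when observed, otherwise at the cutoff
end = records["event_date"].where(records["event"] == 1, study_end)
records["duration"] = (end - records["start_date"]).dt.days

print(records[["duration", "event"]])
```

The censored row keeps its full follow-up time with `event = 0`; dropping it or treating the cutoff as an event time would bias the estimates.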

Taxonomy of Methods¶

Survival Analysis
├── Non-parametric
│   ├── Kaplan-Meier Estimator
│   ├── Nelson-Aalen Estimator
│   └── Log-rank Test
├── Semi-parametric
│   ├── Cox Proportional Hazards (Cox PH)
│   └── Stratified Cox
├── Parametric
│   ├── Exponential
│   ├── Weibull
│   ├── Log-normal
│   └── Accelerated Failure Time (AFT)
├── Machine Learning
│   ├── Random Survival Forest
│   ├── Gradient Boosting Survival
│   └── DeepSurv (Deep Learning)
└── Time-varying / Competing Risks
    ├── Time-varying covariates
    └── Competing Risks (Fine-Gray)

Kaplan-Meier Estimator¶

A nonparametric estimate of the survival function:

\[\hat{S}(t) = \prod_{t_i \leq t} \left(1 - \frac{d_i}{n_i}\right)\]

where:
- \(d_i\): number of events at time \(t_i\)
- \(n_i\): size of the risk set just before time \(t_i\)
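
The product-limit formula above can be sketched directly in NumPy. This is an illustrative implementation on made-up toy data, not a replacement for a library estimator; the function name `kaplan_meier` is an assumption for this example.

```python
import numpy as np

def kaplan_meier(durations, events):
    """Return (event_times, survival) for the product-limit estimator."""
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)
    event_times = np.unique(durations[events == 1])  # distinct observed event times
    survival = []
    s = 1.0
    for t in event_times:
        n_i = np.sum(durations >= t)                    # at risk just before t
        d_i = np.sum((durations == t) & (events == 1))  # events at t
        s *= 1.0 - d_i / n_i
        survival.append(s)
    return event_times, np.array(survival)

# Toy data: 1 = event observed, 0 = right-censored
times, surv = kaplan_meier([2, 3, 3, 5, 8], [1, 1, 0, 1, 0])
for t, s in zip(times, surv):
    print(f"t={t:g}: S(t)={s:.2f}")
```

The subject censored at time 3 still counts in the risk set at \(t = 3\) but contributes no event, which is how censored observations inform the estimate without being treated as failures.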

from lifelines import KaplanMeierFitter
import matplotlib.pyplot as plt

# Assumes df has columns 'duration', 'event' (1 = observed, 0 = censored), and 'group'
kmf = KaplanMeierFitter()

# Overall survival curve
kmf.fit(durations=df['duration'], event_observed=df['event'], label='All')
ax = kmf.plot_survival_function()

# Median survival time of the whole sample (read it before kmf is refit per group)
print(f"Median survival time: {kmf.median_survival_time_}")

# Per-group curves, drawn on the same axes
for group in df['group'].unique():
    mask = df['group'] == group
    kmf.fit(df.loc[mask, 'duration'], df.loc[mask, 'event'], label=str(group))
    kmf.plot_survival_function(ax=ax)

ax.set_xlabel('Time')
ax.set_ylabel('Survival Probability')
plt.show()

Log-rank Test¶

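Tests whether the survival curves of two groups differ (null hypothesis: the two groups share the same survival function):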

from lifelines.statistics import logrank_test

group1 = df[df['treatment'] == 1]
group2 = df[df['treatment'] == 0]

result = logrank_test(
    group1['duration'], group2['duration'],
    event_observed_A=group1['event'], 
    event_observed_B=group2['event']
)
print(f"p-value: {result.p_value:.4f}")

Cox Proportional Hazards¶

\[h(t|X) = h_0(t) \exp(\beta_1 X_1 + \beta_2 X_2 + ... + \beta_p X_p)\]

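\(h_0(t)\): baseline hazard function, left unspecified (the coefficients are estimated by partial likelihood without modeling it)
\(\exp(\beta_i)\): hazard ratio for a one-unit increase in \(X_i\), holding the other covariates fixed

Assumption: proportional hazards, meaning the hazard ratio between any two subjects is constant over time.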
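
To see why \(h_0(t)\) never needs to be estimated, it may help to write out the partial likelihood: at each event time, the model compares the subject who failed against everyone still at risk, and \(h_0(t)\) cancels from that ratio. The sketch below is a minimal NumPy version that assumes no tied event times; the function name and the toy data are illustrative assumptions.

```python
import numpy as np

def neg_log_partial_likelihood(beta, X, durations, events):
    """Negative log partial likelihood of a Cox model (assumes no tied event times)."""
    X = np.asarray(X, dtype=float)
    durations = np.asarray(durations, dtype=float)
    events = np.asarray(events, dtype=int)
    eta = X @ np.asarray(beta, dtype=float)  # linear predictor beta^T x
    total = 0.0
    for i in np.flatnonzero(events == 1):
        at_risk = durations >= durations[i]  # risk set at the i-th event time
        # log of exp(eta_i) / sum over the risk set of exp(eta_j); h_0(t) has cancelled
        total += eta[i] - np.log(np.sum(np.exp(eta[at_risk])))
    return -total

# Toy data: one binary covariate, 1 = event observed, 0 = censored
X = np.array([[0.0], [1.0], [0.0], [1.0]])
durations = [5, 3, 8, 2]
events = [1, 1, 0, 1]
value_at_zero = neg_log_partial_likelihood([0.0], X, durations, events)
print(f"-log PL at beta=0: {value_at_zero:.4f}")
```

At \(\beta = 0\) every subject in a risk set is equally likely to fail, so the value reduces to the sum of the log risk-set sizes. Minimizing this function over \(\beta\) gives the Cox estimate; libraries additionally handle ties and numerical stability.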

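from lifelines import CoxPHFitter

# Assumes every column of df other than 'duration' and 'event' is a numeric covariate
cph = CoxPHFitter()
cph.fit(df, duration_col='duration', event_col='event')

# Summary table (coefficients, hazard ratios, confidence intervals, p-values)
cph.print_summary()

# Hazard ratios, i.e. exp(coef)
print(cph.hazard_ratios_)

# Predicted survival curves for the first five rows
survival_func = cph.predict_survival_function(df.iloc[:5])
survival_func.plot()

# Check the proportional hazards assumption
cph.check_assumptions(df, show_plots=True)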

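Interpreting the Hazard Ratio¶

HR	Interpretation
HR = 1	No effect
HR > 1	Increased hazard (events tend to occur sooner)
HR < 1	Protective effect (events tend to occur later)

Example: HR = 2.0 means the instantaneous risk of the event is twice as high at every time point, holding the other covariates fixed.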
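
A hazard ratio acts on the hazard, not directly on survival probabilities or survival times. Under proportional hazards, \(H_1(t) = \mathrm{HR} \cdot H_0(t)\), so \(S_1(t) = S_0(t)^{\mathrm{HR}}\). The sketch below illustrates this with an assumed coefficient and assumed baseline survival values; the numbers are made up for illustration.

```python
import numpy as np

beta = np.log(2.0)  # assumed coefficient, chosen so that HR = 2.0
hr = np.exp(beta)

# Assumed baseline survival probabilities at three time points
s0 = np.array([0.95, 0.80, 0.60])

# Under proportional hazards, S1(t) = S0(t) ** HR
s1 = s0 ** hr

print(f"HR = {hr:.1f}")
print("baseline survival:", s0)
print("survival with HR applied:", s1.round(4))
```

Doubling the hazard lowers survival from 0.80 to 0.64 at the second time point; it does not halve the survival probability.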

Parametric Models¶

Weibull Model¶

\[h(t) = \lambda \rho t^{\rho - 1}\]

\(\rho > 1\): hazard increases over time (aging)
\(\rho < 1\): hazard decreases over time (burn-in)
\(\rho = 1\): constant hazard (reduces to the Exponential model)
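
The three regimes can be checked numerically. The sketch below uses the rate parameterization of the formula above, for which \(H(t) = \lambda t^{\rho}\) and \(S(t) = \exp(-\lambda t^{\rho})\); the parameter values are arbitrary assumptions.

```python
import numpy as np

def weibull_hazard(t, lam, rho):
    # h(t) = lam * rho * t**(rho - 1)
    return lam * rho * t ** (rho - 1)

def weibull_survival(t, lam, rho):
    # S(t) = exp(-H(t)) with H(t) = lam * t**rho
    return np.exp(-lam * t ** rho)

t = np.array([0.5, 1.0, 2.0, 4.0])
for rho in (0.5, 1.0, 2.0):
    h = weibull_hazard(t, lam=0.1, rho=rho)
    if h[-1] > h[0]:
        trend = "increasing"
    elif h[-1] < h[0]:
        trend = "decreasing"
    else:
        trend = "constant"
    print(f"rho={rho}: hazard is {trend}, h(t)={h.round(4)}")
```

Note that lifelines reports `lambda_` as a scale parameter, so fitted values from `WeibullFitter` are not directly comparable to the rate \(\lambda\) used here without converting between the two parameterizations.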

from lifelines import WeibullFitter, WeibullAFTFitter

# Univariate Weibull
wf = WeibullFitter()
wf.fit(df['duration'], df['event'])
wf.plot_survival_function()

# Weibull AFT (Accelerated Failure Time)
aft = WeibullAFTFitter()
aft.fit(df, duration_col='duration', event_col='event')
aft.print_summary()

Accelerated Failure Time (AFT)¶

Covariates act on the time scale itself rather than on the hazard:

\[\log T = \beta_0 + \beta_1 X_1 + ... + \sigma \epsilon\]

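Interpretation: \(\exp(\beta_i)\) is the factor by which survival time is stretched (greater than 1) or shortened (less than 1) per unit increase in \(X_i\).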
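
As a small check of this interpretation, the sketch below uses a Weibull AFT model, which corresponds to taking \(\epsilon\) as a standard minimum extreme-value error, and compares median survival times at \(X_1 = 1\) and \(X_1 = 0\). The parameter values are assumed for illustration, not fitted.

```python
import numpy as np

# Assumed illustrative parameters, not fitted values
beta0, beta1, sigma = 2.0, 0.5, 0.8

def weibull_aft_median(x):
    # With a standard minimum extreme-value error,
    # S(t | x) = exp(-(t / exp(mu)) ** (1 / sigma)) where mu = beta0 + beta1 * x,
    # so the median is exp(mu) * ln(2) ** sigma.
    mu = beta0 + beta1 * x
    return np.exp(mu) * np.log(2.0) ** sigma

ratio = weibull_aft_median(1.0) / weibull_aft_median(0.0)
print(f"median(x=1) / median(x=0) = {ratio:.4f}")
print(f"exp(beta1)                = {np.exp(beta1):.4f}")
```

The ratio of medians equals \(\exp(\beta_1)\) regardless of \(\beta_0\) and \(\sigma\), and the same factor applies to every other quantile of the survival time.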

Machine Learning for Survival¶

Random Survival Forest¶

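from sklearn.model_selection import train_test_split
from sksurv.ensemble import RandomSurvivalForest
from sksurv.util import Surv

# Build the structured (event, time) array that scikit-survival expects
y = Surv.from_dataframe('event', 'duration', df)
# Assumes every remaining column of df is a numeric feature
X = df.drop(columns=['event', 'duration'])
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.25, random_state=42)

rsf = RandomSurvivalForest(
    n_estimators=100,
    min_samples_split=10,
    min_samples_leaf=5,
    random_state=42
)
rsf.fit(X_train, y_train)

# C-index on held-out data
score = rsf.score(X_test, y_test)
print(f"C-index: {score:.4f}")

# Predicted survival functions (one step function per test row)
surv_funcs = rsf.predict_survival_function(X_test)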

DeepSurv¶

A neural-network version of Cox PH, in which a network replaces the linear predictor \(\beta^\top X\):

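from pycox.models import CoxPH
import torchtuples as tt

# Assumes X_train / X_test are float32 NumPy arrays, and that
# durations_train / events_train are 1-D float32 arrays of matching length
net = tt.practical.MLPVanilla(
    in_features=X_train.shape[1],
    num_nodes=[32, 32],
    out_features=1,
    batch_norm=True,
    dropout=0.1
)

model = CoxPH(net, tt.optim.Adam)
model.fit(X_train, (durations_train, events_train), epochs=100, verbose=True)

# The network only learns the log-risk; estimate the baseline hazard before predicting survival
model.compute_baseline_hazards()

# Predicted survival curves (rows = time points, columns = test samples)
surv = model.predict_surv_df(X_test)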

평가 지표¶

Concordance Index (C-index)¶

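Measures how well the predicted risk ranking agrees with the observed ordering of event times: among comparable pairs, the fraction in which the subject with the shorter time received the higher predicted risk. A value of 0.5 corresponds to random ranking and 1.0 to perfect ranking.

\[C = \frac{\sum_{i,j} \delta_i \cdot \mathbf{1}[T_i < T_j] \cdot \mathbf{1}[\hat{r}_i > \hat{r}_j]}{\sum_{i,j} \delta_i \cdot \mathbf{1}[T_i < T_j]}\]

Here \(\hat{r}_i\) is the predicted risk score and \(\delta_i\) is the event indicator (1 if the event was observed, 0 if censored), so a pair counts only when the shorter time is an observed event.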

from lifelines.utils import concordance_index

c_index = concordance_index(
    df['duration'],
    -cph.predict_partial_hazard(df),  # negated: concordance_index expects scores that increase with survival time
    df['event']
)
print(f"C-index: {c_index:.4f}")
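
The pairwise definition can also be sketched directly, which makes the handling of censoring explicit: a pair is comparable only when the shorter time is an observed event. This is an illustrative \(O(n^2)\) version on made-up data that counts tied risk scores as half-concordant and ignores tied times; the function name `harrell_c` is an assumption for this example.

```python
import numpy as np

def harrell_c(durations, risk_scores, events):
    """Harrell's C: share of comparable pairs ranked correctly (risk ties count 0.5)."""
    durations = np.asarray(durations, dtype=float)
    risk_scores = np.asarray(risk_scores, dtype=float)
    events = np.asarray(events, dtype=int)
    concordant, comparable = 0.0, 0
    n = len(durations)
    for i in range(n):
        if events[i] != 1:
            continue  # the shorter time of a comparable pair must be an observed event
        for j in range(n):
            if durations[i] < durations[j]:
                comparable += 1
                if risk_scores[i] > risk_scores[j]:
                    concordant += 1.0
                elif risk_scores[i] == risk_scores[j]:
                    concordant += 0.5
    return concordant / comparable

# Toy data: higher risk score should mean shorter survival
c = harrell_c([2, 4, 6, 8], [0.9, 0.3, 0.5, 0.1], [1, 1, 0, 1])
print(f"C-index: {c:.4f}")
```

Unlike `concordance_index` from lifelines, this sketch takes risk scores directly (higher means shorter expected survival), so no sign flip is needed.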

Brier Score¶

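import numpy as np
from sksurv.metrics import brier_score

# brier_score needs an (n_samples, n_times) array of survival probabilities at `times` (chosen within the follow-up range)
preds = np.asarray([[fn(t) for t in times] for fn in surv_funcs])
times, brier_scores = brier_score(y_train, y_test, preds, times)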

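Applications¶

Field	Event	Example
Medicine	Death, recurrence	Clinical trials
Marketing	Churn	Customer churn
Reliability engineering	Failure	Equipment lifetime prediction
Human resources	Resignation	Employee attrition
Finance	Delinquency, default	Credit risk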

References¶

Textbooks¶

Kleinbaum, D.G. & Klein, M. (2012). "Survival Analysis: A Self-Learning Text". Springer.
Hosmer, D.W. et al. (2008). "Applied Survival Analysis". Wiley.

Key Papers¶

Cox, D.R. (1972). "Regression Models and Life-Tables". JRSS-B.
Katzman, J.L. et al. (2018). "DeepSurv". BMC Medical Research Methodology.

Libraries¶

lifelines: https://lifelines.readthedocs.io/
scikit-survival: https://scikit-survival.readthedocs.io/
PyCox: https://github.com/havakv/pycox