Causal Inference Overview¶
Causal inference is a methodology for answering the question "Does X cause Y?" It estimates causation rather than mere correlation and is an essential tool for sound decision-making.
Theoretical Foundations¶
Correlation vs. Causation¶
Example: Simpson's Paradox
A relationship visible in the aggregated data reverses within the subgroups.
| Severity | Treatment recovery rate | Control recovery rate |
|---|---|---|
| Pooled | 50% | 60% |
| Mild | 90% | 85% |
| Severe | 30% | 25% |

Only after controlling for the confounder (severity) does the true treatment effect emerge.
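The reversal can be reproduced numerically. The counts below are hypothetical, chosen only so that the subgroup rates match the table (the pooled rates come out differently but reverse in the same way):

```python
import pandas as pd

# Hypothetical counts: treatment is given mostly to severe patients,
# so the pooled comparison is confounded by severity
df = pd.DataFrame({
    "severity":  ["mild", "mild", "severe", "severe"],
    "group":     ["treated", "control", "treated", "control"],
    "n":         [100, 400, 400, 100],
    "recovered": [90, 340, 120, 25],
})

# Recovery rate within each severity stratum: treatment wins in both
by_stratum = df.assign(rate=df["recovered"] / df["n"])
rates = by_stratum.set_index(["severity", "group"])["rate"]

# Pooled recovery rate: control appears to win
pooled = df.groupby("group")[["n", "recovered"]].sum()
pooled["rate"] = pooled["recovered"] / pooled["n"]
```

Within each stratum the treated rate is higher, yet the pooled treated rate is lower, exactly the pattern in the table above.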
Potential Outcomes Framework (Rubin Causal Model)¶
Key concepts:

- \(Y_i(1)\): potential outcome for unit \(i\) under treatment
- \(Y_i(0)\): potential outcome for unit \(i\) without treatment
- \(T_i \in \{0, 1\}\): treatment indicator
- \(Y_i^{obs} = T_i \cdot Y_i(1) + (1 - T_i) \cdot Y_i(0)\): observed outcome

Individual Treatment Effect (ITE):

\(\tau_i = Y_i(1) - Y_i(0)\)

Fundamental Problem of Causal Inference:

\(Y_i(1)\) and \(Y_i(0)\) can never both be observed for the same unit.

Average Treatment Effect (ATE):

\(\text{ATE} = \mathbb{E}[Y(1) - Y(0)]\)

Average Treatment Effect on the Treated (ATT):

\(\text{ATT} = \mathbb{E}[Y(1) - Y(0) \mid T = 1]\)
References:

- Rubin, D.B. (1974). "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies". Journal of Educational Psychology.
- Holland, P.W. (1986). "Statistics and Causal Inference". JASA.
Causal Graphs (Structural Causal Models)¶
Pearl's Structural Causal Model (SCM):

Each variable is determined by its causal parents plus an exogenous noise term: \(X_j := f_j(\text{PA}_j, U_j)\).

DAG (Directed Acyclic Graph): the causal relationships are encoded as directed edges in an acyclic graph.

Identifying causal effects:

- \(P(Y \mid X=x)\): conditional probability (observation)
- \(P(Y \mid do(X=x))\): interventional probability (causation)

Backdoor Criterion:

If \(Z\) satisfies the backdoor criterion relative to \((X, Y)\), the causal effect is identified by the adjustment formula:

\(P(Y \mid do(X=x)) = \sum_z P(Y \mid X=x, Z=z) \, P(Z=z)\)
Reference:

- Pearl, J. (2009). "Causality: Models, Reasoning, and Inference" (2nd ed.). Cambridge University Press.
Identification Assumptions¶
Core Assumptions¶
| Assumption | Meaning | Testable? |
|---|---|---|
| SUTVA | One unit's treatment does not affect other units | Partially |
| Ignorability | \((Y(0), Y(1)) \perp T \mid X\) | No (assumption) |
| Overlap (Positivity) | \(0 < P(T=1 \mid X) < 1\) | Yes |
| Consistency | \(Y = Y(T)\) | No (assumption) |

Ignorability (Unconfoundedness):

Conditional on the observed covariates \(X\), treatment assignment is independent of the potential outcomes:

\((Y(0), Y(1)) \perp T \mid X\)

"Selection on observables": the assumption that there are no unobserved confounders.
Algorithm Taxonomy¶
Causal Inference Methods
├── Experimental
│ └── Randomized Controlled Trial (RCT)
├── Quasi-experimental
│ ├── Difference-in-Differences (DiD)
│ ├── Regression Discontinuity (RDD)
│ ├── Instrumental Variables (IV)
│ └── Synthetic Control
├── Observational (Adjustment)
│ ├── Regression Adjustment
│ ├── Matching
│ │ ├── Exact Matching
│ │ ├── Propensity Score Matching
│ │ ├── Coarsened Exact Matching
│ │ └── Genetic Matching
│ ├── Weighting
│ │ ├── Inverse Probability Weighting (IPW)
│ │ ├── Augmented IPW (AIPW)
│ │ └── Entropy Balancing
│ └── Stratification
├── Machine Learning for Causal Inference
│ ├── Double Machine Learning (DML)
│ ├── Causal Forests
│ ├── CATE Estimation (X-learner, T-learner, S-learner)
│ ├── Targeted Maximum Likelihood (TMLE)
│ └── Bayesian Additive Regression Trees (BART)
└── Sensitivity Analysis
├── Rosenbaum Bounds
├── E-value
└── Confounding Function
Randomized Controlled Trial (RCT)¶
Gold Standard¶
Random assignment removes the confounding problem: under randomization, \(\mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0]\) identifies the ATE.

What randomization buys:

- Balance of all confounders, observed and unobserved
- Guarantees \((Y(0), Y(1)) \perp T\)
A/B Test Analysis¶
```python
import numpy as np
from scipy import stats

def ab_test_analysis(control, treatment, alpha=0.05):
    """A/B test analysis using Welch's t-test."""
    # Basic statistics
    n_c, n_t = len(control), len(treatment)
    mean_c, mean_t = control.mean(), treatment.mean()
    var_c, var_t = control.var(ddof=1), treatment.var(ddof=1)

    # ATE estimate
    ate = mean_t - mean_c

    # Standard error (Welch's t-test)
    se = np.sqrt(var_c / n_c + var_t / n_t)

    # Welch-Satterthwaite degrees of freedom
    df = (var_c / n_c + var_t / n_t) ** 2 / (
        (var_c / n_c) ** 2 / (n_c - 1) + (var_t / n_t) ** 2 / (n_t - 1)
    )

    # Confidence interval
    t_crit = stats.t.ppf(1 - alpha / 2, df=df)
    ci_lower = ate - t_crit * se
    ci_upper = ate + t_crit * se

    # Two-sided p-value
    t_stat = ate / se
    p_value = 2 * stats.t.sf(abs(t_stat), df=df)

    # Effect size (Cohen's d)
    pooled_std = np.sqrt(((n_c - 1) * var_c + (n_t - 1) * var_t) / (n_c + n_t - 2))
    cohens_d = ate / pooled_std

    return {
        'ate': ate,
        'se': se,
        'ci': (ci_lower, ci_upper),
        'p_value': p_value,
        'cohens_d': cohens_d,
        'significant': p_value < alpha,
    }
```
Observational Methods¶
Propensity Score Methods¶
Propensity Score:

\(e(X) = P(T = 1 \mid X)\)

Key property (Rosenbaum & Rubin, 1983):

If ignorability holds conditional on \(X\), it also holds conditional on \(e(X)\):

\((Y(0), Y(1)) \perp T \mid e(X)\)
Propensity Score Matching¶
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

class PropensityScoreMatching:
    def __init__(self, caliper=0.2):
        self.caliper = caliper  # allowed distance, in propensity-score SD units

    def fit(self, X, T):
        """Estimate propensity scores."""
        self.ps_model = LogisticRegression(max_iter=1000)
        self.ps_model.fit(X, T)
        self.propensity_scores = self.ps_model.predict_proba(X)[:, 1]
        self.X = X
        self.T = T
        return self

    def match(self, Y):
        """1:1 nearest-neighbor matching on the propensity score."""
        treated_idx = np.where(self.T == 1)[0]
        control_idx = np.where(self.T == 0)[0]
        ps_treated = self.propensity_scores[treated_idx].reshape(-1, 1)
        ps_control = self.propensity_scores[control_idx].reshape(-1, 1)

        # Caliper: discard matches farther apart than this threshold
        caliper_threshold = self.caliper * self.propensity_scores.std()

        nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
        nn.fit(ps_control)
        distances, indices = nn.kneighbors(ps_treated)

        # Matched pairs within the caliper
        matched_pairs = []
        for i, (dist, ctrl_idx) in enumerate(zip(distances, indices)):
            if dist[0] <= caliper_threshold:
                matched_pairs.append((treated_idx[i], control_idx[ctrl_idx[0]]))
        matched_pairs = np.array(matched_pairs)

        # Matching treated units to controls estimates the ATT, not the ATE
        Y_treated = Y[matched_pairs[:, 0]]
        Y_control = Y[matched_pairs[:, 1]]
        att = (Y_treated - Y_control).mean()
        se = (Y_treated - Y_control).std(ddof=1) / np.sqrt(len(matched_pairs))

        return {
            'att': att,
            'se': se,
            'n_matched': len(matched_pairs),
            'n_treated': len(treated_idx),
            'matched_pairs': matched_pairs,
        }
```
Inverse Probability Weighting (IPW)¶
Horvitz-Thompson estimator:

Each observation is weighted by the inverse of its treatment probability:

\(\hat{\tau}_{IPW} = \frac{1}{n} \sum_i \left[ \frac{T_i Y_i}{e(X_i)} - \frac{(1 - T_i) Y_i}{1 - e(X_i)} \right]\)
```python
import numpy as np

def ipw_estimator(Y, T, propensity_scores, clip=0.01):
    """Inverse probability weighting (normalized / Hajek form)."""
    # Clip propensity scores to avoid extreme weights
    ps_clipped = np.clip(propensity_scores, clip, 1 - clip)

    # Normalized IPW estimate of the ATE
    ate = (T * Y / ps_clipped).sum() / (T / ps_clipped).sum() - \
          ((1 - T) * Y / (1 - ps_clipped)).sum() / ((1 - T) / (1 - ps_clipped)).sum()
    return ate
```
Augmented IPW (AIPW / Doubly Robust)¶
Doubly robust: the estimator remains consistent if either the propensity model or the outcome model is correctly specified:

\(\hat{\tau}_{AIPW} = \frac{1}{n} \sum_i \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - T_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)} \right]\)

Reference:

- Robins, J.M. et al. (1994). "Estimation of Regression Coefficients When Some Regressors Are Not Always Observed". JASA.
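A minimal AIPW sketch on simulated data, assuming a correctly specified logistic propensity model and linear outcome models (the data-generating process and all numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 2))
e_true = 1 / (1 + np.exp(-X[:, 0]))                          # true propensity
T = rng.binomial(1, e_true)
Y = 2.0 * T + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # true ATE = 2

# Nuisance models: propensity and per-arm outcome regressions
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
ps = np.clip(ps, 0.01, 0.99)
mu1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)
mu0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)

# AIPW score: outcome-model difference plus IPW-weighted residual correction
aipw = (mu1 - mu0
        + T * (Y - mu1) / ps
        - (1 - T) * (Y - mu0) / (1 - ps))
ate_aipw = aipw.mean()
se_aipw = aipw.std(ddof=1) / np.sqrt(n)
```

Because the score is doubly robust, deliberately misspecifying one of the two nuisance models in this simulation still recovers an estimate close to 2.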
Quasi-experimental Methods¶
Difference-in-Differences (DiD)¶
Setup:

- Treatment and control groups
- Pre- and post-treatment periods

Key assumption: Parallel Trends, i.e. absent treatment, the two groups' outcomes would have moved in parallel:

\(\hat{\tau}_{DiD} = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})\)
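In the 2x2 case the DiD estimand is literally a difference of two differences. With hypothetical group-period means:

```python
# Hypothetical average outcomes per group-period cell
y_treat_pre, y_treat_post = 10.0, 16.0
y_ctrl_pre, y_ctrl_post = 9.0, 12.0

# DiD: change in the treated group minus change in the control group
did = (y_treat_post - y_treat_pre) - (y_ctrl_post - y_ctrl_pre)
# → (6.0) - (3.0) = 3.0
```

The control group's change (3.0) serves as the counterfactual trend for the treated group; the excess change (3.0) is attributed to the treatment.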
```python
import pandas as pd
import statsmodels.formula.api as smf

def did_analysis(df, outcome, treatment, post, entity=None):
    """Difference-in-Differences regression.

    df: DataFrame with columns [outcome, treatment, post, entity (optional)]
    """
    # DiD regression: the interaction coefficient is the DiD estimate
    formula = f'{outcome} ~ {treatment} * {post}'
    if entity:
        formula += f' + C({entity})'  # entity fixed effects
        model = smf.ols(formula, data=df).fit(
            cov_type='cluster', cov_kwds={'groups': df[entity]}
        )
    else:
        model = smf.ols(formula, data=df).fit(cov_type='HC1')  # robust SEs

    interaction_term = f'{treatment}:{post}'
    did_estimate = model.params[interaction_term]
    did_se = model.bse[interaction_term]
    did_pvalue = model.pvalues[interaction_term]

    return {
        'did_estimate': did_estimate,
        'se': did_se,
        'p_value': did_pvalue,
        'ci_lower': model.conf_int().loc[interaction_term, 0],
        'ci_upper': model.conf_int().loc[interaction_term, 1],
        'model_summary': model.summary(),
    }
```
Reference:

- Card, D. & Krueger, A.B. (1994). "Minimum Wages and Employment: A Case Study of the Fast-Food Industry". AER.
Regression Discontinuity Design (RDD)¶
Setup: treatment is assigned by a threshold (cutoff) on a running variable.

Key assumption: the potential-outcome functions are continuous at the cutoff.

Estimand and estimation (local linear regression):

\(\tau_{RDD} = \lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X = x]\)
```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rdd_analysis(X, Y, cutoff, bandwidth=None):
    """Sharp regression discontinuity via local linear regression."""
    # Center the running variable at the cutoff
    X_centered = X - cutoff
    T = (X >= cutoff).astype(int)

    # Bandwidth: Silverman-style rule of thumb as a simple default
    # (in practice, prefer a data-driven choice such as Imbens-Kalyanaraman)
    if bandwidth is None:
        bandwidth = 1.06 * X_centered.std() * len(X) ** (-1 / 5)

    # Restrict to observations near the cutoff
    local_mask = np.abs(X_centered) <= bandwidth
    X_local = X_centered[local_mask]
    Y_local = Y[local_mask]
    T_local = T[local_mask]

    # Local linear regression with a slope change at the cutoff:
    # Y = a + b*X + c*T + d*T*X + e
    design = np.column_stack([
        np.ones(len(X_local)),
        X_local,
        T_local,
        T_local * X_local,
    ])
    model = LinearRegression(fit_intercept=False)
    model.fit(design, Y_local)

    # The RDD estimate is the coefficient on T (the jump at the cutoff)
    rdd_estimate = model.coef_[2]

    # Bootstrap standard error
    n_bootstrap = 1000
    bootstrap_estimates = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(len(X_local), size=len(X_local), replace=True)
        model_boot = LinearRegression(fit_intercept=False)
        model_boot.fit(design[idx], Y_local[idx])
        bootstrap_estimates.append(model_boot.coef_[2])
    se = np.std(bootstrap_estimates)

    return {
        'rdd_estimate': rdd_estimate,
        'se': se,
        'bandwidth': bandwidth,
        'n_local': local_mask.sum(),
    }
```
References:

- Imbens, G.W. & Lemieux, T. (2008). "Regression Discontinuity Designs: A Guide to Practice". Journal of Econometrics.
- Cattaneo, M.D. et al. (2019). "A Practical Introduction to Regression Discontinuity Designs". Cambridge.
Instrumental Variables (IV)¶
Setup: an instrument \(Z\) that affects the treatment but has no direct effect on the outcome.

Assumptions:

1. Relevance: \(Cov(Z, T) \neq 0\)
2. Exclusion: \(Z\) affects \(Y\) only through \(T\)
3. Independence: \(Z \perp (Y(0), Y(1))\)

2SLS (Two-Stage Least Squares):

Stage 1: \(T = \alpha_0 + \alpha_1 Z + \nu\)

Stage 2: \(Y = \beta_0 + \beta_1 \hat{T} + \epsilon\)
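For a single instrument, the two stages collapse to the Wald estimator \(\beta_1 = Cov(Z, Y) / Cov(Z, T)\). A simulated sketch with an unobserved confounder (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000
Z = rng.binomial(1, 0.5, size=n)             # instrument
U = rng.normal(size=n)                        # unobserved confounder
T = 0.8 * Z + 0.5 * U + rng.normal(size=n)    # first stage (relevance)
Y = 1.5 * T + U + rng.normal(size=n)          # true effect = 1.5, confounded by U

# Naive OLS slope is biased upward because U moves both T and Y
naive = np.cov(T, Y)[0, 1] / np.var(T)

# Wald / 2SLS with one instrument: Cov(Z, Y) / Cov(Z, T)
iv = np.cov(Z, Y)[0, 1] / np.cov(Z, T)[0, 1]
```

The naive regression over-estimates the effect, while the IV ratio recovers something close to the true 1.5, because \(Z\) is independent of \(U\).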
```python
import pandas as pd
from linearmodels.iv import IV2SLS

def iv_analysis(df, outcome, treatment, instrument, covariates=None):
    """Instrumental variables estimation via 2SLS."""
    # linearmodels formula syntax: endogenous block in brackets
    if covariates:
        cov_str = ' + '.join(covariates)
        formula = f'{outcome} ~ 1 + {cov_str} + [{treatment} ~ {instrument}]'
    else:
        formula = f'{outcome} ~ 1 + [{treatment} ~ {instrument}]'

    model = IV2SLS.from_formula(formula, data=df).fit(cov_type='robust')

    return {
        'iv_estimate': model.params[treatment],
        'se': model.std_errors[treatment],
        'p_value': model.pvalues[treatment],
        # Rule of thumb: first-stage F > 10 to avoid weak instruments
        'first_stage_f': model.first_stage.diagnostics.loc[treatment, 'f.stat'],
        'summary': model.summary,
    }
```
Reference:

- Angrist, J.D. et al. (1996). "Identification of Causal Effects Using Instrumental Variables". JASA.
Machine Learning for Causal Inference¶
Double Machine Learning (DML)¶
Neyman-orthogonal scores remove the first-order influence of ML estimation error:

Partially Linear Model:

\(Y = \theta T + g(X) + \epsilon, \qquad T = m(X) + \nu\)

DML procedure:

1. Estimate the outcome model \(\hat{g}(X) \approx \mathbb{E}[Y \mid X]\)
2. Estimate the propensity model \(\hat{m}(X) \approx \mathbb{E}[T \mid X]\)
3. Compute residuals \(\tilde{Y} = Y - \hat{g}(X)\), \(\tilde{T} = T - \hat{m}(X)\)
4. Regress residual on residual: \(\hat{\theta} = (\tilde{T}^T \tilde{T})^{-1} \tilde{T}^T \tilde{Y}\)

Cross-fitting: sample splitting prevents overfitting bias in the nuisance estimates.
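The four steps above can be hand-rolled with scikit-learn to show what cross-fitting does. This is a minimal sketch on simulated data; the continuous treatment, the particular nuisance functions, and the random-forest learners are all illustrative choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
g = np.sin(X[:, 0]) + X[:, 1] ** 2          # nonlinear outcome nuisance
m = 0.5 * np.tanh(X[:, 0])                  # treatment nuisance
T = m + rng.normal(size=n)                  # continuous treatment
Y = 1.0 * T + g + rng.normal(size=n)        # true theta = 1.0

# Cross-fitting: nuisances are fit on one fold, residuals computed on the other
Y_res = np.zeros(n)
T_res = np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    gy = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], Y[train])
    gt = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], T[train])
    Y_res[test] = Y[test] - gy.predict(X[test])
    T_res[test] = T[test] - gt.predict(X[test])

# Final stage: residual-on-residual regression
theta = (T_res @ Y_res) / (T_res @ T_res)
```

Because the score is orthogonal, moderate errors in the two random-forest nuisances only enter the estimate at second order, which is what lets slow-converging ML learners be used here at all.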
```python
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def double_ml_analysis(X, T, Y):
    """Double Machine Learning with econml's LinearDML."""
    # Nuisance models for the outcome and the (discrete) treatment
    model_y = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
    model_t = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

    # Linear final stage (homogeneous effect), 5-fold cross-fitting
    dml = LinearDML(
        model_y=model_y,
        model_t=model_t,
        discrete_treatment=True,
        cv=5,
        random_state=42,
    )
    dml.fit(Y, T, X=X)

    ate = dml.ate(X)
    ci_lower, ci_upper = dml.ate_interval(X, alpha=0.05)

    return {
        'ate': ate,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        # dml.ate_inference(X).summary() reports standard errors and p-values
    }
```
Reference:

- Chernozhukov, V. et al. (2018). "Double/Debiased Machine Learning for Treatment and Structural Parameters". The Econometrics Journal.
Causal Forest (Heterogeneous Treatment Effects)¶
CATE (Conditional Average Treatment Effect):

\(\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]\)

Algorithm (Generalized Random Forest):

1. Splitting driven by local moment conditions
2. Honest estimation (separate samples for splitting and estimation)
3. Locally weighted estimation from the forest's neighborhood weights
```python
from econml.grf import CausalForest

def causal_forest_analysis(X, T, Y):
    """Causal forest for heterogeneous treatment effects."""
    cf = CausalForest(
        n_estimators=1000,
        min_samples_leaf=5,
        max_depth=None,
        honest=True,
        random_state=42,
    )
    # econml.grf follows the grf argument convention: fit(X, T, y)
    cf.fit(X, T, Y)

    # Individual (conditional) treatment effects with confidence intervals
    cate = cf.predict(X)
    cate_lower, cate_upper = cf.predict_interval(X, alpha=0.05)

    # Which features drive the heterogeneity
    feature_importance = cf.feature_importances_

    return {
        'cate': cate,
        'cate_lower': cate_lower,
        'cate_upper': cate_upper,
        'ate': cate.mean(),
        'feature_importance': feature_importance,
    }
```

For a policy-tree style summary of the learned CATE, econml's SingleTreeCateInterpreter can be applied to CATE estimators such as CausalForestDML.
References:

- Wager, S. & Athey, S. (2018). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". JASA.
- Athey, S. et al. (2019). "Generalized Random Forests". Annals of Statistics.
Meta-learners (CATE Estimation)¶
| Learner | Approach | Strengths | Weaknesses |
|---|---|---|---|
| T-learner | Separate models per arm | Flexible | Splits the data, losing efficiency |
| S-learner | Single model with T as a feature | Simple | Can fail to pick up the effect |
| X-learner | T-learner → imputed effects → weighted blend | Robust to imbalanced arms | More complex |
| R-learner | Residual-on-residual regression | Theoretical guarantees | Sensitive to nuisance estimates |
| DR-learner | Doubly robust pseudo-outcomes | Efficient, robust | Sensitive to extreme propensities |
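Before reaching for a library, the T- and S-learner logic from the table can be written out directly. This sketch uses simulated data with a randomized binary treatment and a known heterogeneous effect (all numbers illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 3000
X = rng.uniform(-1, 1, size=(n, 2))
T = rng.binomial(1, 0.5, size=n)               # randomized treatment
tau = 1.0 + X[:, 0]                            # true heterogeneous effect
Y = X[:, 1] + tau * T + 0.1 * rng.normal(size=n)

# T-learner: one outcome model per arm, CATE = difference of predictions
m1 = GradientBoostingRegressor(random_state=0).fit(X[T == 1], Y[T == 1])
m0 = GradientBoostingRegressor(random_state=0).fit(X[T == 0], Y[T == 0])
cate_t = m1.predict(X) - m0.predict(X)

# S-learner: one model with T as an extra feature, CATE = f(X, 1) - f(X, 0)
XT = np.column_stack([X, T])
ms = GradientBoostingRegressor(random_state=0).fit(XT, Y)
cate_s = (ms.predict(np.column_stack([X, np.ones(n)]))
          - ms.predict(np.column_stack([X, np.zeros(n)])))
```

With abundant data in both arms the two learners agree; the table's trade-offs show up when the arms are small or imbalanced.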
```python
from econml.metalearners import TLearner, SLearner, XLearner
from sklearn.ensemble import GradientBoostingRegressor

# T-learner: separate outcome models for treated and control
t_learner = TLearner(models=GradientBoostingRegressor())
t_learner.fit(Y, T, X=X)
cate_t = t_learner.effect(X)

# S-learner: a single model with the treatment as a feature
s_learner = SLearner(overall_model=GradientBoostingRegressor())
s_learner.fit(Y, T, X=X)
cate_s = s_learner.effect(X)

# X-learner: robust to imbalanced treatment groups
x_learner = XLearner(models=GradientBoostingRegressor())
x_learner.fit(Y, T, X=X)
cate_x = x_learner.effect(X)
```
Reference:

- Kunzel, S.R. et al. (2019). "Metalearners for Estimating Heterogeneous Treatment Effects using Machine Learning". PNAS.
Sensitivity Analysis¶
Unobserved Confounding Analysis¶
Rosenbaum Bounds:

How much the effect estimate could move if treatment assignment were subject to hidden bias of a given magnitude.

E-value:

The minimum strength of association an unobserved confounder would need with both treatment and outcome to explain away the observed effect:

\(E\text{-value} = RR + \sqrt{RR \times (RR - 1)}\)
```python
import numpy as np

def calculate_e_value(point_estimate, ci_lower=None):
    """E-value calculation (VanderWeele & Ding, 2017).

    point_estimate: risk ratio (RR) or odds ratio for a rare outcome
    """
    rr = point_estimate
    if rr < 1:
        rr = 1 / rr  # protective effects: take the reciprocal first
    e_value = rr + np.sqrt(rr * (rr - 1))

    # E-value for the confidence limit closest to the null
    if ci_lower is not None and ci_lower > 1:
        e_value_ci = ci_lower + np.sqrt(ci_lower * (ci_lower - 1))
    else:
        e_value_ci = 1.0

    return {
        'e_value_point': e_value,
        'e_value_ci': e_value_ci,
    }
```
Reference:

- VanderWeele, T.J. & Ding, P. (2017). "Sensitivity Analysis in Observational Research: Introducing the E-Value". Annals of Internal Medicine.
Practical Guide¶
Method Selection Flowchart¶
Start
│
├── RCT/A/B test feasible?
│   ├── Yes → RCT
│   └── No →
│       │
│       ├── Treatment assigned by a threshold?
│       │   └── Yes → RDD
│       │
│       ├── Pre/post data plus a control group?
│       │   └── Yes → DiD
│       │
│       ├── A good instrument available?
│       │   └── Yes → IV
│       │
│       └── Observational data only?
│           │
│           ├── Ignorability credible?
│           │   ├── Yes → Matching, IPW, DML
│           │   └── No → Sensitivity analysis is essential
│           │
│           └── Interested in heterogeneous effects?
│               ├── Yes → Causal Forest, Meta-learners
│               └── No → DML, AIPW
Checklist¶

- Clarify the research question
    - ATE vs ATT vs CATE
    - Direction and interpretation of the causal effect
- Check the assumptions
    - Ignorability/Unconfoundedness
    - Overlap/Positivity
    - SUTVA
    - (DiD) Parallel Trends
    - (RDD) Continuity
    - (IV) Relevance, Exclusion
- Data diagnostics
    - Check covariate balance
    - Propensity score distributions
    - Common support region
- Sensitivity analysis
    - Compute E-values
    - Cross-check with alternative methods
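For the covariate-balance item above, the standardized mean difference (SMD) is the usual diagnostic; a common rule of thumb flags |SMD| > 0.1 as meaningful imbalance. A minimal sketch with illustrative data:

```python
import numpy as np

def standardized_mean_difference(x_treated, x_control):
    """SMD: mean difference scaled by the pooled standard deviation."""
    diff = x_treated.mean() - x_control.mean()
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return diff / pooled_sd

# Example: a covariate that is imbalanced before any adjustment
rng = np.random.default_rng(0)
x_t = rng.normal(0.5, 1.0, size=500)   # treated group shifted by 0.5 SD
x_c = rng.normal(0.0, 1.0, size=500)
smd = standardized_mean_difference(x_t, x_c)
```

In a matching or weighting workflow, the SMD of every covariate is computed before and after adjustment; adjustment should push all of them toward zero.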
Related Documents¶
| Topic | Description | Link |
|---|---|---|
| Double Machine Learning | ML-based causal inference in detail | double-machine-learning.md |
References¶
Textbooks¶
- Hernan, M.A. & Robins, J.M. (2020). "Causal Inference: What If". Chapman & Hall. (free online)
- Imbens, G.W. & Rubin, D.B. (2015). "Causal Inference for Statistics, Social, and Biomedical Sciences". Cambridge.
- Pearl, J. et al. (2016). "Causal Inference in Statistics: A Primer". Wiley.
- Angrist, J.D. & Pischke, J.S. (2009). "Mostly Harmless Econometrics". Princeton.
Key Papers¶
- Rubin, D.B. (1974). "Estimating Causal Effects". Journal of Educational Psychology.
- Rosenbaum, P.R. & Rubin, D.B. (1983). "The Central Role of the Propensity Score". Biometrika.
- Chernozhukov, V. et al. (2018). "Double/Debiased Machine Learning". The Econometrics Journal.
- Wager, S. & Athey, S. (2018). "Causal Forests". JASA.
Libraries¶
- EconML (Microsoft): https://github.com/microsoft/EconML
- DoWhy: https://github.com/py-why/dowhy
- CausalML (Uber): https://github.com/uber/causalml
- grf (R): https://github.com/grf-labs/grf