Causal Inference Overview¶
Causal inference is a methodology for answering the question "Does X cause Y?" It estimates causation rather than mere correlation and is an essential tool for sound decision-making.
Theoretical Foundations¶
Correlation vs. Causation¶
Example: Simpson's Paradox
A relationship visible in the aggregated data reverses within the subgroups.
| Severity | Treatment recovery rate | Control recovery rate |
|---|---|---|
| Pooled | 50% | 60% |
| Mild | 90% | 85% |
| Severe | 30% | 25% |

Only after controlling for the confounder (severity) does the true treatment effect emerge.
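The reversal can be reproduced numerically. The counts below are hypothetical, chosen only so that the subgroup rates match the table (the pooled rates come out differently but reverse in the same way):

```python
import pandas as pd

# Hypothetical counts: treatment is given mostly to severe patients,
# so the pooled comparison is confounded by severity
df = pd.DataFrame({
    "severity":  ["mild", "mild", "severe", "severe"],
    "group":     ["treated", "control", "treated", "control"],
    "n":         [100, 400, 400, 100],
    "recovered": [90, 340, 120, 25],
})

# Recovery rate within each severity stratum: treatment wins in both
by_stratum = df.assign(rate=df["recovered"] / df["n"])
rates = by_stratum.set_index(["severity", "group"])["rate"]

# Pooled recovery rate: control appears to win
pooled = df.groupby("group")[["n", "recovered"]].sum()
pooled["rate"] = pooled["recovered"] / pooled["n"]
```

Within each stratum the treated rate is higher, yet the pooled treated rate is lower, exactly the pattern in the table above.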
Potential Outcomes Framework (Rubin Causal Model)¶
Key concepts:

- \(Y_i(1)\): potential outcome for unit \(i\) under treatment
- \(Y_i(0)\): potential outcome for unit \(i\) without treatment
- \(T_i \in \{0, 1\}\): treatment indicator
- \(Y_i^{obs} = T_i \cdot Y_i(1) + (1 - T_i) \cdot Y_i(0)\): observed outcome

Individual Treatment Effect (ITE):

\(\tau_i = Y_i(1) - Y_i(0)\)

Fundamental Problem of Causal Inference:

\(Y_i(1)\) and \(Y_i(0)\) can never both be observed for the same unit.

Average Treatment Effect (ATE):

\(\text{ATE} = \mathbb{E}[Y(1) - Y(0)]\)

Average Treatment Effect on the Treated (ATT):

\(\text{ATT} = \mathbb{E}[Y(1) - Y(0) \mid T = 1]\)
References:

- Rubin, D.B. (1974). "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies". Journal of Educational Psychology.
- Holland, P.W. (1986). "Statistics and Causal Inference". JASA.
Causal Graphs (Structural Causal Models)¶
Pearl's Structural Causal Model (SCM):

Each variable is determined by its causal parents plus an exogenous noise term: \(X_j := f_j(\text{PA}_j, U_j)\).

DAG (Directed Acyclic Graph): the causal relationships are encoded as directed edges in an acyclic graph.

Identifying causal effects:

- \(P(Y \mid X=x)\): conditional probability (observation)
- \(P(Y \mid do(X=x))\): interventional probability (causation)

Backdoor Criterion:

If \(Z\) satisfies the backdoor criterion relative to \((X, Y)\), the causal effect is identified by the adjustment formula:

\(P(Y \mid do(X=x)) = \sum_z P(Y \mid X=x, Z=z) \, P(Z=z)\)
Reference:

- Pearl, J. (2009). "Causality: Models, Reasoning, and Inference" (2nd ed.). Cambridge University Press.
Identification Assumptions¶
Core Assumptions¶
| Assumption | Meaning | Testable? |
|---|---|---|
| SUTVA | One unit's treatment does not affect other units | Partially |
| Ignorability | \((Y(0), Y(1)) \perp T \mid X\) | No (assumption) |
| Overlap (Positivity) | \(0 < P(T=1 \mid X) < 1\) | Yes |
| Consistency | \(Y = Y(T)\) | No (assumption) |

Ignorability (Unconfoundedness):

Conditional on the observed covariates \(X\), treatment assignment is independent of the potential outcomes:

\((Y(0), Y(1)) \perp T \mid X\)

"Selection on observables": the assumption that there are no unobserved confounders.
Algorithm Taxonomy¶
Causal Inference Methods
├── Experimental
│ └── Randomized Controlled Trial (RCT)
├── Quasi-experimental
│ ├── Difference-in-Differences (DiD)
│ ├── Regression Discontinuity (RDD)
│ ├── Instrumental Variables (IV)
│ └── Synthetic Control
├── Observational (Adjustment)
│ ├── Regression Adjustment
│ ├── Matching
│ │ ├── Exact Matching
│ │ ├── Propensity Score Matching
│ │ ├── Coarsened Exact Matching
│ │ └── Genetic Matching
│ ├── Weighting
│ │ ├── Inverse Probability Weighting (IPW)
│ │ ├── Augmented IPW (AIPW)
│ │ └── Entropy Balancing
│ └── Stratification
├── Machine Learning for Causal Inference
│ ├── Double Machine Learning (DML)
│ ├── Causal Forests
│ ├── CATE Estimation (X-learner, T-learner, S-learner)
│ ├── Targeted Maximum Likelihood (TMLE)
│ └── Bayesian Additive Regression Trees (BART)
└── Sensitivity Analysis
├── Rosenbaum Bounds
├── E-value
└── Confounding Function
Randomized Controlled Trial (RCT)¶
Gold Standard¶
Random assignment removes the confounding problem: under randomization, \(\mathbb{E}[Y \mid T=1] - \mathbb{E}[Y \mid T=0]\) identifies the ATE.

What randomization buys:

- Balance of all confounders, observed and unobserved
- Guarantees \((Y(0), Y(1)) \perp T\)
A/B Test Analysis¶
```python
import numpy as np
from scipy import stats

def ab_test_analysis(control, treatment, alpha=0.05):
    """A/B test analysis using Welch's t-test."""
    # Basic statistics
    n_c, n_t = len(control), len(treatment)
    mean_c, mean_t = control.mean(), treatment.mean()
    var_c, var_t = control.var(ddof=1), treatment.var(ddof=1)

    # ATE estimate
    ate = mean_t - mean_c

    # Standard error (Welch's t-test)
    se = np.sqrt(var_c / n_c + var_t / n_t)

    # Welch-Satterthwaite degrees of freedom
    df = (var_c / n_c + var_t / n_t) ** 2 / (
        (var_c / n_c) ** 2 / (n_c - 1) + (var_t / n_t) ** 2 / (n_t - 1)
    )

    # Confidence interval
    t_crit = stats.t.ppf(1 - alpha / 2, df=df)
    ci_lower = ate - t_crit * se
    ci_upper = ate + t_crit * se

    # Two-sided p-value
    t_stat = ate / se
    p_value = 2 * stats.t.sf(abs(t_stat), df=df)

    # Effect size (Cohen's d)
    pooled_std = np.sqrt(((n_c - 1) * var_c + (n_t - 1) * var_t) / (n_c + n_t - 2))
    cohens_d = ate / pooled_std

    return {
        'ate': ate,
        'se': se,
        'ci': (ci_lower, ci_upper),
        'p_value': p_value,
        'cohens_d': cohens_d,
        'significant': p_value < alpha,
    }
```
Observational Methods¶
Propensity Score Methods¶
Propensity Score:

\(e(X) = P(T = 1 \mid X)\)

Key property (Rosenbaum & Rubin, 1983):

If ignorability holds conditional on \(X\), it also holds conditional on \(e(X)\):

\((Y(0), Y(1)) \perp T \mid e(X)\)
Propensity Score Matching¶
```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors

class PropensityScoreMatching:
    def __init__(self, caliper=0.2):
        self.caliper = caliper  # allowed distance, in propensity-score SD units

    def fit(self, X, T):
        """Estimate propensity scores."""
        self.ps_model = LogisticRegression(max_iter=1000)
        self.ps_model.fit(X, T)
        self.propensity_scores = self.ps_model.predict_proba(X)[:, 1]
        self.X = X
        self.T = T
        return self

    def match(self, Y):
        """1:1 nearest-neighbor matching on the propensity score."""
        treated_idx = np.where(self.T == 1)[0]
        control_idx = np.where(self.T == 0)[0]
        ps_treated = self.propensity_scores[treated_idx].reshape(-1, 1)
        ps_control = self.propensity_scores[control_idx].reshape(-1, 1)

        # Caliper: discard matches farther apart than this threshold
        caliper_threshold = self.caliper * self.propensity_scores.std()

        nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
        nn.fit(ps_control)
        distances, indices = nn.kneighbors(ps_treated)

        # Matched pairs within the caliper
        matched_pairs = []
        for i, (dist, ctrl_idx) in enumerate(zip(distances, indices)):
            if dist[0] <= caliper_threshold:
                matched_pairs.append((treated_idx[i], control_idx[ctrl_idx[0]]))
        matched_pairs = np.array(matched_pairs)

        # Matching treated units to controls estimates the ATT, not the ATE
        Y_treated = Y[matched_pairs[:, 0]]
        Y_control = Y[matched_pairs[:, 1]]
        att = (Y_treated - Y_control).mean()
        se = (Y_treated - Y_control).std(ddof=1) / np.sqrt(len(matched_pairs))

        return {
            'att': att,
            'se': se,
            'n_matched': len(matched_pairs),
            'n_treated': len(treated_idx),
            'matched_pairs': matched_pairs,
        }
```
Inverse Probability Weighting (IPW)¶
Horvitz-Thompson estimator:

Each observation is weighted by the inverse of its treatment probability:

\(\hat{\tau}_{IPW} = \frac{1}{n} \sum_i \left[ \frac{T_i Y_i}{e(X_i)} - \frac{(1 - T_i) Y_i}{1 - e(X_i)} \right]\)
```python
import numpy as np

def ipw_estimator(Y, T, propensity_scores, clip=0.01):
    """Inverse probability weighting (normalized / Hajek form)."""
    # Clip propensity scores to avoid extreme weights
    ps_clipped = np.clip(propensity_scores, clip, 1 - clip)

    # Normalized IPW estimate of the ATE
    ate = (T * Y / ps_clipped).sum() / (T / ps_clipped).sum() - \
          ((1 - T) * Y / (1 - ps_clipped)).sum() / ((1 - T) / (1 - ps_clipped)).sum()
    return ate
```
Augmented IPW (AIPW / Doubly Robust)¶
Doubly robust: the estimator remains consistent if either the propensity model or the outcome model is correctly specified:

\(\hat{\tau}_{AIPW} = \frac{1}{n} \sum_i \left[ \hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i (Y_i - \hat{\mu}_1(X_i))}{\hat{e}(X_i)} - \frac{(1 - T_i)(Y_i - \hat{\mu}_0(X_i))}{1 - \hat{e}(X_i)} \right]\)

Reference:

- Robins, J.M. et al. (1994). "Estimation of Regression Coefficients When Some Regressors Are Not Always Observed". JASA.
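A minimal AIPW sketch on simulated data, assuming a correctly specified logistic propensity model and linear outcome models (the data-generating process and all numbers are illustrative):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression, LinearRegression

rng = np.random.default_rng(0)
n = 4000
X = rng.normal(size=(n, 2))
e_true = 1 / (1 + np.exp(-X[:, 0]))                          # true propensity
T = rng.binomial(1, e_true)
Y = 2.0 * T + X[:, 0] + 0.5 * X[:, 1] + rng.normal(size=n)   # true ATE = 2

# Nuisance models: propensity and per-arm outcome regressions
ps = LogisticRegression().fit(X, T).predict_proba(X)[:, 1]
ps = np.clip(ps, 0.01, 0.99)
mu1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)
mu0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)

# AIPW score: outcome-model difference plus IPW-weighted residual correction
aipw = (mu1 - mu0
        + T * (Y - mu1) / ps
        - (1 - T) * (Y - mu0) / (1 - ps))
ate_aipw = aipw.mean()
se_aipw = aipw.std(ddof=1) / np.sqrt(n)
```

Because the score is doubly robust, deliberately misspecifying one of the two nuisance models in this simulation still recovers an estimate close to 2.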
Quasi-experimental Methods¶
Difference-in-Differences (DiD)¶
Setup:

- Treatment and control groups
- Pre- and post-treatment periods

Key assumption: Parallel Trends, i.e. absent treatment, the two groups' outcomes would have moved in parallel:

\(\hat{\tau}_{DiD} = (\bar{Y}_{T,post} - \bar{Y}_{T,pre}) - (\bar{Y}_{C,post} - \bar{Y}_{C,pre})\)
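In the 2x2 case the DiD estimand is literally a difference of two differences. With hypothetical group-period means:

```python
# Hypothetical average outcomes per group-period cell
y_treat_pre, y_treat_post = 10.0, 16.0
y_ctrl_pre, y_ctrl_post = 9.0, 12.0

# DiD: change in the treated group minus change in the control group
did = (y_treat_post - y_treat_pre) - (y_ctrl_post - y_ctrl_pre)
# → (6.0) - (3.0) = 3.0
```

The control group's change (3.0) serves as the counterfactual trend for the treated group; the excess change (3.0) is attributed to the treatment.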
```python
import pandas as pd
import statsmodels.formula.api as smf

def did_analysis(df, outcome, treatment, post, entity=None):
    """Difference-in-Differences regression.

    df: DataFrame with columns [outcome, treatment, post, entity (optional)]
    """
    # DiD regression: the interaction coefficient is the DiD estimate
    formula = f'{outcome} ~ {treatment} * {post}'
    if entity:
        formula += f' + C({entity})'  # entity fixed effects
        model = smf.ols(formula, data=df).fit(
            cov_type='cluster', cov_kwds={'groups': df[entity]}
        )
    else:
        model = smf.ols(formula, data=df).fit(cov_type='HC1')  # robust SEs

    interaction_term = f'{treatment}:{post}'
    did_estimate = model.params[interaction_term]
    did_se = model.bse[interaction_term]
    did_pvalue = model.pvalues[interaction_term]

    return {
        'did_estimate': did_estimate,
        'se': did_se,
        'p_value': did_pvalue,
        'ci_lower': model.conf_int().loc[interaction_term, 0],
        'ci_upper': model.conf_int().loc[interaction_term, 1],
        'model_summary': model.summary(),
    }
```
Reference:

- Card, D. & Krueger, A.B. (1994). "Minimum Wages and Employment: A Case Study of the Fast-Food Industry". AER.
Regression Discontinuity Design (RDD)¶
Setup: treatment is assigned by a threshold (cutoff) on a running variable.

Key assumption: the potential-outcome functions are continuous at the cutoff.

Estimand and estimation (local linear regression):

\(\tau_{RDD} = \lim_{x \downarrow c} \mathbb{E}[Y \mid X = x] - \lim_{x \uparrow c} \mathbb{E}[Y \mid X = x]\)
```python
import numpy as np
from sklearn.linear_model import LinearRegression

def rdd_analysis(X, Y, cutoff, bandwidth=None):
    """Sharp regression discontinuity via local linear regression."""
    # Center the running variable at the cutoff
    X_centered = X - cutoff
    T = (X >= cutoff).astype(int)

    # Bandwidth: Silverman-style rule of thumb as a simple default
    # (in practice, prefer a data-driven choice such as Imbens-Kalyanaraman)
    if bandwidth is None:
        bandwidth = 1.06 * X_centered.std() * len(X) ** (-1 / 5)

    # Restrict to observations near the cutoff
    local_mask = np.abs(X_centered) <= bandwidth
    X_local = X_centered[local_mask]
    Y_local = Y[local_mask]
    T_local = T[local_mask]

    # Local linear regression with a slope change at the cutoff:
    # Y = a + b*X + c*T + d*T*X + e
    design = np.column_stack([
        np.ones(len(X_local)),
        X_local,
        T_local,
        T_local * X_local,
    ])
    model = LinearRegression(fit_intercept=False)
    model.fit(design, Y_local)

    # The RDD estimate is the coefficient on T (the jump at the cutoff)
    rdd_estimate = model.coef_[2]

    # Bootstrap standard error
    n_bootstrap = 1000
    bootstrap_estimates = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(len(X_local), size=len(X_local), replace=True)
        model_boot = LinearRegression(fit_intercept=False)
        model_boot.fit(design[idx], Y_local[idx])
        bootstrap_estimates.append(model_boot.coef_[2])
    se = np.std(bootstrap_estimates)

    return {
        'rdd_estimate': rdd_estimate,
        'se': se,
        'bandwidth': bandwidth,
        'n_local': local_mask.sum(),
    }
```
References:

- Imbens, G.W. & Lemieux, T. (2008). "Regression Discontinuity Designs: A Guide to Practice". Journal of Econometrics.
- Cattaneo, M.D. et al. (2019). "A Practical Introduction to Regression Discontinuity Designs". Cambridge.
Instrumental Variables (IV)¶
Setup: an instrument \(Z\) that affects the treatment but has no direct effect on the outcome.

Assumptions:

1. Relevance: \(Cov(Z, T) \neq 0\)
2. Exclusion: \(Z\) affects \(Y\) only through \(T\)
3. Independence: \(Z \perp (Y(0), Y(1))\)

2SLS (Two-Stage Least Squares):

Stage 1: \(T = \alpha_0 + \alpha_1 Z + \nu\)

Stage 2: \(Y = \beta_0 + \beta_1 \hat{T} + \epsilon\)
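For a single instrument, the two stages collapse to the Wald estimator \(\beta_1 = Cov(Z, Y) / Cov(Z, T)\). A simulated sketch with an unobserved confounder (all names and numbers are illustrative):

```python
import numpy as np

rng = np.random.default_rng(42)
n = 20000
Z = rng.binomial(1, 0.5, size=n)             # instrument
U = rng.normal(size=n)                        # unobserved confounder
T = 0.8 * Z + 0.5 * U + rng.normal(size=n)    # first stage (relevance)
Y = 1.5 * T + U + rng.normal(size=n)          # true effect = 1.5, confounded by U

# Naive OLS slope is biased upward because U moves both T and Y
naive = np.cov(T, Y)[0, 1] / np.var(T)

# Wald / 2SLS with one instrument: Cov(Z, Y) / Cov(Z, T)
iv = np.cov(Z, Y)[0, 1] / np.cov(Z, T)[0, 1]
```

The naive regression over-estimates the effect, while the IV ratio recovers something close to the true 1.5, because \(Z\) is independent of \(U\).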
```python
import pandas as pd
from linearmodels.iv import IV2SLS

def iv_analysis(df, outcome, treatment, instrument, covariates=None):
    """Instrumental variables estimation via 2SLS."""
    # linearmodels formula syntax: endogenous block in brackets
    if covariates:
        cov_str = ' + '.join(covariates)
        formula = f'{outcome} ~ 1 + {cov_str} + [{treatment} ~ {instrument}]'
    else:
        formula = f'{outcome} ~ 1 + [{treatment} ~ {instrument}]'

    model = IV2SLS.from_formula(formula, data=df).fit(cov_type='robust')

    return {
        'iv_estimate': model.params[treatment],
        'se': model.std_errors[treatment],
        'p_value': model.pvalues[treatment],
        # Rule of thumb: first-stage F > 10 to avoid weak instruments
        'first_stage_f': model.first_stage.diagnostics.loc[treatment, 'f.stat'],
        'summary': model.summary,
    }
```
Reference:

- Angrist, J.D. et al. (1996). "Identification of Causal Effects Using Instrumental Variables". JASA.
Machine Learning for Causal Inference¶
Double Machine Learning (DML)¶
Neyman-orthogonal scores remove the first-order influence of ML estimation error:

Partially Linear Model:

\(Y = \theta T + g(X) + \epsilon, \qquad T = m(X) + \nu\)

DML procedure:

1. Estimate the outcome model \(\hat{g}(X) \approx \mathbb{E}[Y \mid X]\)
2. Estimate the propensity model \(\hat{m}(X) \approx \mathbb{E}[T \mid X]\)
3. Compute residuals \(\tilde{Y} = Y - \hat{g}(X)\), \(\tilde{T} = T - \hat{m}(X)\)
4. Regress residual on residual: \(\hat{\theta} = (\tilde{T}^T \tilde{T})^{-1} \tilde{T}^T \tilde{Y}\)

Cross-fitting: sample splitting prevents overfitting bias in the nuisance estimates.
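The four steps above can be hand-rolled with scikit-learn to show what cross-fitting does. This is a minimal sketch on simulated data; the continuous treatment, the particular nuisance functions, and the random-forest learners are all illustrative choices:

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

rng = np.random.default_rng(0)
n = 2000
X = rng.normal(size=(n, 3))
g = np.sin(X[:, 0]) + X[:, 1] ** 2          # nonlinear outcome nuisance
m = 0.5 * np.tanh(X[:, 0])                  # treatment nuisance
T = m + rng.normal(size=n)                  # continuous treatment
Y = 1.0 * T + g + rng.normal(size=n)        # true theta = 1.0

# Cross-fitting: nuisances are fit on one fold, residuals computed on the other
Y_res = np.zeros(n)
T_res = np.zeros(n)
for train, test in KFold(n_splits=2, shuffle=True, random_state=0).split(X):
    gy = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], Y[train])
    gt = RandomForestRegressor(n_estimators=100, random_state=0).fit(X[train], T[train])
    Y_res[test] = Y[test] - gy.predict(X[test])
    T_res[test] = T[test] - gt.predict(X[test])

# Final stage: residual-on-residual regression
theta = (T_res @ Y_res) / (T_res @ T_res)
```

Because the score is orthogonal, moderate errors in the two random-forest nuisances only enter the estimate at second order, which is what lets slow-converging ML learners be used here at all.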
```python
from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def double_ml_analysis(X, T, Y):
    """Double Machine Learning with econml's LinearDML."""
    # Nuisance models for the outcome and the (discrete) treatment
    model_y = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
    model_t = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

    # Linear final stage (homogeneous effect), 5-fold cross-fitting
    dml = LinearDML(
        model_y=model_y,
        model_t=model_t,
        discrete_treatment=True,
        cv=5,
        random_state=42,
    )
    dml.fit(Y, T, X=X)

    ate = dml.ate(X)
    ci_lower, ci_upper = dml.ate_interval(X, alpha=0.05)

    return {
        'ate': ate,
        'ci_lower': ci_lower,
        'ci_upper': ci_upper,
        # dml.ate_inference(X).summary() reports standard errors and p-values
    }
```
Reference:

- Chernozhukov, V. et al. (2018). "Double/Debiased Machine Learning for Treatment and Structural Parameters". The Econometrics Journal.
Causal Forest (Heterogeneous Treatment Effects)¶
CATE (Conditional Average Treatment Effect):

\(\tau(x) = \mathbb{E}[Y(1) - Y(0) \mid X = x]\)

Algorithm (Generalized Random Forest):

1. Splitting driven by local moment conditions
2. Honest estimation (separate samples for splitting and estimation)
3. Locally weighted estimation from the forest's neighborhood weights
```python
from econml.grf import CausalForest

def causal_forest_analysis(X, T, Y):
    """Causal forest for heterogeneous treatment effects."""
    cf = CausalForest(
        n_estimators=1000,
        min_samples_leaf=5,
        max_depth=None,
        honest=True,
        random_state=42,
    )
    # econml.grf follows the grf argument convention: fit(X, T, y)
    cf.fit(X, T, Y)

    # Individual (conditional) treatment effects with confidence intervals
    cate = cf.predict(X)
    cate_lower, cate_upper = cf.predict_interval(X, alpha=0.05)

    # Which features drive the heterogeneity
    feature_importance = cf.feature_importances_

    return {
        'cate': cate,
        'cate_lower': cate_lower,
        'cate_upper': cate_upper,
        'ate': cate.mean(),
        'feature_importance': feature_importance,
    }
```

For a policy-tree style summary of the learned CATE, econml's SingleTreeCateInterpreter can be applied to CATE estimators such as CausalForestDML.
References:

- Wager, S. & Athey, S. (2018). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". JASA.
- Athey, S. et al. (2019). "Generalized Random Forests". Annals of Statistics.
Meta-learners (CATE Estimation)¶
| Learner | Approach | Strengths | Weaknesses |
|---|---|---|---|
| T-learner | Separate models per arm | Flexible | Splits the data, losing efficiency |
| S-learner | Single model with T as a feature | Simple | Can fail to pick up the effect |
| X-learner | T-learner → imputed effects → weighted blend | Robust to imbalanced arms | More complex |
| R-learner | Residual-on-residual regression | Theoretical guarantees | Sensitive to nuisance estimates |
| DR-learner | Doubly robust pseudo-outcomes | Efficient, robust | Sensitive to extreme propensities |
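Before reaching for a library, the T- and S-learner logic from the table can be written out directly. This sketch uses simulated data with a randomized binary treatment and a known heterogeneous effect (all numbers illustrative):

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor

rng = np.random.default_rng(1)
n = 3000
X = rng.uniform(-1, 1, size=(n, 2))
T = rng.binomial(1, 0.5, size=n)               # randomized treatment
tau = 1.0 + X[:, 0]                            # true heterogeneous effect
Y = X[:, 1] + tau * T + 0.1 * rng.normal(size=n)

# T-learner: one outcome model per arm, CATE = difference of predictions
m1 = GradientBoostingRegressor(random_state=0).fit(X[T == 1], Y[T == 1])
m0 = GradientBoostingRegressor(random_state=0).fit(X[T == 0], Y[T == 0])
cate_t = m1.predict(X) - m0.predict(X)

# S-learner: one model with T as an extra feature, CATE = f(X, 1) - f(X, 0)
XT = np.column_stack([X, T])
ms = GradientBoostingRegressor(random_state=0).fit(XT, Y)
cate_s = (ms.predict(np.column_stack([X, np.ones(n)]))
          - ms.predict(np.column_stack([X, np.zeros(n)])))
```

With abundant data in both arms the two learners agree; the table's trade-offs show up when the arms are small or imbalanced.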
```python
from econml.metalearners import TLearner, SLearner, XLearner
from sklearn.ensemble import GradientBoostingRegressor

# T-learner: separate outcome models for treated and control
t_learner = TLearner(models=GradientBoostingRegressor())
t_learner.fit(Y, T, X=X)
cate_t = t_learner.effect(X)

# S-learner: a single model with the treatment as a feature
s_learner = SLearner(overall_model=GradientBoostingRegressor())
s_learner.fit(Y, T, X=X)
cate_s = s_learner.effect(X)

# X-learner: robust to imbalanced treatment groups
x_learner = XLearner(models=GradientBoostingRegressor())
x_learner.fit(Y, T, X=X)
cate_x = x_learner.effect(X)
```
Reference:

- Kunzel, S.R. et al. (2019). "Metalearners for Estimating Heterogeneous Treatment Effects using Machine Learning". PNAS.
Sensitivity Analysis¶
Unobserved Confounding Analysis¶
Rosenbaum Bounds:

How much the effect estimate could move if treatment assignment were subject to hidden bias of a given magnitude.

E-value:

The minimum strength of association an unobserved confounder would need with both treatment and outcome to explain away the observed effect:

\(E\text{-value} = RR + \sqrt{RR \times (RR - 1)}\)
```python
import numpy as np

def calculate_e_value(point_estimate, ci_lower=None):
    """E-value calculation (VanderWeele & Ding, 2017).

    point_estimate: risk ratio (RR) or odds ratio for a rare outcome
    """
    rr = point_estimate
    if rr < 1:
        rr = 1 / rr  # protective effects: take the reciprocal first
    e_value = rr + np.sqrt(rr * (rr - 1))

    # E-value for the confidence limit closest to the null
    if ci_lower is not None and ci_lower > 1:
        e_value_ci = ci_lower + np.sqrt(ci_lower * (ci_lower - 1))
    else:
        e_value_ci = 1.0

    return {
        'e_value_point': e_value,
        'e_value_ci': e_value_ci,
    }
```
Reference:

- VanderWeele, T.J. & Ding, P. (2017). "Sensitivity Analysis in Observational Research: Introducing the E-Value". Annals of Internal Medicine.
Practical Guide¶
Method Selection Flowchart¶
Start
│
├── RCT/A/B test feasible?
│   ├── Yes → RCT
│   └── No →
│       │
│       ├── Treatment assigned by a threshold?
│       │   └── Yes → RDD
│       │
│       ├── Pre/post data plus a control group?
│       │   └── Yes → DiD
│       │
│       ├── A good instrument available?
│       │   └── Yes → IV
│       │
│       └── Observational data only?
│           │
│           ├── Ignorability credible?
│           │   ├── Yes → Matching, IPW, DML
│           │   └── No → Sensitivity analysis is essential
│           │
│           └── Interested in heterogeneous effects?
│               ├── Yes → Causal Forest, Meta-learners
│               └── No → DML, AIPW
Checklist¶

- Clarify the research question
    - ATE vs ATT vs CATE
    - Direction and interpretation of the causal effect
- Check the assumptions
    - Ignorability/Unconfoundedness
    - Overlap/Positivity
    - SUTVA
    - (DiD) Parallel Trends
    - (RDD) Continuity
    - (IV) Relevance, Exclusion
- Data diagnostics
    - Check covariate balance
    - Propensity score distributions
    - Common support region
- Sensitivity analysis
    - Compute E-values
    - Cross-check with alternative methods
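For the covariate-balance item above, the standardized mean difference (SMD) is the usual diagnostic; a common rule of thumb flags |SMD| > 0.1 as meaningful imbalance. A minimal sketch with illustrative data:

```python
import numpy as np

def standardized_mean_difference(x_treated, x_control):
    """SMD: mean difference scaled by the pooled standard deviation."""
    diff = x_treated.mean() - x_control.mean()
    pooled_sd = np.sqrt((x_treated.var(ddof=1) + x_control.var(ddof=1)) / 2)
    return diff / pooled_sd

# Example: a covariate that is imbalanced before any adjustment
rng = np.random.default_rng(0)
x_t = rng.normal(0.5, 1.0, size=500)   # treated group shifted by 0.5 SD
x_c = rng.normal(0.0, 1.0, size=500)
smd = standardized_mean_difference(x_t, x_c)
```

In a matching or weighting workflow, the SMD of every covariate is computed before and after adjustment; adjustment should push all of them toward zero.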
Related Documents¶
| Topic | Description | Link |
|---|---|---|
| Double Machine Learning | ML-based causal inference in detail | double-machine-learning.md |
References¶
Textbooks¶
- Hernan, M.A. & Robins, J.M. (2020). "Causal Inference: What If". Chapman & Hall. (free online)
- Imbens, G.W. & Rubin, D.B. (2015). "Causal Inference for Statistics, Social, and Biomedical Sciences". Cambridge.
- Pearl, J. et al. (2016). "Causal Inference in Statistics: A Primer". Wiley.
- Angrist, J.D. & Pischke, J.S. (2009). "Mostly Harmless Econometrics". Princeton.
Key Papers¶
- Rubin, D.B. (1974). "Estimating Causal Effects". Journal of Educational Psychology.
- Rosenbaum, P.R. & Rubin, D.B. (1983). "The Central Role of the Propensity Score". Biometrika.
- Chernozhukov, V. et al. (2018). "Double/Debiased Machine Learning". The Econometrics Journal.
- Wager, S. & Athey, S. (2018). "Causal Forests". JASA.
Libraries¶
- EconML (Microsoft): https://github.com/microsoft/EconML
- DoWhy: https://github.com/py-why/dowhy
- CausalML (Uber): https://github.com/uber/causalml
- grf (R): https://github.com/grf-labs/grf