
Causal Inference Overview

Causal inference is the methodology for answering questions of the form "does X cause Y?" It estimates causation rather than mere correlation, and is an essential tool for sound decision-making.


Theoretical Foundations

Correlation vs. Causation

Correlation: X and Y vary together
   - Ice cream sales ↔ drowning accidents (common cause: hot weather)

Causation: X brings about Y (X → Y)
   - Taking a drug → symptom relief

Simpson's Paradox example:

A relationship seen in the aggregate data reverses within subgroups.

| Stratum | Treatment recovery rate | Control recovery rate |
|---------|-------------------------|-----------------------|
| Pooled  | 50%                     | 60%                   |
| Mild    | 90%                     | 85%                   |
| Severe  | 30%                     | 25%                   |

Only by controlling for the confounder (severity) can the true treatment effect be seen.
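
The reversal above can be reproduced numerically with hypothetical group sizes chosen to match the table's rates (treated patients mostly severe, controls mostly mild; the counts themselves are assumptions):

```python
# Hypothetical counts that reproduce the table's rates.
groups = {
    # (arm, severity): (n_recovered, n_total)
    ("treated", "mild"):   (90, 100),
    ("treated", "severe"): (60, 200),
    ("control", "mild"):   (119, 140),
    ("control", "severe"): (25, 100),
}

def rate(arm, severity=None):
    """Recovery rate for an arm, optionally within one severity stratum."""
    items = [v for k, v in groups.items()
             if k[0] == arm and (severity is None or k[1] == severity)]
    rec = sum(r for r, _ in items)
    tot = sum(t for _, t in items)
    return rec / tot

print(rate("treated"), rate("control"))                      # 0.5 vs 0.6: control looks better
print(rate("treated", "mild"), rate("control", "mild"))      # 0.9 vs 0.85: treatment better
print(rate("treated", "severe"), rate("control", "severe"))  # 0.3 vs 0.25: treatment better
```

Because the treated arm is weighted toward severe cases, the pooled comparison reverses both within-stratum comparisons.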

Potential Outcomes Framework (Rubin Causal Model)

Key concepts:
- \(Y_i(1)\): potential outcome for unit \(i\) if treated
- \(Y_i(0)\): potential outcome for unit \(i\) if untreated
- \(T_i \in \{0, 1\}\): treatment indicator
- \(Y_i^{obs} = T_i \cdot Y_i(1) + (1 - T_i) \cdot Y_i(0)\): observed outcome

Individual treatment effect (ITE):

\[\tau_i = Y_i(1) - Y_i(0)\]

The Fundamental Problem of Causal Inference:

For any single unit, \(Y_i(1)\) and \(Y_i(0)\) can never be observed simultaneously.

Average treatment effect (ATE):

\[\tau_{ATE} = \mathbb{E}[Y(1) - Y(0)] = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)]\]

Average treatment effect on the treated (ATT):

\[\tau_{ATT} = \mathbb{E}[Y(1) - Y(0) | T = 1]\]
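
Because only one potential outcome per unit is ever observed, the estimands are easiest to see in simulation, where both are available. A minimal sketch (the data-generating process is assumed) showing how confounded assignment biases the naive mean difference away from the true ATE:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100_000
x = rng.normal(size=n)                  # confounder
y0 = x + rng.normal(size=n)             # potential outcome without treatment
y1 = y0 + 2.0                           # constant individual effect: tau_i = 2
t = (rng.random(n) < 1 / (1 + np.exp(-2 * x))).astype(int)  # assignment depends on x
y_obs = t * y1 + (1 - t) * y0           # consistency: only one outcome observed

true_ate = (y1 - y0).mean()             # exactly 2.0 by construction
naive = y_obs[t == 1].mean() - y_obs[t == 0].mean()  # confounded contrast
print(true_ate, naive)                  # naive is biased upward: treated units have higher x
```

The naive contrast mixes the treatment effect with the difference in \(x\) between arms, which is exactly what randomization or adjustment removes.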

References:
- Rubin, D.B. (1974). "Estimating Causal Effects of Treatments in Randomized and Nonrandomized Studies". Journal of Educational Psychology.
- Holland, P.W. (1986). "Statistics and Causal Inference". JASA.

Causal Graphs (Structural Causal Models)

Pearl's Structural Causal Model (SCM):

DAG (Directed Acyclic Graph):

     U_X    U_Y
      ↓      ↓
  Z → X  →  Y
      M

Identifying causal effects:

\[P(Y | do(X=x)) \neq P(Y | X=x)\]
  • \(P(Y | X=x)\): conditional probability (observational)
  • \(P(Y | do(X=x))\): interventional probability (causal)

Backdoor Criterion:

If \(Z\) satisfies the backdoor criterion:

\[P(Y | do(X)) = \sum_z P(Y | X, Z=z) P(Z=z)\]
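
With discrete \(Z\), the adjustment formula above can be evaluated directly from conditional probability tables. A toy sketch (all probabilities are assumed numbers) comparing \(P(Y=1 \mid X=1)\) with \(P(Y=1 \mid do(X=1))\):

```python
# Toy model: Z -> X, Z -> Y, X -> Y, all binary (probabilities assumed).
p_z = {0: 0.5, 1: 0.5}
p_x_given_z = {0: 0.2, 1: 0.8}          # P(X=1 | Z=z)
p_y_given_xz = {                         # P(Y=1 | X=x, Z=z)
    (0, 0): 0.1, (0, 1): 0.5,
    (1, 0): 0.3, (1, 1): 0.7,
}

# Observational: P(Y=1 | X=1) = sum_z P(Y=1 | X=1, z) P(z | X=1)
p_x1 = sum(p_x_given_z[z] * p_z[z] for z in p_z)
p_obs = sum(p_y_given_xz[(1, z)] * p_x_given_z[z] * p_z[z] for z in p_z) / p_x1

# Interventional (backdoor adjustment): P(Y=1 | do(X=1)) = sum_z P(Y=1 | X=1, z) P(z)
p_do = sum(p_y_given_xz[(1, z)] * p_z[z] for z in p_z)

print(p_obs, p_do)  # 0.62 vs 0.50: conditioning and intervening disagree
```

The gap arises because conditioning on \(X=1\) shifts the distribution of \(Z\), while \(do(X=1)\) leaves \(P(Z)\) untouched.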

References:
- Pearl, J. (2009). "Causality: Models, Reasoning, and Inference" (2nd ed.; 1st ed. 2000). Cambridge University Press.


Identification Assumptions

Core Assumptions

| Assumption           | Meaning                                           | Testable?          |
|----------------------|---------------------------------------------------|--------------------|
| SUTVA                | One unit's treatment does not affect other units  | Partially testable |
| Ignorability         | \((Y(0), Y(1)) \perp T \mid X\)                   | Not testable       |
| Overlap (Positivity) | \(0 < P(T=1 \mid X) < 1\)                         | Testable           |
| Consistency          | \(Y = Y(T)\)                                      | Assumed            |

Ignorability (Unconfoundedness):

Conditional on the observed covariates \(X\), treatment assignment is independent of the potential outcomes:

\[(Y(0), Y(1)) \perp T | X\]

"Selection on observables": the assumption that there are no unobserved confounders.
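
The Overlap (Positivity) assumption above can be probed empirically by inspecting the estimated propensity-score distribution. A minimal sketch on simulated data (the 0.05/0.95 cutoffs are a common but arbitrary convention, not from this document):

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

rng = np.random.default_rng(0)
X = rng.normal(size=(5000, 3))
T = (rng.random(5000) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)

# Estimated propensity scores
ps = LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1]

# Share of units with extreme scores: near-violations of positivity
extreme = ((ps < 0.05) | (ps > 0.95)).mean()
print(f"PS range: [{ps.min():.3f}, {ps.max():.3f}], extreme share: {extreme:.3%}")
```

A large extreme share signals thin common support, in which case trimming or a change of estimand (e.g. ATT) is worth considering.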


Algorithm Taxonomy

Causal Inference Methods
├── Experimental
│   └── Randomized Controlled Trial (RCT)
├── Quasi-experimental
│   ├── Difference-in-Differences (DiD)
│   ├── Regression Discontinuity (RDD)
│   ├── Instrumental Variables (IV)
│   └── Synthetic Control
├── Observational (Adjustment)
│   ├── Regression Adjustment
│   ├── Matching
│   │   ├── Exact Matching
│   │   ├── Propensity Score Matching
│   │   ├── Coarsened Exact Matching
│   │   └── Genetic Matching
│   ├── Weighting
│   │   ├── Inverse Probability Weighting (IPW)
│   │   ├── Augmented IPW (AIPW)
│   │   └── Entropy Balancing
│   └── Stratification
├── Machine Learning for Causal Inference
│   ├── Double Machine Learning (DML)
│   ├── Causal Forests
│   ├── CATE Estimation (X-learner, T-learner, S-learner)
│   ├── Targeted Maximum Likelihood (TMLE)
│   └── Bayesian Additive Regression Trees (BART)
└── Sensitivity Analysis
    ├── Rosenbaum Bounds
    ├── E-value
    └── Confounding Function

Randomized Controlled Trial (RCT)

The Gold Standard

Randomization eliminates confounding:

\[\mathbb{E}[Y | T=1] - \mathbb{E}[Y | T=0] = \mathbb{E}[Y(1)] - \mathbb{E}[Y(0)] = \tau_{ATE}\]

What randomization provides:
- Balance on all confounders, observed and unobserved
- Guarantees \((Y(0), Y(1)) \perp T\)

A/B Test Analysis

import numpy as np
from scipy import stats

def ab_test_analysis(control, treatment, alpha=0.05):
    """
    A/B test analysis (Welch's t-test)
    """
    # Basic statistics
    n_c, n_t = len(control), len(treatment)
    mean_c, mean_t = control.mean(), treatment.mean()
    var_c, var_t = control.var(ddof=1), treatment.var(ddof=1)

    # ATE estimate
    ate = mean_t - mean_c

    # Standard error (Welch's t-test)
    se = np.sqrt(var_c / n_c + var_t / n_t)

    # Welch-Satterthwaite degrees of freedom
    df = se**4 / ((var_c / n_c)**2 / (n_c - 1) + (var_t / n_t)**2 / (n_t - 1))

    # Confidence interval
    t_crit = stats.t.ppf(1 - alpha/2, df=df)
    ci_lower = ate - t_crit * se
    ci_upper = ate + t_crit * se

    # p-value
    t_stat = ate / se
    p_value = 2 * (1 - stats.t.cdf(abs(t_stat), df=df))

    # Effect size (Cohen's d)
    pooled_std = np.sqrt(((n_c-1)*var_c + (n_t-1)*var_t) / (n_c + n_t - 2))
    cohens_d = ate / pooled_std

    return {
        'ate': ate,
        'se': se,
        'ci': (ci_lower, ci_upper),
        'p_value': p_value,
        'cohens_d': cohens_d,
        'significant': p_value < alpha
    }

Observational-Data Methods

Propensity Score Methods

Propensity Score:

\[e(X) = P(T=1 | X)\]

Key property (Rosenbaum & Rubin, 1983):

If ignorability holds given \(X\), it also holds given \(e(X)\):

\[(Y(0), Y(1)) \perp T | e(X)\]

Propensity Score Matching

from sklearn.linear_model import LogisticRegression
from sklearn.neighbors import NearestNeighbors
import numpy as np

class PropensityScoreMatching:
    def __init__(self, caliper=0.2):
        self.caliper = caliper  # max allowed distance (in propensity-score SD units)

    def fit(self, X, T):
        """Estimate propensity scores"""
        self.ps_model = LogisticRegression(max_iter=1000)
        self.ps_model.fit(X, T)
        self.propensity_scores = self.ps_model.predict_proba(X)[:, 1]
        self.X = X
        self.T = T
        return self

    def match(self, Y):
        """1:1 nearest-neighbor matching (with replacement)"""
        treated_idx = np.where(self.T == 1)[0]
        control_idx = np.where(self.T == 0)[0]

        ps_treated = self.propensity_scores[treated_idx].reshape(-1, 1)
        ps_control = self.propensity_scores[control_idx].reshape(-1, 1)

        # Caliper: discard matches farther apart than this
        caliper_threshold = self.caliper * self.propensity_scores.std()

        nn = NearestNeighbors(n_neighbors=1, metric='euclidean')
        nn.fit(ps_control)
        distances, indices = nn.kneighbors(ps_treated)

        # Matched pairs within the caliper
        matched_pairs = []
        for i, (dist, ctrl_idx) in enumerate(zip(distances, indices)):
            if dist[0] <= caliper_threshold:
                matched_pairs.append((treated_idx[i], control_idx[ctrl_idx[0]]))

        matched_pairs = np.array(matched_pairs)

        # Effect estimate; matching from the treated group estimates the ATT
        Y_treated = Y[matched_pairs[:, 0]]
        Y_control = Y[matched_pairs[:, 1]]

        att = (Y_treated - Y_control).mean()
        se = (Y_treated - Y_control).std(ddof=1) / np.sqrt(len(matched_pairs))

        return {
            'att': att,
            'se': se,
            'n_matched': len(matched_pairs),
            'n_treated': len(treated_idx),
            'matched_pairs': matched_pairs
        }
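
A standard post-matching diagnostic is the standardized mean difference (SMD) of each covariate between arms; values below 0.1 are conventionally considered balanced (the threshold is a convention, not from this document). A self-contained sketch on simulated data:

```python
import numpy as np

def standardized_mean_diff(X, T):
    """SMD per covariate: (mean_treated - mean_control) / pooled std."""
    Xt, Xc = X[T == 1], X[T == 0]
    pooled_sd = np.sqrt((Xt.var(ddof=1, axis=0) + Xc.var(ddof=1, axis=0)) / 2)
    return (Xt.mean(axis=0) - Xc.mean(axis=0)) / pooled_sd

rng = np.random.default_rng(0)
X = rng.normal(size=(2000, 3))
T = (rng.random(2000) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)  # selection on X[:, 0]

smd = standardized_mean_diff(X, T)
print(np.round(smd, 3))  # first covariate is badly imbalanced before matching
```

Computing the same statistic on the matched sample (using the matched pair indices) shows whether matching actually restored balance.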

Inverse Probability Weighting (IPW)

\[\hat{\tau}_{IPW} = \frac{1}{n}\sum_{i=1}^{n}\left[\frac{T_i Y_i}{e(X_i)} - \frac{(1-T_i) Y_i}{1 - e(X_i)}\right]\]

The Horvitz-Thompson estimator:

Each observation is weighted by the inverse of the probability of receiving the treatment it actually received:

import numpy as np

def ipw_estimator(Y, T, propensity_scores, clip=0.01):
    """
    Inverse Probability Weighting (normalized / Hajek version)
    """
    # Clip propensity scores to avoid extreme weights
    ps_clipped = np.clip(propensity_scores, clip, 1 - clip)

    # Normalized IPW: weights sum to one within each arm
    ate = (T * Y / ps_clipped).sum() / (T / ps_clipped).sum() - \
          ((1-T) * Y / (1-ps_clipped)).sum() / ((1-T) / (1-ps_clipped)).sum()

    return ate

Augmented IPW (AIPW / Doubly Robust)

\[\hat{\tau}_{AIPW} = \frac{1}{n}\sum_{i=1}^{n}\left[\hat{\mu}_1(X_i) - \hat{\mu}_0(X_i) + \frac{T_i(Y_i - \hat{\mu}_1(X_i))}{e(X_i)} - \frac{(1-T_i)(Y_i - \hat{\mu}_0(X_i))}{1-e(X_i)}\right]\]

Doubly robust: the estimator remains consistent if either the propensity model or the outcome model (not necessarily both) is correctly specified.
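
A minimal AIPW sketch following the formula above, on simulated data (the data-generating process and model choices are illustrative assumptions; no cross-fitting, for brevity):

```python
import numpy as np
from sklearn.linear_model import LinearRegression, LogisticRegression

rng = np.random.default_rng(0)
n = 5000
X = rng.normal(size=(n, 2))
T = (rng.random(n) < 1 / (1 + np.exp(-X[:, 0]))).astype(int)
Y = 2.0 * T + X[:, 0] + rng.normal(size=n)   # true ATE = 2

# Nuisance models: propensity e(X) and outcome regressions mu1, mu0
e = np.clip(LogisticRegression(max_iter=1000).fit(X, T).predict_proba(X)[:, 1],
            0.01, 0.99)
mu1 = LinearRegression().fit(X[T == 1], Y[T == 1]).predict(X)
mu0 = LinearRegression().fit(X[T == 0], Y[T == 0]).predict(X)

# AIPW (doubly robust) score, averaged over the sample
psi = (mu1 - mu0
       + T * (Y - mu1) / e
       - (1 - T) * (Y - mu0) / (1 - e))
ate = psi.mean()
se = psi.std(ddof=1) / np.sqrt(n)
print(f"ATE = {ate:.3f} +/- {1.96 * se:.3f}")
```

The sample standard deviation of the score divided by \(\sqrt{n}\) gives a simple standard error for the AIPW estimate.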

References:
- Robins, J.M. et al. (1994). "Estimation of Regression Coefficients When Some Regressors Are Not Always Observed". JASA.


Quasi-experimental Methods

Difference-in-Differences (DiD)

Setup:
- Treatment and control groups
- Pre- and post-treatment periods

Key assumption: Parallel Trends. Had there been no treatment, the two groups' outcome trends would have been parallel.

\[\tau_{DiD} = (Y_{T,post} - Y_{T,pre}) - (Y_{C,post} - Y_{C,pre})\]

import pandas as pd
import statsmodels.formula.api as smf

def did_analysis(df, outcome, treatment, post, entity=None):
    """
    Difference-in-Differences

    df: DataFrame with columns [outcome, treatment, post, entity(optional)]
    """
    # DiD regression: the interaction coefficient is the DiD estimate
    formula = f'{outcome} ~ {treatment} * {post}'

    if entity:
        formula += f' + C({entity})'  # Entity fixed effects
        # Cluster standard errors at the entity level
        fit_kwargs = {'cov_type': 'cluster', 'cov_kwds': {'groups': df[entity]}}
    else:
        fit_kwargs = {}

    model = smf.ols(formula, data=df).fit(**fit_kwargs)

    # DiD coefficient
    interaction_term = f'{treatment}:{post}'
    did_estimate = model.params[interaction_term]
    did_se = model.bse[interaction_term]
    did_pvalue = model.pvalues[interaction_term]

    return {
        'did_estimate': did_estimate,
        'se': did_se,
        'p_value': did_pvalue,
        'ci_lower': model.conf_int().loc[interaction_term, 0],
        'ci_upper': model.conf_int().loc[interaction_term, 1],
        'model_summary': model.summary()
    }
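
The four-mean DiD formula above can be verified on simulated data (all effect sizes are assumed numbers):

```python
import numpy as np

rng = np.random.default_rng(0)
n = 4000
treated = rng.integers(0, 2, n)   # group indicator
post = rng.integers(0, 2, n)      # period indicator
# Outcome: group gap 1.0, common time trend 0.5, true DiD effect 2.0
y = (1.0 * treated + 0.5 * post + 2.0 * treated * post
     + rng.normal(scale=0.5, size=n))

# Cell means for the 2x2 design
means = {(g, p): y[(treated == g) & (post == p)].mean()
         for g in (0, 1) for p in (0, 1)}

# DiD: change in the treated group minus change in the control group
did = (means[1, 1] - means[1, 0]) - (means[0, 1] - means[0, 0])
print(did)  # close to 2.0: the group gap and time trend cancel out
```

The same number is recovered by the interaction coefficient in the regression version above.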

References:
- Card, D. & Krueger, A.B. (1994). "Minimum Wages and Employment: A Case Study of the Fast-Food Industry". AER.

Regression Discontinuity Design (RDD)

Setup: treatment is assigned by a cutoff on a running variable

\[T_i = \mathbf{1}[X_i \geq c]\]

Key assumption: the potential-outcome regression functions are continuous at the cutoff

Estimation (local linear regression):

\[\hat{\tau}_{RDD} = \lim_{x \to c^+} \mathbb{E}[Y|X=x] - \lim_{x \to c^-} \mathbb{E}[Y|X=x]\]

from sklearn.linear_model import LinearRegression
import numpy as np

def rdd_analysis(X, Y, cutoff, bandwidth=None):
    """
    Regression Discontinuity Design
    """
    # Centering
    X_centered = X - cutoff
    T = (X >= cutoff).astype(int)

    # Bandwidth selection: simple rule of thumb
    # (in practice, prefer the Imbens-Kalyanaraman or CCT optimal bandwidth)
    if bandwidth is None:
        bandwidth = 1.06 * X_centered.std() * len(X)**(-1/5)

    # Local sample
    local_mask = np.abs(X_centered) <= bandwidth
    X_local = X_centered[local_mask]
    Y_local = Y[local_mask]
    T_local = T[local_mask]

    # Local linear regression with interaction
    # Y = a + b*X + c*T + d*T*X + e
    design = np.column_stack([
        np.ones(len(X_local)),
        X_local,
        T_local,
        T_local * X_local
    ])

    model = LinearRegression(fit_intercept=False)
    model.fit(design, Y_local)

    # RDD estimate is coefficient on T
    rdd_estimate = model.coef_[2]

    # Bootstrap for standard error
    n_bootstrap = 1000
    bootstrap_estimates = []
    for _ in range(n_bootstrap):
        idx = np.random.choice(len(X_local), size=len(X_local), replace=True)
        model_boot = LinearRegression(fit_intercept=False)
        model_boot.fit(design[idx], Y_local[idx])
        bootstrap_estimates.append(model_boot.coef_[2])

    se = np.std(bootstrap_estimates)

    return {
        'rdd_estimate': rdd_estimate,
        'se': se,
        'bandwidth': bandwidth,
        'n_local': local_mask.sum()
    }

References:
- Imbens, G.W. & Lemieux, T. (2008). "Regression Discontinuity Designs: A Guide to Practice". Journal of Econometrics.
- Cattaneo, M.D. et al. (2019). "A Practical Introduction to Regression Discontinuity Designs". Cambridge.

Instrumental Variables (IV)

Setup: an instrument \(Z\) that affects the treatment but has no direct effect on the outcome

Assumptions:
1. Relevance: \(Cov(Z, T) \neq 0\)
2. Exclusion: \(Z\) affects \(Y\) only through \(T\)
3. Independence: \(Z \perp (Y(0), Y(1))\)

2SLS (Two-Stage Least Squares):

Stage 1: \(T = \alpha_0 + \alpha_1 Z + \nu\)
Stage 2: \(Y = \beta_0 + \beta_1 \hat{T} + \epsilon\)

\[\hat{\tau}_{IV} = \frac{Cov(Y, Z)}{Cov(T, Z)} = \frac{\text{Reduced Form}}{\text{First Stage}}\]

from linearmodels.iv import IV2SLS
import pandas as pd

def iv_analysis(df, outcome, treatment, instrument, covariates=None):
    """
    Instrumental Variables (2SLS)
    """
    # Formula
    if covariates:
        cov_str = ' + '.join(covariates)
        formula = f'{outcome} ~ 1 + {cov_str} + [{treatment} ~ {instrument}]'
    else:
        formula = f'{outcome} ~ 1 + [{treatment} ~ {instrument}]'

    model = IV2SLS.from_formula(formula, data=df).fit(cov_type='robust')

    return {
        'iv_estimate': model.params[treatment],
        'se': model.std_errors[treatment],
        'p_value': model.pvalues[treatment],
        'first_stage_f': model.first_stage.diagnostics.loc[treatment, 'f.stat'],  # want F > 10
        'summary': model.summary
    }

References:
- Angrist, J.D. et al. (1996). "Identification of Causal Effects Using Instrumental Variables". JASA.


Machine Learning for Causal Inference

Double Machine Learning (DML)

Neyman-orthogonal scores remove the first-order influence of ML estimation error:

Partially Linear Model:

\[Y = \theta T + g(X) + U\]
\[T = m(X) + V\]

DML procedure:
1. Estimate \(\hat{g}(X) = \mathbb{E}[Y|X]\) (outcome model)
2. Estimate \(\hat{m}(X) = \mathbb{E}[T|X]\) (propensity model)
3. Compute residuals: \(\tilde{Y} = Y - \hat{g}(X)\), \(\tilde{T} = T - \hat{m}(X)\)
4. Regress: \(\hat{\theta} = (\tilde{T}^T \tilde{T})^{-1} \tilde{T}^T \tilde{Y}\)

Cross-fitting: sample splitting to avoid overfitting bias

from econml.dml import LinearDML
from sklearn.ensemble import RandomForestRegressor, RandomForestClassifier

def double_ml_analysis(X, T, Y):
    """
    Double Machine Learning
    """
    # Model specification
    model_y = RandomForestRegressor(n_estimators=100, max_depth=5, random_state=42)
    model_t = RandomForestClassifier(n_estimators=100, max_depth=5, random_state=42)

    # Linear DML (homogeneous effect)
    dml = LinearDML(
        model_y=model_y,
        model_t=model_t,
        discrete_treatment=True,
        cv=5,
        random_state=42
    )
    dml.fit(Y, T, X=X)

    ate = dml.ate(X)
    ate_inference = dml.ate_inference(X)

    return {
        'ate': ate,
        'se': ate_inference.std_err,
        'ci_lower': ate_inference.conf_int()[0][0],
        'ci_upper': ate_inference.conf_int()[0][1],
        'p_value': ate_inference.pvalue[0]
    }

References:
- Chernozhukov, V. et al. (2018). "Double/Debiased Machine Learning for Treatment and Structural Parameters". The Econometrics Journal.

Causal Forest (Heterogeneous Treatment Effects)

CATE (Conditional Average Treatment Effect):

\[\tau(x) = \mathbb{E}[Y(1) - Y(0) | X = x]\]

Algorithm (Generalized Random Forest):
1. Splitting based on local moment conditions
2. Honest estimation (separate train/estimate samples)
3. Weighted local estimation

from econml.dml import CausalForestDML
from econml.cate_interpreter import SingleTreeCateInterpreter

def causal_forest_analysis(X, T, Y):
    """
    Causal Forest for Heterogeneous Treatment Effects
    """
    cf = CausalForestDML(
        n_estimators=1000,
        min_samples_leaf=5,
        max_depth=None,
        honest=True,
        discrete_treatment=True,
        random_state=42
    )
    cf.fit(Y, T, X=X)

    # Individual (conditional) treatment effects with confidence intervals
    cate = cf.effect(X)
    cate_lower, cate_upper = cf.effect_interval(X, alpha=0.05)

    # Feature importance for effect heterogeneity
    feature_importance = cf.feature_importances_

    # Interpret the CATE function with a shallow tree
    interpreter = SingleTreeCateInterpreter(max_depth=3)
    interpreter.interpret(cf, X)

    return {
        'cate': cate,
        'cate_lower': cate_lower,
        'cate_upper': cate_upper,
        'ate': cate.mean(),
        'feature_importance': feature_importance,
        'policy_tree': interpreter
    }

References:
- Wager, S. & Athey, S. (2018). "Estimation and Inference of Heterogeneous Treatment Effects using Random Forests". JASA.
- Athey, S. et al. (2019). "Generalized Random Forests". Annals of Statistics.

Meta-learners (CATE Estimation)

| Learner    | Approach                                       | Pros                      | Cons                            |
|------------|------------------------------------------------|---------------------------|---------------------------------|
| T-learner  | Separate models for treated and control        | Flexible                  | Splitting data hurts efficiency |
| S-learner  | Single model with T as a feature               | Simple                    | Can miss the treatment effect   |
| X-learner  | T-learner → imputed effects → weighted blend   | Robust to imbalanced arms | Complex                         |
| R-learner  | Residual-on-residual regression                | Theoretical guarantees    |                                 |
| DR-learner | Doubly robust pseudo-outcomes                  | Efficient, robust         |                                 |

from econml.metalearners import TLearner, SLearner, XLearner
from sklearn.ensemble import GradientBoostingRegressor

# T-Learner
t_learner = TLearner(models=GradientBoostingRegressor())
t_learner.fit(Y, T, X=X)
cate_t = t_learner.effect(X)

# S-Learner
s_learner = SLearner(overall_model=GradientBoostingRegressor())
s_learner.fit(Y, T, X=X)
cate_s = s_learner.effect(X)

# X-Learner (handles imbalanced treatment)
x_learner = XLearner(models=GradientBoostingRegressor())
x_learner.fit(Y, T, X=X)
cate_x = x_learner.effect(X)

References:
- Künzel, S.R. et al. (2019). "Metalearners for Estimating Heterogeneous Treatment Effects using Machine Learning". PNAS.


Sensitivity Analysis

Assessing Unobserved Confounding

Rosenbaum Bounds:

Bound how much the effect estimate could change under unobserved confounding of a given strength in treatment assignment

E-value:

The minimum strength of association an unobserved confounder would need with both treatment and outcome to explain away the observed effect

\[E\text{-value} = RR + \sqrt{RR \cdot (RR - 1)}\]

import numpy as np

def calculate_e_value(point_estimate, ci_lower=None):
    """
    Compute the E-value
    point_estimate: risk ratio (or an approximated odds ratio)
    """
    rr = point_estimate
    e_value = rr + np.sqrt(rr * (rr - 1))

    # E-value for the confidence-interval bound closer to the null
    if ci_lower is not None and ci_lower > 1:
        e_value_ci = ci_lower + np.sqrt(ci_lower * (ci_lower - 1))
    else:
        e_value_ci = 1

    return {
        'e_value_point': e_value,
        'e_value_ci': e_value_ci
    }
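
As a worked check of the formula, take an observed risk ratio of 2 (an assumed number):

```python
import numpy as np

rr = 2.0                                  # observed risk ratio (assumed)
e_value = rr + np.sqrt(rr * (rr - 1))     # E-value = 2 + sqrt(2)
print(round(e_value, 3))                  # 3.414
```

Reading: an unobserved confounder would need risk-ratio associations of at least 3.41 with both treatment and outcome to fully explain away an observed RR of 2.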

References:
- VanderWeele, T.J. & Ding, P. (2017). "Sensitivity Analysis in Observational Research: Introducing the E-Value". Annals of Internal Medicine.


Practical Application Guide

Method Selection Flowchart

Start
├── RCT / A/B test feasible?
│   ├── Yes → RCT
│   └── No →
│       │
│       ├── Treatment assigned by a threshold?
│       │   └── Yes → RDD
│       │
│       ├── Pre/post data with a control group?
│       │   └── Yes → DiD
│       │
│       ├── A good instrument available?
│       │   └── Yes → IV
│       │
│       └── Observational data only?
│           │
│           ├── Is Ignorability plausible?
│           │   ├── Yes → Matching, IPW, DML
│           │   └── No → Sensitivity analysis is essential
│           │
│           └── Interested in heterogeneous effects?
│               ├── Yes → Causal Forest, Meta-learners
│               └── No → DML, AIPW

Checklist

  1. Clarify the research question
     - ATE vs ATT vs CATE
     - Direction and interpretation of the causal effect

  2. Check the assumptions
     - Ignorability/Unconfoundedness
     - Overlap/Positivity
     - SUTVA
     - (DiD) Parallel Trends
     - (RDD) Continuity
     - (IV) Relevance, Exclusion

  3. Diagnose the data
     - Covariate balance
     - Propensity score distribution
     - Common support region

  4. Sensitivity analysis
     - Compute E-values
     - Cross-check with an alternative method

Subpages

| Topic                   | Description                         | Link                       |
|-------------------------|-------------------------------------|----------------------------|
| Double Machine Learning | ML-based causal inference in detail | double-machine-learning.md |

References

Textbooks

  • Hernán, M.A. & Robins, J.M. (2020). "Causal Inference: What If". Chapman & Hall. (free online)
  • Imbens, G.W. & Rubin, D.B. (2015). "Causal Inference for Statistics, Social, and Biomedical Sciences". Cambridge.
  • Pearl, J. et al. (2016). "Causal Inference in Statistics: A Primer". Wiley.
  • Angrist, J.D. & Pischke, J.S. (2009). "Mostly Harmless Econometrics". Princeton.

Key Papers

  • Rubin, D.B. (1974). "Estimating Causal Effects". Journal of Educational Psychology.
  • Rosenbaum, P.R. & Rubin, D.B. (1983). "The Central Role of the Propensity Score". Biometrika.
  • Chernozhukov, V. et al. (2018). "Double/Debiased Machine Learning". The Econometrics Journal.
  • Wager, S. & Athey, S. (2018). "Causal Forests". JASA.

Libraries

  • EconML (Microsoft): https://github.com/microsoft/EconML
  • DoWhy: https://github.com/py-why/dowhy
  • CausalML (Uber): https://github.com/uber/causalml
  • grf (R): https://github.com/grf-labs/grf