Feature Engineering¶

Meta Information¶

Item	Content
Category	Data Preprocessing / Feature Construction / Feature Selection
Key papers	"Deep Feature Synthesis: Towards Automating Data Science Endeavors" (Kanter & Veeramachaneni, IEEE DSAA 2015), "ExploreKit: Automatic Feature Generation and Selection" (Katz et al., IEEE ICDM 2016), "OpenFE: Automated Feature Generation with Expert-level Performance" (Zhang et al., ICML 2023), "Large Language Models for Automated Data Science: Introducing CAAFE for Context-Aware Automated Feature Engineering" (Hollmann et al., NeurIPS 2023), "LLM-FE: Automated Feature Engineering with LLMs as Evolutionary Optimizers" (arXiv 2025), "Shap-Select: Lightweight Feature Selection Using SHAP Values and Regression" (arXiv 2024)
Key authors	James Max Kanter (Featuretools/DFS), Gilad Katz (ExploreKit), Tianping Zhang (OpenFE), Noah Hollmann & Frank Hutter (CAAFE)
Core concept	The process of designing, generating, and selecting input representations (features) from raw data so as to maximize the predictive performance of a machine learning model
Related fields	AutoML, Data-Centric AI, Tabular Learning, Transfer Learning, Dimensionality Reduction

Definition¶

Feature engineering covers the whole process of turning raw data into numerical representations that a machine learning model can learn from effectively. It ranges from manual, domain-knowledge-driven feature design to automated feature generation, feature selection, and feature transformation. As the saying attributed to Andrew Ng puts it, "Applied machine learning is basically feature engineering": a large share of model performance is determined by the quality of the features.

Feature engineering pipeline:

Raw Data (structured/unstructured)
  |
  v
[1] Feature Extraction
  - Text: TF-IDF, embeddings
  - Images: CNN features
  - Time series: lag, rolling stats
  |
  v
[2] Feature Construction
  - Domain-driven: ratios, differences, interactions
  - Automated: DFS, OpenFE, CAAFE
  - Math operations: log, sqrt, polynomial
  |
  v
[3] Feature Transformation
  - Scaling: StandardScaler, MinMax
  - Encoding: One-Hot, Target Encoding
  - Dimensionality reduction: PCA, t-SNE, UMAP
  |
  v
[4] Feature Selection
  - Filter: correlation, MI, chi-squared
  - Wrapper: RFE, Boruta
  - Embedded: L1 (Lasso), tree importance
  |
  v
Final Feature Set --> Model Training
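
The time-series extraction step in stage [1] (lag and rolling statistics) can be sketched with plain pandas. This is a minimal illustration, not a prescribed implementation: the helper name `add_lag_and_rolling`, the column name `sales`, and the toy values are all assumptions made for the example, and the rows are assumed to be sorted by time already.

```python
import pandas as pd


def add_lag_and_rolling(
    df: pd.DataFrame, value_col: str, lags=(1,), window: int = 3
) -> pd.DataFrame:
    """Add lag and rolling-mean features (hypothetical helper).

    Assumes rows are already sorted in time order.
    """
    out = df.copy()
    for lag in lags:
        # Value observed `lag` steps earlier
        out[f"{value_col}_lag{lag}"] = out[value_col].shift(lag)
    # shift(1) first so the window only sees strictly past values
    out[f"{value_col}_rollmean{window}"] = (
        out[value_col].shift(1).rolling(window).mean()
    )
    return out


sales = pd.DataFrame({"sales": [10.0, 12.0, 11.0, 15.0, 14.0]})
features = add_lag_and_rolling(sales, "sales", lags=(1,), window=3)
print(features)
```

Shifting before rolling is the design choice that matters here: without it, the rolling mean for a row would include that row's own value, which is a mild form of leakage when the same column is being forecast.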

Core Principles¶

1. Feature Construction¶

New features are derived from the raw features, either from domain knowledge or through automated search.

import pandas as pd
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

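# === Manual feature construction (domain knowledge) ===

def create_domain_features(df: pd.DataFrame) -> pd.DataFrame:
    """Real-estate example: features built from domain knowledge.

    Caution: if `price` is the prediction target, every feature derived
    from it (price_per_sqm, district_avg_price, price_vs_district,
    log_price) leaks the target and must not be fed to the model.
    """
    df = df.copy()

    # Ratio features (a zero area_sqm yields inf; clean such rows upstream)
    df['price_per_sqm'] = df['price'] / df['area_sqm']
    df['room_density'] = df['rooms'] / df['area_sqm']

    # Difference feature
    df['price_gap'] = df['asking_price'] - df['appraised_value']

    # Interaction feature
    df['floor_area_interaction'] = df['floor'] * df['area_sqm']

    # Aggregation features
    # Computed over the whole frame here; to avoid leakage, compute group
    # statistics on the training split only and map them onto test data.
    df['district_avg_price'] = df.groupby('district')['price'].transform('mean')
    df['price_vs_district'] = df['price'] / df['district_avg_price']

    # Temporal features
    df['sale_month'] = pd.to_datetime(df['sale_date']).dt.month
    df['days_on_market'] = (
        pd.to_datetime(df['sale_date']) - pd.to_datetime(df['list_date'])
    ).dt.days

    # Nonlinear transforms
    df['log_price'] = np.log1p(df['price'])
    df['area_squared'] = df['area_sqm'] ** 2

    return df


# === Polynomial features ===

def polynomial_features(X: np.ndarray, degree: int = 2) -> np.ndarray:
    """
    PolynomialFeatures: generate polynomial terms over all feature combinations.
    Caution: with degree=2, n input features expand to n*(n+1)/2 + n columns
    (plus 1 more if include_bias=True), i.e. O(n^2) growth.
    """
    poly = PolynomialFeatures(
        degree=degree,
        interaction_only=False,  # True keeps only the interaction terms
        include_bias=False
    )
    return poly.fit_transform(X)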
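
The growth in column count can be checked directly. The sketch below compares a closed-form count of monomials with what `PolynomialFeatures` reports; the helper `n_poly_terms` and the random demo matrix are illustrative assumptions, not part of the original text.

```python
from math import comb

import numpy as np
from sklearn.preprocessing import PolynomialFeatures


def n_poly_terms(n_features: int, degree: int, include_bias: bool = False) -> int:
    """Count monomials of total degree <= `degree` in `n_features` variables."""
    total = comb(n_features + degree, degree)  # includes the constant term
    return total if include_bias else total - 1


# 10 input features, degree 2, no bias column
X_demo = np.random.default_rng(0).normal(size=(5, 10))
poly = PolynomialFeatures(degree=2, include_bias=False).fit(X_demo)

print(n_poly_terms(10, 2), poly.n_output_features_)
```

For 10 inputs this is 10 linear terms, 10 squares, and 45 pairwise interactions; at 100 inputs the same setting already produces several thousand columns, which is why polynomial expansion is usually paired with feature selection.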

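2. Feature Encoding (Categorical Features)¶

There are many strategies for converting categorical variables to numeric ones, each with its own trade-offs.

Method	Dimensionality increase	Preserves order	Target leakage risk	Best suited for
One-Hot Encoding	O(k)	No	None	Low cardinality (k < 20)
Label Encoding	None	Imposes an integer order (arbitrary unless specified)	None	Tree models
Target Encoding	None	No	Yes (needs CV)	High cardinality
Frequency Encoding	None	No	Low	When frequency is informative
Binary Encoding	O(log k)	No	None	Medium cardinality
WoE (Weight of Evidence)	None	No	Yes	Binary classification, credit scoring
CatBoost Encoding	None	No	Low (ordered statistics)	Built into CatBoost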
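
As a small illustration of one row of the table, frequency encoding can be sketched in a few lines of pandas. The helper names `fit_frequency_map` and `apply_frequency_map` and the toy city values are assumptions made for this example; the point is only that the map is learned on training data and unseen categories receive a frequency of 0.

```python
import pandas as pd


def fit_frequency_map(train_col: pd.Series) -> dict:
    """Learn relative category frequencies from training data only."""
    return train_col.value_counts(normalize=True).to_dict()


def apply_frequency_map(col: pd.Series, freq_map: dict) -> pd.Series:
    """Replace each category by its training frequency (0.0 if unseen)."""
    return col.map(freq_map).fillna(0.0).astype(float)


train_city = pd.Series(["seoul", "seoul", "busan", "seoul"])
test_city = pd.Series(["busan", "jeju"])

freq_map = fit_frequency_map(train_city)
encoded_test = apply_frequency_map(test_city, freq_map)
print(encoded_test.tolist())
```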

import category_encoders as ce
from sklearn.model_selection import KFold

def target_encoding_with_cv(
    train: pd.DataFrame,
    target: str,
    cat_cols: list,
    n_folds: int = 5,
    smoothing: float = 10.0
) -> pd.DataFrame:
    """
    Target encoding with K-fold CV (guards against target leakage).

    Key idea: each row is encoded with statistics computed on the other
    folds only, so a row's own target never flows into its feature.

    smoothing: weight of the global mean in the blend
      encoded = (count * cat_mean + smoothing * global_mean) / (count + smoothing)
    """
    train = train.copy()
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    for col in cat_cols:
        enc_col = f'{col}_target_enc'
        train[enc_col] = np.nan
        enc_pos = train.columns.get_loc(enc_col)

        for train_idx, val_idx in kf.split(train):
            # Build the encoding map from the in-fold training rows
            fold_train = train.iloc[train_idx]
            global_mean = fold_train[target].mean()

            agg = fold_train.groupby(col)[target].agg(['mean', 'count'])
            agg['smoothed'] = (
                (agg['count'] * agg['mean'] + smoothing * global_mean)
                / (agg['count'] + smoothing)
            )
            mapping = agg['smoothed'].to_dict()

            # Apply it to the held-out rows; unseen categories fall back to
            # the global mean. Positional assignment stays correct even
            # when the index contains duplicates.
            encoded = (
                train.iloc[val_idx][col].astype(object)
                .map(mapping).fillna(global_mean)
            )
            train.iloc[val_idx, enc_pos] = encoded.to_numpy()

    return train
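
The function above only encodes the training rows. For test or production data, one common approach (an assumption of this sketch, not something the text above prescribes) is to fit a single smoothed map on the full training set and apply it, sending unseen categories to the global mean. The helpers `fit_smoothed_target_map` and `apply_target_map` and the toy frame are hypothetical.

```python
import pandas as pd


def fit_smoothed_target_map(
    train: pd.DataFrame, col: str, target: str, smoothing: float = 10.0
):
    """Fit the smoothed category -> target-mean map on the full training set."""
    global_mean = train[target].mean()
    agg = train.groupby(col)[target].agg(["mean", "count"])
    smoothed = (agg["count"] * agg["mean"] + smoothing * global_mean) / (
        agg["count"] + smoothing
    )
    return smoothed.to_dict(), global_mean


def apply_target_map(df: pd.DataFrame, col: str, mapping: dict, global_mean: float):
    """Encode new data; categories never seen in training get the global mean."""
    return df[col].map(mapping).fillna(global_mean)


train_df = pd.DataFrame({"city": ["a", "a", "a", "b"], "y": [1, 1, 0, 0]})
test_df = pd.DataFrame({"city": ["a", "b", "c"]})

mapping, global_mean = fit_smoothed_target_map(train_df, "city", "y", smoothing=1.0)
test_enc = apply_target_map(test_df, "city", mapping, global_mean)
print(test_enc.tolist())
```

With `smoothing=1.0`, category "a" (3 rows, mean 2/3) is pulled from 0.667 toward the global mean 0.5, and the single-row category "b" is pulled much harder, which is the intended effect of the smoothing formula.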

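3. Feature Selection¶

Removing unnecessary features guards against overfitting and improves model interpretability. There are three approaches: filter, wrapper, and embedded methods.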

from sklearn.feature_selection import (
    mutual_info_classif, SelectKBest, RFE
)
from sklearn.ensemble import RandomForestClassifier
import shap

# === Filter method: statistic-based ===

def filter_selection(X: pd.DataFrame, y: pd.Series, k: int = 20):
    """
    Filter: model-agnostic and fast, but ignores feature interactions
    - Correlation: captures linear relationships only
    - Mutual information: also captures nonlinear relationships
    - Chi-squared: for non-negative (count or one-hot) features

    Returns the top-k features by MI after removing redundant ones,
    plus the list of features dropped for multicollinearity.
    """
    # Mutual information (X must be numeric with no missing values)
    mi_scores = mutual_info_classif(X, y, random_state=42)
    mi_ranking = pd.Series(mi_scores, index=X.columns).sort_values(ascending=False)

    # Multicollinearity removal (correlation-based)
    corr_matrix = X.corr().abs()
    upper = corr_matrix.where(
        np.triu(np.ones(corr_matrix.shape), k=1).astype(bool)
    )
    # For each pair with |corr| > 0.95, drop the one with the lower MI score
    to_drop = set()
    for col in upper.columns:
        for hc in upper.index[upper[col] > 0.95]:
            to_drop.add(hc if mi_ranking[col] >= mi_ranking[hc] else col)

    kept = mi_ranking.drop(index=list(to_drop))
    return kept.head(k).index.tolist(), sorted(to_drop)


# === Wrapper method: Boruta ===

def boruta_selection(X: pd.DataFrame, y: pd.Series):
    """
    Boruta: a Random Forest based wrapper method
    - Tests statistical significance by comparing each feature against
      shadow features (shuffled copies of the original features)
    - All-relevant: finds every relevant feature (vs. RFE's minimal-optimal set)
    """
    from boruta import BorutaPy

    rf = RandomForestClassifier(
        n_estimators=200,
        n_jobs=-1,
        random_state=42,
        max_depth=7
    )
    boruta = BorutaPy(
        estimator=rf,
        n_estimators='auto',
        max_iter=100,
        random_state=42
    )
    boruta.fit(X.values, y.values)

    selected = X.columns[boruta.support_].tolist()
    tentative = X.columns[boruta.support_weak_].tolist()

    return {
        'confirmed': selected,
        'tentative': tentative,
        'ranking': dict(zip(X.columns, boruta.ranking_))
    }
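
To make the shadow-feature idea concrete, the sketch below runs a single comparison round using scikit-learn only. It is a deliberately simplified illustration: real Boruta repeats this many times and applies a statistical test to the hit counts, which this sketch omits. The function name `shadow_feature_round` and the synthetic data are assumptions for the example.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier


def shadow_feature_round(X: pd.DataFrame, y, seed: int = 0) -> pd.Series:
    """One simplified shadow-feature round (not the full Boruta procedure).

    Returns a boolean Series: True where a real feature's importance
    exceeds the best importance achieved by any shuffled shadow feature.
    """
    rng = np.random.default_rng(seed)
    # Shadow features: each column shuffled independently, breaking its link to y
    shadows = pd.DataFrame(
        {f"shadow_{c}": rng.permutation(X[c].to_numpy()) for c in X.columns},
        index=X.index,
    )
    combined = pd.concat([X, shadows], axis=1)
    rf = RandomForestClassifier(n_estimators=200, random_state=seed)
    rf.fit(combined, y)
    importance = pd.Series(rf.feature_importances_, index=combined.columns)
    threshold = importance[shadows.columns].max()
    return importance[X.columns] > threshold


rng = np.random.default_rng(42)
signal = rng.normal(size=300)
X_demo = pd.DataFrame({"signal": signal, "noise": rng.normal(size=300)})
y_demo = (signal > 0).astype(int)

hits = shadow_feature_round(X_demo, y_demo)
print(hits)
```

The `signal` column determines the label, so it should clear the shadow threshold; whether the pure-noise column does in any single round is down to chance, which is exactly why Boruta aggregates over many rounds.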


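# === Model-based attribution: SHAP ===
# (SHAP is computed post hoc on a trained model; it is grouped here with
#  embedded methods because it reuses the model's own structure.)

def shap_feature_importance(
    model, X: pd.DataFrame, top_k: int = 20
) -> pd.DataFrame:
    """
    Feature importance from SHAP (SHapley Additive exPlanations)
    - Applies Shapley values from game theory to ML
    - Averages each feature's marginal contribution over feature coalitions
    - Model-specific explainers: TreeExplainer, KernelExplainer, etc.
    """
    explainer = shap.TreeExplainer(model)
    shap_values = explainer.shap_values(X)

    if isinstance(shap_values, list):
        # Older shap versions: one (n_samples, n_features) array per class
        importance = np.mean([np.abs(sv).mean(axis=0) for sv in shap_values], axis=0)
    else:
        importance = np.abs(np.asarray(shap_values)).mean(axis=0)
        if importance.ndim == 2:
            # Newer shap versions: (n_samples, n_features, n_classes);
            # average over classes
            importance = importance.mean(axis=1)

    result = pd.DataFrame({
        'feature': X.columns,
        'shap_importance': importance
    }).sort_values('shap_importance', ascending=False)

    return result.head(top_k)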

4. Automated Feature Engineering¶

Automated techniques that address the limits of manual feature design.

4.1 Deep Feature Synthesis (DFS)¶

import featuretools as ft
import pandas as pd

def deep_feature_synthesis_example(
    customers_df: pd.DataFrame, transactions_df: pd.DataFrame
):
    """
    Deep Feature Synthesis (Kanter & Veeramachaneni, 2015)

    Key ideas:
    - Generate features automatically by following relationships
      between the entities of a relational dataset
    - Combine transform primitives with aggregation primitives
    - Increasing the depth stacks primitives into composite features
      (e.g., MEAN(SUM(amount)) across three related tables)

    Example primitives (Featuretools names):
      Transform: absolute, natural_logarithm, square_root, year, month,
                 weekday, is_weekend
      Aggregation: mean, sum, count, std, max, min, mode, trend
    """
    # Build the EntitySet (relational data)
    es = ft.EntitySet(id="transactions")

    es = es.add_dataframe(
        dataframe_name="customers",
        dataframe=customers_df,
        index="customer_id"
    )
    es = es.add_dataframe(
        dataframe_name="transactions",
        dataframe=transactions_df,
        index="transaction_id",
        time_index="transaction_time"
    )
    # Arguments: parent dataframe, parent column, child dataframe, child column
    es = es.add_relationship("customers", "customer_id",
                              "transactions", "customer_id")

    # Run DFS
    feature_matrix, feature_defs = ft.dfs(
        entityset=es,
        target_dataframe_name="customers",
        max_depth=2,                    # feature synthesis depth
        agg_primitives=["mean", "sum", "count", "std", "max", "min"],
        trans_primitives=["month", "weekday", "is_weekend"],
        n_jobs=-1
    )
    # Result: MEAN(transactions.amount), STD(transactions.amount),
    #         COUNT(transactions), MAX(transactions.amount), ...

    return feature_matrix, feature_defs
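
For intuition, the depth-1 aggregation features that DFS produces for a parent table can be reproduced by hand with a pandas groupby. This sketch does not use Featuretools; the toy tables are assumptions, and the column names are chosen to mimic the DFS naming style shown above.

```python
import pandas as pd

customers = pd.DataFrame({"customer_id": [1, 2]})
transactions = pd.DataFrame({
    "transaction_id": [10, 11, 12],
    "customer_id": [1, 1, 2],
    "amount": [100.0, 50.0, 30.0],
})

# Aggregate the child table (transactions) up to the parent key (customer_id)
agg = (
    transactions.groupby("customer_id")["amount"]
    .agg(["mean", "sum", "count"])
    .rename(columns={
        "mean": "MEAN(transactions.amount)",
        "sum": "SUM(transactions.amount)",
        "count": "COUNT(transactions)",
    })
)

# Attach the aggregates to the parent table, one row per customer
manual_features = customers.merge(
    agg, left_on="customer_id", right_index=True, how="left"
)
print(manual_features)
```

DFS automates exactly this pattern across every relationship and primitive, and at greater depth it applies further primitives to the columns produced here.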

4.2 OpenFE (ICML 2023)¶

from openfe import OpenFE, transform

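def openfe_example(X_train, y_train, X_test):
    """
    OpenFE (Zhang et al., ICML 2023)

    Key ideas:
    1. Feature boosting: estimate the incremental gain of candidate
       features on top of the predictions of a model trained on the
       base features, instead of retraining from scratch
    2. Two-stage pruning:
       Stage 1: successive featurewise pruning on growing data subsets
                (cuts the candidate pool sharply)
       Stage 2: feature importance attribution using a model trained on
                the surviving candidates

    Strengths:
    - Automates feature generation at a level comparable to experts
    - Outperforms earlier AutoFE baselines on the benchmarks reported
      in the paper
    - Much cheaper than exhaustive expand-and-evaluate search such as
      ExploreKit
    """
    ofe = OpenFE()
    # fit() subsamples data internally while pruning candidates; check the
    # installed version's signature for the available subsampling options.
    features = ofe.fit(
        data=X_train,
        label=y_train,
        n_jobs=8,
        task='classification',       # or 'regression'
        feature_boosting=True,       # key idea: incremental evaluation
        stage2_params={'verbose': -1}
    )

    # Apply the top-ranked features
    X_train_new, X_test_new = transform(
        X_train, X_test,
        features[:20],              # top 20 features
        n_jobs=8
    )

    return X_train_new, X_test_new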


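# How OpenFE works internally (simplified)
def feature_boosting_concept(X, y, candidate_feature):
    """
    Core idea of feature boosting (conceptual sketch, binary classification):

    1. Train LightGBM on the existing features -> compute residuals
    2. Train a single tree that predicts the residuals from one
       candidate feature alone
    3. Residual-prediction quality = incremental contribution of that feature

    The base model is trained once and reused, so each of the k candidates
    costs only one small incremental fit instead of a full retraining on
    all n existing features plus the candidate.

    Note: residuals are computed in-sample here for brevity; held-out
    predictions give a less optimistic estimate.
    """
    import lightgbm as lgb

    # Step 1: residuals of the base model
    base_model = lgb.LGBMClassifier().fit(X, y)
    base_pred = base_model.predict_proba(X)[:, 1]
    residual = y - base_pred

    # Step 2: predict the residuals from the single candidate feature
    single_tree = lgb.LGBMRegressor(
        n_estimators=1, max_depth=3
    ).fit(candidate_feature.values.reshape(-1, 1), residual)

    # Step 3: residual-prediction quality (R^2) = incremental contribution
    score = single_tree.score(
        candidate_feature.values.reshape(-1, 1), residual
    )
    return score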

4.3 CAAFE: LLM-Based Feature Engineering (NeurIPS 2023)¶

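from caafe import CAAFEClassifier
from sklearn.ensemble import RandomForestClassifier

def caafe_example(df_train, df_test, target_column: str, dataset_description: str):
    """
    CAAFE (Hollmann et al., NeurIPS 2023)

    Key ideas:
    - Give an LLM (GPT-4, etc.) the dataset description and feature metadata
    - The LLM uses domain knowledge to propose semantically meaningful features
    - Iterative feedback: the measured effect of each generated feature
      is fed back to the LLM for refinement

    Strengths:
    - Produces meaningful features even without in-house domain expertise
    - Human-interpretable features (code plus explanation)
    - Reaches strong performance with fewer candidates than classic AutoFE

    Note: the call pattern below follows the project README (fit_pandas,
    .code) and should be verified against the installed version. An OpenAI
    API key must be configured before fitting.
    """
    clf = CAAFEClassifier(
        # base_classifier takes an estimator object, not a string
        base_classifier=RandomForestClassifier(random_state=42),
        llm_model="gpt-4",
        iterations=10,              # number of LLM rounds
    )
    clf.fit_pandas(
        df_train,
        target_column_name=target_column,
        dataset_description=dataset_description,
    )
    predictions = clf.predict(df_test)

    # Inspect the generated feature code (explanations appear as comments)
    print(clf.code)

    return predictions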

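Comparison of Main Approaches¶

Method	Type	Input	Strengths	Weaknesses	Compute cost
Manual design	Manual	Domain knowledge	Interpretable, high quality	Does not scale, depends on experts	High labor cost
PolynomialFeatures	Brute-force	Numeric	Simple to implement	Dimensionality explosion	O(n^d)
Deep Feature Synthesis	Automated (relational)	EntitySet	Strong on relational data	Little benefit on single-table data	Medium
ExploreKit	Automated (search)	Table	Wide range of operators	Slow, very large search space	Very high
OpenFE	Automated (boosting)	Table	Fast and accurate	Single-table (non-relational) data only	Low to medium
CAAFE	LLM-based	Table + description	Semantic, interpretable features	LLM API cost, non-deterministic	API call cost
LLM-FE	LLM + evolutionary optimization	Table + description	Evolutionary search covers a wider space	High API cost	Very high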

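Comparison of Feature Selection Methods¶

Method	Category	Model dependence	Accounts for feature interactions	Speed
Correlation	Filter	None	No (pairwise linear only)	Very fast
Mutual Information	Filter	None	No (pairwise, including nonlinear)	Fast
Chi-squared	Filter	None	No	Fast
RFE	Wrapper	Requires a model	Yes (indirectly)	Slow
Boruta	Wrapper	RF-based	Yes (through the RF)	Slow
L1 (Lasso)	Embedded	Linear model	No	Fast
Tree Importance	Embedded	Tree model	Yes	Fast
SHAP	Post-hoc attribution	Requires a model	Yes (Shapley values)	Medium
Permutation Importance	Model-agnostic	Requires a model	Yes (indirectly)	Medium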
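
Permutation importance appears in the table but has no example elsewhere in this page. A minimal sketch using scikit-learn's `permutation_importance` is shown below; the synthetic data (three features, with only the first one driving the label) and the choice of logistic regression are assumptions made for illustration.

```python
import numpy as np
from sklearn.inspection import permutation_importance
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split

# Synthetic data: only column 0 determines the label
rng = np.random.default_rng(0)
X_all = rng.normal(size=(400, 3))
y_all = (X_all[:, 0] > 0).astype(int)

X_tr, X_val, y_tr, y_val = train_test_split(
    X_all, y_all, test_size=0.5, random_state=0
)
model = LogisticRegression().fit(X_tr, y_tr)

# Shuffle each column of the held-out set and measure the drop in accuracy
result = permutation_importance(
    model, X_val, y_val, n_repeats=10, random_state=0
)
ranking = np.argsort(result.importances_mean)[::-1]
print(ranking, result.importances_mean)
```

Computing the importances on held-out data rather than the training set is the usual recommendation, since it measures what the model actually relies on to generalize.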

Practical Feature Engineering Pipeline¶

from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import (
    StandardScaler, OneHotEncoder, FunctionTransformer
)
from sklearn.impute import SimpleImputer
import category_encoders as ce
import numpy as np
import pandas as pd

def build_feature_pipeline(
    numeric_cols: list,
    categorical_cols: list,
    high_cardinality_cols: list,
    datetime_cols: list
) -> ColumnTransformer:
    """
    Practical feature engineering pipeline

    Design principles:
    1. No leakage: fit on train only, transform both train and test
    2. Order: impute -> transform -> scale
    3. Categorical: choose the encoding strategy by cardinality
    4. Time: cyclical (sin/cos) encoding preserves continuity
    """
    # Numeric pipeline (log1p assumes non-negative inputs)
    numeric_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='median')),
        ('log_transform', FunctionTransformer(
            np.log1p, validate=True
        )),
        ('scaler', StandardScaler())
    ])

    # Low-cardinality categoricals (min_frequency needs scikit-learn >= 1.1,
    # sparse_output needs >= 1.2)
    categorical_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='most_frequent')),
        ('encoder', OneHotEncoder(
            handle_unknown='ignore',
            sparse_output=False,
            min_frequency=0.01    # categories under 1% are grouped as infrequent
        ))
    ])

    # High-cardinality categoricals. Fitting a target encoder in one pass
    # still leaks on the training rows; out-of-fold encoding handled
    # separately is recommended.
    high_card_pipeline = Pipeline([
        ('imputer', SimpleImputer(strategy='constant', fill_value='MISSING')),
        ('encoder', ce.TargetEncoder(
            smoothing=10.0,
            min_samples_leaf=20
        ))
    ])

    # Datetime features (cyclical encoding); expects a DataFrame input
    def datetime_features(X):
        """Cyclical (sin/cos) encoding of time features"""
        result = pd.DataFrame(index=X.index)
        for col in X.columns:
            dt = pd.to_datetime(X[col])
            # Month (12-month cycle)
            result[f'{col}_month_sin'] = np.sin(2 * np.pi * dt.dt.month / 12)
            result[f'{col}_month_cos'] = np.cos(2 * np.pi * dt.dt.month / 12)
            # Day of week (7-day cycle)
            result[f'{col}_dow_sin'] = np.sin(2 * np.pi * dt.dt.dayofweek / 7)
            result[f'{col}_dow_cos'] = np.cos(2 * np.pi * dt.dt.dayofweek / 7)
            # Hour (24-hour cycle)
            result[f'{col}_hour_sin'] = np.sin(2 * np.pi * dt.dt.hour / 24)
            result[f'{col}_hour_cos'] = np.cos(2 * np.pi * dt.dt.hour / 24)
        return result

    datetime_pipeline = Pipeline([
        ('extract', FunctionTransformer(datetime_features))
    ])

    # Combine everything
    preprocessor = ColumnTransformer(
        transformers=[
            ('num', numeric_pipeline, numeric_cols),
            ('cat', categorical_pipeline, categorical_cols),
            ('high_cat', high_card_pipeline, high_cardinality_cols),
            ('dt', datetime_pipeline, datetime_cols),
        ],
        remainder='drop'
    )

    return preprocessor
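
Why the pipeline encodes time with sin/cos pairs can be seen in a small numeric check: on the unit circle, December and January end up close together, whereas the raw month numbers 12 and 1 are far apart. The helper `cyclical_encode` below is a hypothetical, stand-alone version of the same idea.

```python
import numpy as np


def cyclical_encode(values, period: float) -> np.ndarray:
    """Map periodic values onto the unit circle as (sin, cos) pairs."""
    angle = 2 * np.pi * np.asarray(values, dtype=float) / period
    return np.column_stack([np.sin(angle), np.cos(angle)])


# December, January, June
months = cyclical_encode([12, 1, 6], period=12)

dist_dec_jan = np.linalg.norm(months[0] - months[1])
dist_dec_jun = np.linalg.norm(months[0] - months[2])
print(dist_dec_jan, dist_dec_jun)
```

December and June sit on opposite sides of the circle, so their distance is the maximum possible (the diameter, 2), while adjacent months are all equally close. Both the sin and the cos column are needed; either one alone maps two different months to the same value.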

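Feature Engineering Anti-Patterns¶

Anti-pattern	Problem	Fix
Target leakage	Using features derived from the target -> overfitting	Verify temporal causality, use CV-based encoding
Unbounded feature generation	Curse of dimensionality, fitting to noise	Pair generation with feature selection, prune by importance
No scaling	Degrades distance-based models (KNN, SVM)	StandardScaler, RobustScaler
Ignoring missing values	Information loss or bias	Add missing-indicator features
Fitting on train and test together	Leaks future information	Separate fit/transform with a Pipeline
One-Hot on high cardinality	Sparse matrices, memory blow-up	Use Target or Frequency Encoding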
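
Two of the fixes in the table (missing-indicator features, and fitting on the training split only) can be shown together with scikit-learn's `SimpleImputer`. The tiny arrays below are assumptions chosen so the behavior is easy to follow: the median is learned from the training data and reused on the test data, and `add_indicator=True` appends a 0/1 column marking which values were originally missing.

```python
import numpy as np
from sklearn.impute import SimpleImputer

X_train = np.array([[1.0], [3.0], [np.nan], [5.0]])
X_test = np.array([[np.nan], [100.0]])

# Fit on the training split only; the test split is transformed, never fitted
imputer = SimpleImputer(strategy="median", add_indicator=True)
imputer.fit(X_train)

learned_median = imputer.statistics_[0]
test_out = imputer.transform(X_test)
print(learned_median)
print(test_out)
```

Had the imputer been fitted on train and test together, the extreme test value 100.0 would have shifted the median, which is the kind of leak the last-but-one row of the table warns about.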

Key References¶

Paper/resource	Authors	Venue	Contribution
Deep Feature Synthesis	Kanter & Veeramachaneni	IEEE DSAA 2015	Foundation of automated FE for relational data
ExploreKit	Katz et al.	IEEE ICDM 2016	Meta-learning based feature search
OpenFE	Zhang et al.	ICML 2023	Feature boosting, efficient automated FE
CAAFE	Hollmann, Müller, Hutter	NeurIPS 2023	LLM-based context-aware FE
LLM-FE	-	arXiv 2025	LLM + evolutionary optimization for FE
Shap-Select	-	arXiv 2024	Lightweight selection using SHAP + regression
Feature Engineering and Selection	Kuhn & Johnson	2019 (Book)	Practical FE textbook
Automated Data Processing Survey	Mumuni & Mumuni	Journal of Big Data 2024	Comprehensive survey of automated preprocessing/FE

Library	Main features	Notes
scikit-learn	Preprocessing, selection, pipelines	Standard toolkit
Featuretools	Deep Feature Synthesis	Automated FE for relational data
OpenFE	Feature boosting based automated generation	ICML 2023
CAAFE	LLM-based context-aware FE	NeurIPS 2023
category_encoders	20+ categorical encoding methods	Target, WoE, Binary, etc.
tsfresh	Automated time-series feature extraction	Computes hundreds of features automatically
Boruta (boruta_py)	All-relevant feature selection	RF + shadow features
shap	SHAP-based feature importance	Interpretability