Graph Analytics (그래프 분석)¶

메타 정보¶

항목	내용
분류	Graph Theory / Network Science / GNN
핵심 논문	"Semi-Supervised Classification with Graph Convolutional Networks" (ICLR 2017), "Inductive Representation Learning on Large Graphs" (NeurIPS 2017), "Graph Attention Networks" (ICLR 2018)
주요 저자	Thomas Kipf, Max Welling (GCN); William Hamilton, Rex Ying, Jure Leskovec (GraphSAGE); Petar Velickovic et al. (GAT)
핵심 개념	그래프 구조 데이터에서 노드/엣지/그래프 수준의 패턴을 학습하고 예측
관련 분야	Social Network Analysis, Knowledge Graphs, Drug Discovery, Fraud Detection

정의¶

Graph Analytics는 노드(node)와 엣지(edge)로 구성된 그래프 구조 데이터에서 패턴과 인사이트를 추출하는 분석 기법이다. 전통적 그래프 이론부터 최신 GNN(Graph Neural Networks)까지를 포괄한다.

그래프 표현:

  G = (V, E)
  V: 노드 집합 (사용자, 분자, 문서 등)
  E: 엣지 집합 (관계, 결합, 인용 등)

  인접 행렬 A:          특성 행렬 X:
  [0 1 1 0]            [x_1^T]   <- 노드 1의 특성 벡터
  [1 0 0 1]            [x_2^T]
  [1 0 0 1]            [x_3^T]
  [0 1 1 0]            [x_4^T]

  노드 특성:  X in R^{N x F}  (N: 노드 수, F: 특성 차원)
  인접 행렬:  A in R^{N x N}
  엣지 특성:  E in R^{|E| x D} (선택적)

그래프 데이터의 유형¶

유형	특징	예시
Undirected	방향 없는 엣지	친구 관계, 분자 결합
Directed	방향 있는 엣지	팔로우, 인용, 웹 링크
Weighted	가중치 있는 엣지	거리, 유사도, 거래 금액
Bipartite	두 유형의 노드	사용자-상품, 저자-논문
Heterogeneous	다중 노드/엣지 타입	Knowledge Graph
Temporal/Dynamic	시간에 따라 변화	소셜 네트워크 진화
Hypergraph	엣지가 2개 이상 노드 연결	그룹 관계, 공저

전통적 그래프 분석¶

1. 중심성 지표 (Centrality Measures)¶

노드의 "중요도"를 정량화하는 지표들.

중심성 지표:

Degree Centrality:
  C_D(v) = deg(v) / (N - 1)
  직관: 직접 연결이 많을수록 중요

Betweenness Centrality:
  C_B(v) = sum_{s != v != t} (sigma_{st}(v) / sigma_{st})
  직관: 다른 노드들 사이의 최단 경로에 많이 포함될수록 중요
  응용: 네트워크 병목점, 브릿지 노드 탐지

Closeness Centrality:
  C_C(v) = (N - 1) / sum_{u != v} d(v, u)
  직관: 다른 모든 노드에 평균적으로 가까울수록 중요

Eigenvector Centrality:
  C_E(v) = (1/lambda) * sum_{u in N(v)} C_E(u)
  직관: 중요한 이웃을 가진 노드가 중요
  확장: PageRank (Google의 웹페이지 순위)

PageRank:
  PR(v) = (1-d)/N + d * sum_{u in B(v)} PR(u) / L(u)
  d: damping factor (보통 0.85)
  B(v): v를 가리키는 노드 집합
  L(u): u의 outgoing link 수

import networkx as nx
import numpy as np

# 그래프 생성
G = nx.karate_club_graph()

# 중심성 계산
degree = nx.degree_centrality(G)
betweenness = nx.betweenness_centrality(G)
closeness = nx.closeness_centrality(G)
pagerank = nx.pagerank(G, alpha=0.85)

# 상위 5개 노드
for name, centrality in [('Degree', degree), ('Betweenness', betweenness),
                          ('PageRank', pagerank)]:
    top5 = sorted(centrality.items(), key=lambda x: x[1], reverse=True)[:5]
    print(f"\n{name} Top 5:")
    for node, score in top5:
        print(f"  Node {node}: {score:.4f}")

2. 커뮤니티 탐지 (Community Detection)¶

그래프 내에서 밀접하게 연결된 노드 그룹(커뮤니티)을 발견하는 기법.

커뮤니티 구조:

+---+---+---+     +---+---+
| a - b - c |-----|d  - e |
| |   |     |     |  \   |
| f - g     |     |   h  |
+-----------+     +------+
  Community 1     Community 2

내부 연결(intra) >> 외부 연결(inter)

Modularity¶

Newman & Girvan (2004). 커뮤니티 분할의 품질을 측정하는 지표.

Modularity Q:

  Q = (1/2m) * sum_{ij} [A_{ij} - (k_i * k_j) / (2m)] * delta(c_i, c_j)

  A_{ij}: 인접 행렬
  k_i: 노드 i의 degree
  m: 총 엣지 수
  delta(c_i, c_j): 같은 커뮤니티면 1, 아니면 0

  Q in [-0.5, 1]
  Q > 0.3: 의미 있는 커뮤니티 구조 존재

주요 알고리즘¶

알고리즘	원리	시간 복잡도	특징
Louvain	Modularity 탐욕적 최적화	O(N log N)	빠르고 실용적
Leiden	Louvain 개선 (connected 보장)	O(N log N)	더 안정적 결과
Label Propagation	이웃의 레이블 전파	O(m)	매우 빠름
Spectral Clustering	Laplacian 고유벡터 분해	O(N^3)	이론적 보장
Girvan-Newman	높은 betweenness 엣지 제거	O(m^2 * N)	정확하나 느림
Infomap	정보 흐름 최소화	O(m)	방향 그래프에 강함

import networkx as nx
from networkx.algorithms import community

G = nx.karate_club_graph()

# Louvain (networkx 내장)
communities_louvain = community.louvain_communities(G, resolution=1.0)
print(f"Louvain: {len(communities_louvain)} communities")

# Modularity 계산
modularity = community.modularity(G, communities_louvain)
print(f"Modularity: {modularity:.4f}")

# Label Propagation
communities_lpa = community.label_propagation_communities(G)
print(f"LPA: {len(list(communities_lpa))} communities")

3. Link Prediction¶

존재하지 않는 엣지의 형성 가능성을 예측한다.

Link Prediction:

현재 그래프:          예측:
a -- b               a -- b
|    |               |  X |    <- a-d 연결 예측?
c -- d               c -- d

점수 함수 (heuristic):

  Common Neighbors:    |N(u) intersection N(v)|
  Jaccard:             |N(u) intersection N(v)| / |N(u) union N(v)|
  Adamic-Adar:         sum_{w in N(u) intersection N(v)} 1/log(|N(w)|)
  Preferential Attach: |N(u)| * |N(v)|

  Adamic-Adar 직관:
  "공통 이웃이 있되, 그 이웃이 선택적이면 더 의미있다"

4. Graph Embedding (Node2Vec, DeepWalk)¶

그래프의 노드를 저차원 벡터 공간에 매핑한다.

Graph Embedding:

그래프 공간              벡터 공간 (d차원)
  a -- b                 a(0.2, 0.8)  b(0.3, 0.7)
  |    |       embed     
  c -- d      ------>    c(0.1, 0.6)  d(0.4, 0.5)

DeepWalk (Perozzi et al., KDD 2014):
  1. 랜덤 워크로 노드 시퀀스 생성
  2. Word2Vec (Skip-gram)으로 임베딩 학습

Node2Vec (Grover & Leskovec, KDD 2016):
  1. BFS/DFS 편향된 랜덤 워크
  2. p, q 파라미터로 탐색 전략 제어
     p: return parameter (이전 노드로 돌아갈 확률)
     q: in-out parameter (BFS vs DFS 편향)
     q < 1: BFS 성향 (지역 구조)
     q > 1: DFS 성향 (전역 구조)

from node2vec import Node2Vec

# Node2Vec 임베딩
node2vec = Node2Vec(G, dimensions=64, walk_length=30, 
                     num_walks=200, p=1, q=0.5)
model = node2vec.fit(window=10, min_count=1)

# 노드 임베딩 벡터
embedding = model.wv['0']  # 노드 0의 임베딩

# 유사 노드 검색
similar = model.wv.most_similar('0', topn=5)

Graph Neural Networks (GNN)¶

메시지 패싱 프레임워크¶

대부분의 GNN은 메시지 패싱(Message Passing) 프레임워크를 따른다.

Message Passing Neural Network (MPNN):

각 레이어에서:
  1. Message: 이웃으로부터 메시지 수집
     m_v^{(l)} = AGG({h_u^{(l-1)} : u in N(v)})

  2. Update: 자신의 표현 갱신
     h_v^{(l)} = UPDATE(h_v^{(l-1)}, m_v^{(l)})

K개 레이어 스택:
  h_v^{(0)} = x_v  (초기 노드 특성)

  Layer 1: 1-hop 이웃 정보 집계
  Layer 2: 2-hop 이웃 정보 집계 (이웃의 이웃)
  ...
  Layer K: K-hop 이웃 정보 집계

  최종 h_v^{(K)}: K-hop 수용 영역(receptive field)의 정보 인코딩

1. GCN (Graph Convolutional Network)¶

Kipf & Welling (ICLR 2017). 스펙트럴 그래프 이론에 기반한 단순하고 효과적인 GNN.

GCN Layer:

  H^{(l+1)} = sigma(D_hat^{-1/2} * A_hat * D_hat^{-1/2} * H^{(l)} * W^{(l)})

  A_hat = A + I_N          (자기 루프 추가)
  D_hat = diag(sum(A_hat))  (차수 행렬)

  직관:
  - 각 노드의 표현을 이웃의 표현과 자기 자신의 평균으로 갱신
  - 정규화(D_hat^{-1/2})로 degree 차이에 의한 스케일 문제 해결

노드별 연산:
  h_v^{(l+1)} = sigma(sum_{u in N(v) union {v}} (1/sqrt(d_v * d_u)) * h_u^{(l)} * W^{(l)})

import torch
import torch.nn as nn
import torch.nn.functional as F

class GCNLayer(nn.Module):
    def __init__(self, in_features, out_features):
        super().__init__()
        self.weight = nn.Parameter(torch.FloatTensor(in_features, out_features))
        nn.init.xavier_uniform_(self.weight)

    def forward(self, x, adj_hat):
        """
        x: (N, in_features) 노드 특성
        adj_hat: (N, N) 정규화된 인접 행렬 (D^{-1/2} A_hat D^{-1/2})
        """
        support = torch.mm(x, self.weight)       # (N, out_features)
        output = torch.spmm(adj_hat, support)    # 이웃 집계
        return output

class GCN(nn.Module):
    def __init__(self, n_features, n_hidden, n_classes, dropout=0.5):
        super().__init__()
        self.gc1 = GCNLayer(n_features, n_hidden)
        self.gc2 = GCNLayer(n_hidden, n_classes)
        self.dropout = dropout

    def forward(self, x, adj):
        x = F.relu(self.gc1(x, adj))
        x = F.dropout(x, self.dropout, training=self.training)
        x = self.gc2(x, adj)
        return F.log_softmax(x, dim=1)

# 정규화된 인접 행렬 생성
def normalize_adj(adj):
    """D^{-1/2} (A + I) D^{-1/2}"""
    adj = adj + torch.eye(adj.size(0))
    degree = adj.sum(dim=1)
    d_inv_sqrt = degree.pow(-0.5)
    d_inv_sqrt[d_inv_sqrt == float('inf')] = 0
    d_mat = torch.diag(d_inv_sqrt)
    return d_mat @ adj @ d_mat

2. GraphSAGE¶

Hamilton et al. (NeurIPS 2017). 귀납적(inductive) 학습이 가능한 GNN으로, 학습 시 보지 못한 새 노드에도 적용 가능하다.

GraphSAGE:

GCN의 한계:
  - Transductive: 학습 시 전체 그래프 필요
  - 새 노드 추가 시 재학습 필요

GraphSAGE 해결:
  - 이웃 샘플링 + 집계 함수 학습
  - Inductive: 새 노드에도 즉시 적용 가능

알고리즘:
  for k = 1, ..., K:
    for v in V:
      1. 이웃 샘플링: N_sample(v) = SAMPLE(N(v), S)  (S개 이웃)
      2. 이웃 집계:   h_{N(v)}^k = AGG_k({h_u^{k-1} : u in N_sample(v)})
      3. 자기 결합:   h_v^k = sigma(W^k * CONCAT(h_v^{k-1}, h_{N(v)}^k))
      4. 정규화:      h_v^k = h_v^k / ||h_v^k||_2

집계 함수 옵션:
  Mean:  mean({h_u : u in N(v)})
  LSTM:  LSTM(PERMUTE({h_u : u in N(v)}))
  Pool:  max({sigma(W_pool * h_u + b) : u in N(v)})

3. GAT (Graph Attention Network)¶

Velickovic et al. (ICLR 2018). 어텐션 메커니즘으로 이웃의 중요도를 학습한다.

GAT Layer:

1. Attention coefficient 계산:
   e_{ij} = LeakyReLU(a^T * [W*h_i || W*h_j])

2. 정규화 (softmax):
   alpha_{ij} = softmax_j(e_{ij}) = exp(e_{ij}) / sum_{k in N(i)} exp(e_{ik})

3. 가중 집계:
   h_i' = sigma(sum_{j in N(i)} alpha_{ij} * W * h_j)

Multi-head Attention (K heads):
   h_i' = CONCAT_{k=1}^{K} sigma(sum_{j in N(i)} alpha_{ij}^k * W^k * h_j)
   또는 (마지막 레이어):
   h_i' = sigma((1/K) * sum_{k=1}^{K} sum_{j in N(i)} alpha_{ij}^k * W^k * h_j)

핵심 차이 (GCN vs GAT):
  GCN: 이웃 가중치 = 1/sqrt(d_i * d_j)  (구조 기반, 고정)
  GAT: 이웃 가중치 = alpha_{ij}          (학습 기반, 적응적)

import torch
import torch.nn as nn
import torch.nn.functional as F

class GATLayer(nn.Module):
    def __init__(self, in_features, out_features, n_heads=8, dropout=0.6):
        super().__init__()
        self.n_heads = n_heads
        self.head_dim = out_features // n_heads

        self.W = nn.Linear(in_features, out_features, bias=False)
        self.a = nn.Parameter(torch.FloatTensor(n_heads, 2 * self.head_dim))
        nn.init.xavier_uniform_(self.a.unsqueeze(0))

        self.leaky_relu = nn.LeakyReLU(0.2)
        self.dropout = nn.Dropout(dropout)

    def forward(self, x, edge_index):
        """
        x: (N, in_features)
        edge_index: (2, E) 엣지 인덱스
        """
        N = x.size(0)
        h = self.W(x).view(N, self.n_heads, self.head_dim)  # (N, heads, dim)

        src, dst = edge_index

        # Attention coefficients
        h_src = h[src]  # (E, heads, dim)
        h_dst = h[dst]  # (E, heads, dim)

        e = (torch.cat([h_src, h_dst], dim=-1) * self.a).sum(dim=-1)  # (E, heads)
        e = self.leaky_relu(e)

        # Softmax per destination node
        alpha = torch.zeros(N, self.n_heads, device=x.device)
        # (simplified -- 실제로는 scatter_softmax 사용)
        alpha = F.softmax(e, dim=0)  # 근사
        alpha = self.dropout(alpha)

        # Weighted aggregation
        out = torch.zeros(N, self.n_heads, self.head_dim, device=x.device)
        out.index_add_(0, dst, alpha.unsqueeze(-1) * h_src)

        return out.view(N, -1)  # (N, n_heads * head_dim)

GCN vs GraphSAGE vs GAT 비교¶

특성	GCN	GraphSAGE	GAT
집계 방식	정규화된 평균	샘플링 + 학습 가능	Attention 가중
학습 방식	Transductive	Inductive	둘 다 가능
이웃 처리	전체 이웃	샘플링	전체 이웃
확장성	중간	높음 (미니배치)	중간
해석성	낮음	중간	높음 (attention weights)
이질적 그래프	기본 지원 X	확장 가능	확장 가능

태스크별 분류¶

Node-Level Tasks¶

태스크	설명	예시
Node Classification	노드의 클래스 예측	논문 주제 분류, 사용자 프로파일링
Node Regression	노드의 연속값 예측	분자 내 원자의 전하 예측
Node Clustering	유사 노드 그룹핑	커뮤니티 탐지

Edge-Level Tasks¶

태스크	설명	예시
Link Prediction	미래 연결 예측	추천, 소셜 네트워크
Edge Classification	엣지 유형 분류	Knowledge Graph Completion

Graph-Level Tasks¶

태스크	설명	예시
Graph Classification	그래프 전체의 클래스 예측	분자 독성 예측, 약물 발견
Graph Regression	그래프의 연속값 예측	분자 성질 예측
Graph Generation	새 그래프 생성	약물 분자 설계

주요 벤치마크¶

데이터셋	도메인	노드 수	태스크
Cora	논문 인용	2,708	Node Classification
CiteSeer	논문 인용	3,327	Node Classification
PubMed	의학 논문	19,717	Node Classification
ogbn-arxiv	arXiv 논문	169,343	Node Classification
ogbn-products	Amazon 상품	2,449,029	Node Classification
ogbl-collab	공저 네트워크	235,868	Link Prediction
ZINC	분자	12,000 graphs	Graph Regression
QM9	분자 성질	130,000 graphs	Graph Regression

Cora 노드 분류 성능¶

방법	정확도 (%)
DeepWalk + Logistic	67.2
GCN (2-layer)	81.5
GAT (8-head)	83.0
GraphSAGE-mean	78.9
GIN	82.7
Graph Transformer	84.0+

실전 파이프라인 (PyG)¶

import torch
import torch.nn.functional as F
from torch_geometric.nn import GCNConv, GATConv, SAGEConv
from torch_geometric.datasets import Planetoid

# 1. 데이터 로드
dataset = Planetoid(root='/tmp/Cora', name='Cora')
data = dataset[0]

# 2. GNN 모델 정의
class GNNClassifier(torch.nn.Module):
    def __init__(self, in_channels, hidden_channels, out_channels, 
                 model_type='gcn', heads=8, dropout=0.6):
        super().__init__()
        self.dropout = dropout
        self.model_type = model_type

        if model_type == 'gcn':
            self.conv1 = GCNConv(in_channels, hidden_channels)
            self.conv2 = GCNConv(hidden_channels, out_channels)
        elif model_type == 'gat':
            self.conv1 = GATConv(in_channels, hidden_channels // heads, heads=heads)
            self.conv2 = GATConv(hidden_channels, out_channels, heads=1)
        elif model_type == 'sage':
            self.conv1 = SAGEConv(in_channels, hidden_channels)
            self.conv2 = SAGEConv(hidden_channels, out_channels)

    def forward(self, data):
        x, edge_index = data.x, data.edge_index

        x = self.conv1(x, edge_index)
        x = F.relu(x)
        x = F.dropout(x, p=self.dropout, training=self.training)
        x = self.conv2(x, edge_index)

        return F.log_softmax(x, dim=1)

# 3. 학습
model = GNNClassifier(
    in_channels=dataset.num_features,
    hidden_channels=64,
    out_channels=dataset.num_classes,
    model_type='gat'
)
optimizer = torch.optim.Adam(model.parameters(), lr=0.005, weight_decay=5e-4)

def train():
    model.train()
    optimizer.zero_grad()
    out = model(data)
    loss = F.nll_loss(out[data.train_mask], data.y[data.train_mask])
    loss.backward()
    optimizer.step()
    return loss.item()

def evaluate():
    model.eval()
    out = model(data)
    pred = out.argmax(dim=1)

    accs = {}
    for split in ['train_mask', 'val_mask', 'test_mask']:
        mask = getattr(data, split)
        correct = pred[mask] == data.y[mask]
        accs[split] = correct.sum().item() / mask.sum().item()
    return accs

# 4. 학습 루프
for epoch in range(200):
    loss = train()
    if epoch % 20 == 0:
        accs = evaluate()
        print(f"Epoch {epoch:3d} | Loss: {loss:.4f} | "
              f"Train: {accs['train_mask']:.4f} | "
              f"Val: {accs['val_mask']:.4f} | "
              f"Test: {accs['test_mask']:.4f}")

GNN의 한계와 해결책¶

한계	설명	해결 접근
Over-Smoothing	레이어 깊어지면 노드 표현이 동질화	DropEdge, PairNorm, JKNet
Over-Squashing	먼 노드의 정보가 병목에서 손실	Graph Transformer, Rewiring
Expressiveness	WL-test 한계 (일부 그래프 구분 불가)	GIN, Higher-order GNN
Scalability	대규모 그래프에서 메모리/시간 문제	Mini-batch (GraphSAGE), Cluster-GCN
Heterophily	이웃과 클래스가 다른 경우 성능 저하	H2GCN, GPR-GNN

최근 동향 (2024-2025)¶

트렌드	설명
Graph Transformers	Self-attention을 그래프에 적용, over-smoothing 완화
Graph Foundation Models	다양한 도메인에 전이 가능한 범용 그래프 모델
Dynamic GNN	시간에 따라 변화하는 그래프의 학습
Heterogeneous GNN	다중 노드/엣지 타입을 통합 처리
GNN + LLM	Knowledge Graph 기반 LLM 추론 강화
Trustworthy GNN	공정성, 프라이버시, 강건성, 설명가능성 연구
3D Graph (Geometric)	분자/단백질의 3D 구조 학습 (Equivariant GNN)
Temporal Knowledge Graphs	시간 축을 포함한 Knowledge Graph 완성

참고 문헌¶

Kipf, T. N. & Welling, M. (2017). Semi-Supervised Classification with Graph Convolutional Networks. ICLR 2017.
Hamilton, W. L., Ying, R., & Leskovec, J. (2017). Inductive Representation Learning on Large Graphs. NeurIPS 2017.
Velickovic, P., et al. (2018). Graph Attention Networks. ICLR 2018.
Xu, K., et al. (2019). How Powerful are Graph Neural Networks? ICLR 2019. (GIN)
Grover, A. & Leskovec, J. (2016). node2vec: Scalable Feature Learning for Networks. KDD 2016.
Wu, Z., et al. (2021). A Comprehensive Survey on Graph Neural Networks. IEEE TNNLS.
Liu, Y., et al. (2024). A Survey of Graph Neural Networks in Real World. arXiv:2403.04468.