AWS for ML/Data

Overview

What is AWS?

Amazon Web Services (AWS) is the largest cloud computing platform in the world, offering more than 200 services. It launched in 2006 with S3 and EC2 and now holds roughly 32% of the global cloud market.

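Why use AWS for ML/Data workloads?

Advantage	Description
Scale	Storage and compute grow with your data with virtually no upper limit (S3, Redshift)
GPU infrastructure	A range of GPU instances (P4d, P5, G5) available on demand, subject to account quotas and regional capacity
Managed services	SageMaker, EMR, and others minimize infrastructure management
Integrated ecosystem	Data ingestion → storage → processing → training → deployment on one platform
Cost efficiency	Spot Instances can cut costs by up to 90%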

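ML/Data core service map

Area	Service	Purpose
Storage	S3	Data lake, model store
	EBS/EFS	Block/file storage
Compute	EC2	GPU training, custom workloads
	Lambda	Serverless inference, ETL triggers
Database	RDS/Aurora	Structured data
	DynamoDB	Feature store, metadata
	Redshift	Data warehouse
ML	SageMaker	End-to-end ML platform
	Bedrock	Foundation model API
Data processing	EMR	Large-scale processing with Spark
	Glue	ETL, data catalog
	Athena	Serverless SQL queries directly on S3
Orchestration	Step Functions	Workflow management
	EKS	Kubernetes clusters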

1. Storage: S3

1.1 S3 basics

import boto3

s3 = boto3.client('s3')

# Create a bucket
s3.create_bucket(
    Bucket='my-ml-data',
    CreateBucketConfiguration={'LocationConstraint': 'ap-northeast-2'}
)

# Upload a file
s3.upload_file('local_file.csv', 'my-ml-data', 'data/train.csv')

# Download a file
s3.download_file('my-ml-data', 'data/train.csv', 'local_file.csv')

# List objects (a single call returns at most 1,000 keys)
response = s3.list_objects_v2(Bucket='my-ml-data', Prefix='data/')
for obj in response.get('Contents', []):
    print(obj['Key'], obj['Size'])
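
Because one `list_objects_v2` call returns at most 1,000 keys, larger prefixes need pagination. The sketch below shows one way to do that with the boto3 paginator. The helper name `iter_object_keys` is hypothetical, and the client is passed in as an argument so the function can be exercised with a stub.

```python
def iter_object_keys(s3_client, bucket, prefix=""):
    """Yield (key, size) for every object under prefix, following pagination."""
    paginator = s3_client.get_paginator("list_objects_v2")
    for page in paginator.paginate(Bucket=bucket, Prefix=prefix):
        # Pages for an empty prefix carry no "Contents" entry
        for obj in page.get("Contents", []):
            yield obj["Key"], obj["Size"]
```

With a real client this would be called as `for key, size in iter_object_keys(s3, 'my-ml-data', 'data/'): ...`.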

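1.2 S3 storage classes

Class	Use case	Cost	Access time
Standard	Frequently accessed data	High	Immediate
Intelligent-Tiering	Unknown or changing access patterns	Automatic (tiered by access)	Immediate
Standard-IA	Infrequently accessed data	Medium	Immediate
Glacier Instant Retrieval	Archive with immediate access	Low	Milliseconds
Glacier Flexible Retrieval	Long-term archive	Very low	Minutes to hours
Glacier Deep Archive	Long-term retention	Lowest	Within 12 hours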

1.3 ML data lake layout

s3://my-ml-data/
├── raw/                          # Raw source data
│   ├── 2024/01/01/
│   └── 2024/01/02/
├── processed/                    # Preprocessed data
│   ├── train/
│   ├── validation/
│   └── test/
├── features/                     # Feature Store data
├── models/                       # Trained models
│   ├── experiment-001/
│   └── production/
├── artifacts/                    # Training artifacts
│   ├── logs/
│   └── checkpoints/
└── metadata/                     # Metadata
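
A layout like this is easier to keep consistent when keys are built by small helpers rather than by hand. The following is a minimal sketch; the function names `raw_key` and `model_key` are hypothetical and only mirror the tree above.

```python
from datetime import date


def raw_key(day: date, filename: str) -> str:
    """Key for a raw file, partitioned by ingestion date (raw/YYYY/MM/DD/)."""
    return f"raw/{day:%Y/%m/%d}/{filename}"


def model_key(experiment: str, filename: str = "model.tar.gz") -> str:
    """Key for a trained model artifact under models/<experiment>/."""
    return f"models/{experiment}/{filename}"
```

For example, `raw_key(date(2024, 1, 1), 'events.json')` yields a key under `raw/2024/01/01/`.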

1.4 Lifecycle policy

{
    "Rules": [
        {
            "ID": "MoveOldDataToGlacier",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {
                    "Days": 30,
                    "StorageClass": "STANDARD_IA"
                },
                {
                    "Days": 90,
                    "StorageClass": "GLACIER"
                }
            ]
        },
        {
            "ID": "DeleteOldCheckpoints",
            "Status": "Enabled",
            "Filter": {"Prefix": "artifacts/checkpoints/"},
            "Expiration": {"Days": 14}
        }
    ]
}
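
To make the effect of these rules concrete, the sketch below applies a configuration through boto3 and includes a small pure-Python helper that estimates which storage class an object would be in at a given age. `storage_class_at` and `apply_lifecycle` are hypothetical helper names, and the simulation assumes objects are uploaded as STANDARD and ignores the fact that S3 runs lifecycle transitions asynchronously.

```python
LIFECYCLE = {
    "Rules": [
        {
            "ID": "MoveOldDataToGlacier",
            "Status": "Enabled",
            "Filter": {"Prefix": "raw/"},
            "Transitions": [
                {"Days": 30, "StorageClass": "STANDARD_IA"},
                {"Days": 90, "StorageClass": "GLACIER"},
            ],
        },
        {
            "ID": "DeleteOldCheckpoints",
            "Status": "Enabled",
            "Filter": {"Prefix": "artifacts/checkpoints/"},
            "Expiration": {"Days": 14},
        },
    ]
}


def storage_class_at(age_days, rule):
    """Estimate the storage class at age_days under one rule; None means expired."""
    expiration = rule.get("Expiration", {}).get("Days")
    if expiration is not None and age_days >= expiration:
        return None
    current = "STANDARD"  # assumption: objects start in STANDARD
    for transition in sorted(rule.get("Transitions", []), key=lambda t: t["Days"]):
        if age_days >= transition["Days"]:
            current = transition["StorageClass"]
    return current


def apply_lifecycle(s3_client, bucket, config=LIFECYCLE):
    """Attach the lifecycle configuration to the bucket."""
    s3_client.put_bucket_lifecycle_configuration(
        Bucket=bucket, LifecycleConfiguration=config
    )
```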

2. SageMaker

2.1 Training jobs

import sagemaker
from sagemaker.estimator import Estimator

session = sagemaker.Session()
role = 'arn:aws:iam::123456789012:role/SageMakerRole'  # placeholder 12-digit account ID

# Train with a custom container
estimator = Estimator(
    image_uri='123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/my-ml-image:latest',
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',  # GPU instance
    volume_size=100,  # GB
    max_run=3600,  # maximum runtime (seconds)
    output_path='s3://my-ml-data/models/',
    hyperparameters={
        'epochs': 10,
        'batch_size': 32,
        'learning_rate': 0.001,
    },
    environment={
        'WANDB_API_KEY': '...',
    },
)

# Start training; each key becomes an input channel inside the container
estimator.fit({
    'train': 's3://my-ml-data/processed/train/',
    'validation': 's3://my-ml-data/processed/validation/',
})

2.2 Spot Training (cost savings)

from sagemaker.estimator import Estimator

estimator = Estimator(
    image_uri='...',
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    use_spot_instances=True,  # use Spot Instances
    max_wait=7200,  # maximum total time including waiting for Spot capacity (seconds); must be >= max_run
    max_run=3600,
    checkpoint_s3_uri='s3://my-ml-data/checkpoints/',  # checkpoints let the job resume after an interruption
    output_path='s3://my-ml-data/models/',
)
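
The actual saving from a Spot job can be checked after it finishes. The `DescribeTrainingJob` response reports both `TrainingTimeInSeconds` and `BillableTimeInSeconds`, and the saving is the fraction of training time that was not billed. The sketch below shows that arithmetic; the helper names are hypothetical.

```python
def spot_savings_percent(training_seconds, billable_seconds):
    """Percentage saved: the share of training time that was not billed."""
    if training_seconds <= 0:
        raise ValueError("training_seconds must be positive")
    return (1 - billable_seconds / training_seconds) * 100


def savings_from_description(description):
    """Compute savings from a describe_training_job response dict."""
    return spot_savings_percent(
        description["TrainingTimeInSeconds"],
        description["BillableTimeInSeconds"],
    )
```

With a real job, `description` would come from `boto3.client('sagemaker').describe_training_job(TrainingJobName=...)`.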

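2.3 Endpoint deployment

from sagemaker.deserializers import JSONDeserializer
from sagemaker.model import Model
from sagemaker.predictor import Predictor
from sagemaker.serializers import JSONSerializer

# Register the model
model = Model(
    image_uri='...',
    model_data='s3://my-ml-data/models/model.tar.gz',
    role=role,
    predictor_cls=Predictor,  # without this, deploy() returns None
)

# Deploy to an endpoint
predictor = model.deploy(
    initial_instance_count=2,
    instance_type='ml.m5.xlarge',
    endpoint_name='my-model-endpoint',
    serializer=JSONSerializer(),      # send the request as JSON
    deserializer=JSONDeserializer(),  # parse the JSON response
)

# Predict
result = predictor.predict({'features': [0.1, 0.2, 0.3]})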

2.4 Autoscaling

import boto3

client = boto3.client('application-autoscaling')

# Register the scalable target
client.register_scalable_target(
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-model-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    MinCapacity=1,
    MaxCapacity=10,
)

# Scaling policy
client.put_scaling_policy(
    PolicyName='InvocationsPerInstancePolicy',
    ServiceNamespace='sagemaker',
    ResourceId='endpoint/my-model-endpoint/variant/AllTraffic',
    ScalableDimension='sagemaker:variant:DesiredInstanceCount',
    PolicyType='TargetTrackingScaling',
    TargetTrackingScalingPolicyConfiguration={
        'TargetValue': 1000,  # target of 1,000 invocations per instance per minute
        'PredefinedMetricSpecification': {
            'PredefinedMetricType': 'SageMakerVariantInvocationsPerInstance'
        },
        'ScaleInCooldown': 300,
        'ScaleOutCooldown': 60,
    },
)
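
As a rough mental model, target tracking tries to keep the per-instance metric near `TargetValue` by changing the instance count within the registered bounds. The sketch below is a simplified steady-state approximation only: the real service acts through CloudWatch alarms and honors the cooldowns above. The function name `desired_instances` is hypothetical.

```python
import math


def desired_instances(total_invocations_per_minute, target_per_instance=1000,
                      min_capacity=1, max_capacity=10):
    """Approximate instance count that keeps per-instance load at the target."""
    needed = math.ceil(total_invocations_per_minute / target_per_instance)
    # Clamp to the MinCapacity/MaxCapacity registered for the variant
    return max(min_capacity, min(max_capacity, needed))
```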

3. Bedrock

3.1 Basic usage

import boto3
import json

bedrock = boto3.client('bedrock-runtime', region_name='us-east-1')

# Invoke Claude 3
response = bedrock.invoke_model(
    modelId='anthropic.claude-3-sonnet-20240229-v1:0',
    contentType='application/json',
    accept='application/json',
    body=json.dumps({
        'anthropic_version': 'bedrock-2023-05-31',
        'max_tokens': 1024,
        'messages': [
            {'role': 'user', 'content': 'What is the capital of South Korea?'}
        ]
    })
)

result = json.loads(response['body'].read())
print(result['content'][0]['text'])

3.2 Streaming

def stream_bedrock_response(prompt):
    response = bedrock.invoke_model_with_response_stream(
        modelId='anthropic.claude-3-sonnet-20240229-v1:0',
        contentType='application/json',
        body=json.dumps({
            'anthropic_version': 'bedrock-2023-05-31',
            'max_tokens': 1024,
            'messages': [{'role': 'user', 'content': prompt}]
        })
    )

    for event in response['body']:
        chunk = json.loads(event['chunk']['bytes'])
        if chunk['type'] == 'content_block_delta':
            yield chunk['delta']['text']
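
The parsing step can be separated from the API call so it can be tried without AWS access. The sketch below is a slightly more defensive variant of the loop above that skips events without a `chunk` and deltas without text; `iter_text_deltas` is a hypothetical helper name, and the event shape is assumed to match the one handled above.

```python
import json


def iter_text_deltas(event_stream):
    """Yield text fragments from a Bedrock response stream of Claude messages."""
    for event in event_stream:
        chunk = event.get("chunk")
        if not chunk:
            continue  # skip events that carry no chunk payload
        payload = json.loads(chunk["bytes"])
        if payload.get("type") == "content_block_delta":
            text = payload.get("delta", {}).get("text")
            if text:
                yield text
```

With a live call, the full answer would be `"".join(iter_text_deltas(response['body']))`.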

3.3 Knowledge Base (RAG)

# Create the Knowledge Base in the console or with CloudFormation,
# then query it through the API

bedrock_agent = boto3.client('bedrock-agent-runtime')

response = bedrock_agent.retrieve_and_generate(
    input={'text': 'What is the company vacation policy?'},
    retrieveAndGenerateConfiguration={
        'type': 'KNOWLEDGE_BASE',
        'knowledgeBaseConfiguration': {
            'knowledgeBaseId': 'KNOWLEDGE_BASE_ID',
            'modelArn': 'arn:aws:bedrock:us-east-1::foundation-model/anthropic.claude-3-sonnet-20240229-v1:0',
        }
    }
)

print(response['output']['text'])
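
Besides the generated text, the response carries citations that point back to the retrieved documents. The sketch below collects the cited source URIs. It assumes an S3-backed Knowledge Base, where each retrieved reference exposes `location.s3Location.uri`; `cited_sources` is a hypothetical helper name.

```python
def cited_sources(response):
    """Return the unique S3 URIs cited in a retrieve_and_generate response, in order."""
    uris = []
    for citation in response.get("citations", []):
        for ref in citation.get("retrievedReferences", []):
            # Assumption: the data source is S3, so the location has an s3Location
            uri = ref.get("location", {}).get("s3Location", {}).get("uri")
            if uri and uri not in uris:
                uris.append(uri)
    return uris
```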

4. Lambda

4.1 ML inference on Lambda

# lambda_function.py
import json
import boto3
import pickle
import os

# Cache the model in a global so warm invocations skip the download;
# only a cold start pays the load cost
s3 = boto3.client('s3')
model = None

def load_model():
    global model
    if model is None:
        s3.download_file(
            os.environ['MODEL_BUCKET'],
            os.environ['MODEL_KEY'],
            '/tmp/model.pkl'
        )
        with open('/tmp/model.pkl', 'rb') as f:
            model = pickle.load(f)
    return model

def handler(event, context):
    model = load_model()

    # Parse the input
    body = json.loads(event['body'])
    features = body['features']

    # Predict
    prediction = model.predict([features])[0]

    return {
        'statusCode': 200,
        'body': json.dumps({
            'prediction': float(prediction)
        })
    }
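
The handler logic can be tried locally before deploying by injecting a stand-in model. The sketch below wraps the same request/response shape in a factory and adds a 400 response for malformed input. `make_handler` and `SumModel` are hypothetical names for illustration; a real deployment would load the model as shown above.

```python
import json


def make_handler(model):
    """Build a Lambda-style handler around any object with a predict() method."""
    def handler(event, context):
        try:
            body = json.loads(event.get("body") or "{}")
            features = body["features"]
        except (json.JSONDecodeError, KeyError, TypeError):
            return {
                "statusCode": 400,
                "body": json.dumps({"error": "expected a JSON body with a 'features' list"}),
            }
        prediction = model.predict([features])[0]
        return {
            "statusCode": 200,
            "body": json.dumps({"prediction": float(prediction)}),
        }
    return handler


class SumModel:
    """Stand-in model for local testing: predicts the sum of each row."""
    def predict(self, rows):
        return [sum(row) for row in rows]
```

Calling `make_handler(SumModel())({'body': '{"features": [1, 2, 3]}'}, None)` exercises the same path API Gateway would trigger.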

4.2 Lambda container image

# Dockerfile
FROM public.ecr.aws/lambda/python:3.10

# Install dependencies
COPY requirements.txt .
RUN pip install -r requirements.txt

# Copy the function code
COPY lambda_function.py ${LAMBDA_TASK_ROOT}

# Set the handler
CMD ["lambda_function.handler"]

# Build and push (shell commands; 123456789012 is a placeholder account ID)
docker build -t my-ml-lambda .
aws ecr get-login-password --region ap-northeast-2 | docker login --username AWS --password-stdin 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com
docker tag my-ml-lambda:latest 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/my-ml-lambda:latest
docker push 123456789012.dkr.ecr.ap-northeast-2.amazonaws.com/my-ml-lambda:latest

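4.3 Lambda limits

Item	Limit
Memory	128MB to 10GB
Execution time	Up to 15 minutes
Deployment package (.zip)	250MB unzipped (including layers)
Container image	10GB
Concurrent executions	1,000 per Region (default; can be raised on request)
/tmp storage	512MB to 10GB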

5. EC2/EKS GPU instances

5.1 GPU instance types

Type	GPU	GPU memory (total)	Use case
p3.2xlarge	V100 x1	16GB	Training/inference
p3.8xlarge	V100 x4	64GB	Distributed training
p4d.24xlarge	A100 x8	320GB	Large-scale training
g4dn.xlarge	T4 x1	16GB	Inference (cost-efficient)
g5.xlarge	A10G x1	24GB	Inference/small-scale training
inf1.xlarge	Inferentia x1	-	Optimized inference
inf2.xlarge	Inferentia2 x1	-	LLM inference
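
When picking a row from this table, a quick first check is whether the model weights fit in GPU memory. The sketch below uses the common rule of thumb of parameter count times bytes per parameter. It covers weights only and ignores activations, KV cache, and optimizer state, so the `headroom` fraction is an assumption rather than a measured value, and the function names are hypothetical.

```python
BYTES_PER_PARAM = {"fp32": 4, "fp16": 2, "bf16": 2, "int8": 1}


def weight_memory_gb(num_params, dtype="fp16"):
    """Memory for the weights alone, in GB (10^9 bytes)."""
    return num_params * BYTES_PER_PARAM[dtype] / 1e9


def fits(num_params, gpu_memory_gb, dtype="fp16", headroom=0.2):
    """True if the weights fit while leaving a headroom fraction free."""
    return weight_memory_gb(num_params, dtype) <= gpu_memory_gb * (1 - headroom)
```

Under these assumptions a 7B-parameter model in fp16 needs about 14 GB for weights, which passes the check on a 24GB A10G but not on a 16GB T4.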

5.2 EKS GPU node group

# eks-gpu-nodegroup.yaml
apiVersion: eksctl.io/v1alpha5
kind: ClusterConfig

metadata:
  name: ml-cluster
  region: ap-northeast-2

managedNodeGroups:
  - name: gpu-nodes
    instanceType: g5.xlarge
    desiredCapacity: 2
    minSize: 0
    maxSize: 10
    volumeSize: 100
    ami: auto
    amiFamily: AmazonLinux2
    labels:
      node-type: gpu
    taints:
      - key: nvidia.com/gpu
        value: "true"
        effect: NoSchedule
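
Because the node group is tainted, a Pod lands on it only if it tolerates the taint, and it should request a GPU explicitly. The sketch below builds such a Pod manifest as a Python dict that can be dumped to JSON and passed to `kubectl apply -f`. It assumes the NVIDIA device plugin is running on the nodes so that `nvidia.com/gpu` is an allocatable resource; the function name and the image are hypothetical.

```python
import json


def gpu_pod_manifest(name, image, gpus=1):
    """Pod spec that targets the gpu-nodes group defined above."""
    return {
        "apiVersion": "v1",
        "kind": "Pod",
        "metadata": {"name": name},
        "spec": {
            "restartPolicy": "Never",
            # Matches the node label set on the node group
            "nodeSelector": {"node-type": "gpu"},
            # Matches the NoSchedule taint set on the node group
            "tolerations": [{
                "key": "nvidia.com/gpu",
                "operator": "Equal",
                "value": "true",
                "effect": "NoSchedule",
            }],
            "containers": [{
                "name": name,
                "image": image,
                "resources": {"limits": {"nvidia.com/gpu": gpus}},
            }],
        },
    }


if __name__ == "__main__":
    print(json.dumps(gpu_pod_manifest("train-job", "my-registry/my-ml-image:latest"), indent=2))
```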

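6. Cost optimization

6.1 Spot Instance strategy

# SageMaker Spot
estimator = Estimator(
    image_uri='...',
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    use_spot_instances=True,
    max_run=3600,   # must not exceed max_wait (the default max_run is 24 hours)
    max_wait=7200,
    checkpoint_s3_uri='s3://bucket/checkpoints/',
)

# EKS Spot
# Karpenter is recommended for provisioning Spot nodes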

6.2 Cost monitoring

import boto3

ce = boto3.client('ce')

response = ce.get_cost_and_usage(
    TimePeriod={
        'Start': '2024-01-01',
        'End': '2024-02-01'
    },
    Granularity='MONTHLY',
    Metrics=['UnblendedCost'],
    GroupBy=[
        {'Type': 'DIMENSION', 'Key': 'SERVICE'},
    ],
    Filter={
        'Tags': {
            'Key': 'Project',
            'Values': ['ml-project']
        }
    }
)

for group in response['ResultsByTime'][0]['Groups']:
    service = group['Keys'][0]
    cost = group['Metrics']['UnblendedCost']['Amount']
    print(f"{service}: ${float(cost):.2f}")
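
When the query spans several months, the response contains one `ResultsByTime` entry per period, and reading only index 0 misses the rest. The sketch below sums every period and ranks services by cost. `top_services` is a hypothetical helper name, and it assumes the same `GroupBy` on `SERVICE` and the `UnblendedCost` metric used above.

```python
def top_services(response, n=5):
    """Return the n most expensive services as (name, total_cost) pairs."""
    totals = {}
    for period in response.get("ResultsByTime", []):
        for group in period.get("Groups", []):
            service = group["Keys"][0]
            amount = float(group["Metrics"]["UnblendedCost"]["Amount"])
            totals[service] = totals.get(service, 0.0) + amount
    return sorted(totals.items(), key=lambda item: item[1], reverse=True)[:n]
```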

7. Cost details

7.1 SageMaker cost structure

Component	Cost (ap-northeast-2, approximate)	How to optimize
Notebook Instance (ml.t3.medium)	$0.05/hour	Stop when not in use
Training (ml.p3.2xlarge)	$4.19/hour	Use Spot (about 70% savings)
Endpoint (ml.m5.xlarge)	$0.27/hour	Autoscaling
Processing (ml.m5.xlarge)	$0.27/hour	Run only when needed
S3 data transfer	Free within the same Region	Use a VPC endpoint to avoid NAT gateway charges

7.2 Bedrock cost (token-based)

Model	Input (per 1K tokens)	Output (per 1K tokens)	Use case
Claude 3 Haiku	$0.00025	$0.00125	Fast responses, low cost
Claude 3 Sonnet	$0.003	$0.015	Balanced performance
Claude 3 Opus	$0.015	$0.075	Highest quality
Titan Text Lite	$0.0003	$0.0004	Simple tasks

Example cost calculation:

Monthly usage: 1 million requests, averaging 500 input + 200 output tokens each

Claude 3 Haiku:
- Input: 1,000,000 x 0.5K x $0.00025 = $125
- Output: 1,000,000 x 0.2K x $0.00125 = $250
- Total: $375/month

Claude 3 Sonnet:
- Input: 1,000,000 x 0.5K x $0.003 = $1,500
- Output: 1,000,000 x 0.2K x $0.015 = $3,000
- Total: $4,500/month
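
The same arithmetic can be wrapped in a small function to compare models under different traffic assumptions. This is a minimal sketch using the per-1K-token prices from the table above; the dictionary keys and the function name `monthly_cost` are hypothetical.

```python
# (input price, output price) in USD per 1K tokens, copied from the table above
PRICES_PER_1K = {
    "claude-3-haiku": (0.00025, 0.00125),
    "claude-3-sonnet": (0.003, 0.015),
    "claude-3-opus": (0.015, 0.075),
}


def monthly_cost(model, requests, input_tokens, output_tokens, prices=PRICES_PER_1K):
    """Monthly cost in USD for a given request volume and average token counts."""
    input_price, output_price = prices[model]
    input_cost = requests * input_tokens / 1000 * input_price
    output_cost = requests * output_tokens / 1000 * output_price
    return input_cost + output_cost
```

Calling `monthly_cost('claude-3-haiku', 1_000_000, 500, 200)` reproduces the Haiku total worked out above.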

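7.3 Lambda vs. EC2 cost comparison

Scenario	Lambda	EC2 (t3.medium)	Recommendation
10K requests/day	~$3/month	$30/month	Lambda
1M requests/day	~$200/month	$30/month	EC2
Variable traffic	Scales with usage	Fixed	Lambda
Cold-start sensitive	Not recommended	Recommended	EC2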
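
The crossover between the two rows can be estimated with a short calculation. The sketch below assumes Lambda list prices of $0.20 per million requests and $0.0000166667 per GB-second, an average of 400 ms at 1 GB per request, and no free tier. All of these are assumptions chosen to land near the table's figures, not values from the original, and the EC2 figure is the $30/month from the table.

```python
def lambda_monthly_cost(requests_per_day, avg_duration_s=0.4, memory_gb=1.0, days=30,
                        price_per_million_requests=0.20,
                        price_per_gb_second=0.0000166667):
    """Estimated monthly Lambda cost in USD (request charge + compute charge)."""
    requests = requests_per_day * days
    request_cost = requests / 1_000_000 * price_per_million_requests
    compute_cost = requests * avg_duration_s * memory_gb * price_per_gb_second
    return request_cost + compute_cost


def break_even_requests_per_day(ec2_monthly=30.0, **lambda_kwargs):
    """Daily request volume at which Lambda costs the same as the fixed EC2 price."""
    cost_per_daily_request = lambda_monthly_cost(1, **lambda_kwargs)
    return ec2_monthly / cost_per_daily_request
```

Under these assumptions the break-even sits somewhere above 100K requests per day; shorter or lighter functions push it higher.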

7.4 Cost reduction checklist

[ ] Use Spot Instances (training workloads)
[ ] Set S3 lifecycle policies
[ ] Stop unused Notebook instances
[ ] Configure endpoint autoscaling
[ ] Reduce data transfer costs with VPC endpoints
[ ] Review Reserved Capacity (always-on workloads)
[ ] Monitor costs with CloudWatch alarms
[ ] Review SageMaker Savings Plans

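8. Practical tips

8.1 Recommended SageMaker patterns

# 1. Test locally before running on SageMaker
# Start with a small data sample on your own machine

# 2. Always checkpoint when using Spot
# (bucket and job_name are placeholders defined elsewhere)
checkpoint_s3_uri = f"s3://{bucket}/checkpoints/{job_name}"

# 3. Increase instance_count gradually for distributed training
# Scale out 1 -> 2 -> 4 and check performance at each step

# 4. Locate the packaged model after training completes
model_data = estimator.model_data  # S3 path of the compressed model.tar.gz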

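8.2 S3 performance optimization

# 1. Multipart upload (large files)
from boto3.s3.transfer import TransferConfig

config = TransferConfig(
    multipart_threshold=100 * 1024 * 1024,  # use multipart above 100MB
    max_concurrency=10,                     # parallel upload threads
    multipart_chunksize=100 * 1024 * 1024,  # 100MB per part
)
s3.upload_file('large_file.tar.gz', bucket, key, Config=config)

# 2. Spread keys across prefixes (when high throughput is needed)
# S3 request rates scale per prefix, so parallel access to many prefixes
# raises aggregate throughput.
# Single hot prefix: data/2024/01/01/file1.csv
# Spread by hash:    abc123/data/2024/01/01/file1.csv (hash prefix)

# 3. Fetch only part of an object with S3 Select
# (confirm S3 Select is available for your account before relying on it)
response = s3.select_object_content(
    Bucket=bucket,
    Key='data.csv',
    ExpressionType='SQL',
    Expression="SELECT * FROM s3object WHERE category = 'A'",
    InputSerialization={'CSV': {'FileHeaderInfo': 'USE'}},
    OutputSerialization={'CSV': {}}
)
# The result is an event stream; the matching rows arrive in Records events
for event in response['Payload']:
    if 'Records' in event:
        print(event['Records']['Payload'].decode('utf-8'))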

8.3 Security best practices

# 1. IAM least privilege
{
    "Effect": "Allow",
    "Action": ["s3:GetObject"],
    "Resource": "arn:aws:s3:::my-bucket/models/*"  # only this path
}

# 2. KMS encryption
s3.put_object(
    Bucket=bucket,
    Key=key,
    Body=data,
    ServerSideEncryption='aws:kms',
    SSEKMSKeyId='alias/my-key'
)

# 3. Run SageMaker inside a VPC
estimator = Estimator(
    image_uri='...',
    role=role,
    instance_count=1,
    instance_type='ml.p3.2xlarge',
    subnets=['subnet-xxx'],
    security_group_ids=['sg-xxx'],
    enable_network_isolation=True,  # block outbound network access from the container
)
