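Pixtral (Pixtral 12B / Pixtral Large)

1. Overview

Item	Details
Developer	Mistral AI (France)
Release date	Pixtral 12B: September 2024, Pixtral Large: November 2024
Model type	Open weights (Pixtral 12B: Apache 2.0; Pixtral Large: Mistral Research License)
Access	Hugging Face, Mistral API, vLLM

Pixtral is Mistral AI's first multimodal model family. It pairs a vision encoder trained from scratch (400M parameters in Pixtral 12B) with a Mistral LLM: Mistral NeMo for Pixtral 12B and Mistral Large 2 for Pixtral Large. It adds strong vision capabilities while preserving text performance.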

2. Model family

Model	Vision encoder	LLM	Total parameters
Pixtral 12B	400M (in-house)	Mistral NeMo 12B	12.7B
Pixtral Large	1B (in-house)	Mistral Large 2 (123B)	124B

3. Architecture

3.1 Structure

[Image (variable resolution)]
         |
         v
[Split into 16x16 patches]
         |
         v
[PixtralViT] -- vision encoder (trained from scratch, 2D RoPE in its attention layers)
         |
         v
[Projection (multimodal projector)]
         |
         v
[Mistral LLM] <-- [text tokens]
         |
         v
[Output tokens]

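3.2 Key components

Component	Specification
Vision encoder	PixtralViT, trained from scratch (400M in Pixtral 12B, 1B in Pixtral Large)
Patch size	16x16
Position encoding	2D Rotary Position Embedding (2D RoPE)
Projection	Multimodal projector
LLM	Mistral NeMo 12B / Mistral Large 2 (123B)

3.3 PixtralViT features

Vision encoder trained from scratch by Mistral
Mistral's own design rather than a reused CLIP/SigLIP encoder
2D RoPE supports arbitrary resolutions
Preserves image order and position information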
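
To make the 2D RoPE point concrete, here is a minimal, generic sketch of a two-dimensional rotary embedding. It is not Pixtral's actual implementation: the function name `rope_2d`, the rule that consecutive feature pairs alternate between the row index and the column index, and the frequency schedule (`base=10000.0`) are all illustrative assumptions.

```python
import math

def rope_2d(vec, row, col, base=10000.0):
    """Rotate feature pairs of `vec` by angles derived from a patch's (row, col).

    Illustrative assumption: pair 0 uses the row index, pair 1 the column
    index, pair 2 the row index again, and so on.
    """
    if len(vec) % 4 != 0:
        raise ValueError("feature dimension must be a multiple of 4")
    n_pairs = len(vec) // 2
    n_freqs = n_pairs // 2  # each frequency is used once for rows, once for columns
    out = []
    for p in range(n_pairs):
        pos = row if p % 2 == 0 else col
        freq = base ** (-(p // 2) / n_freqs)
        angle = pos * freq
        c, s = math.cos(angle), math.sin(angle)
        x, y = vec[2 * p], vec[2 * p + 1]
        out.extend([x * c - y * s, x * s + y * c])
    return out

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

q = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
k = [0.5, -1.0, 2.0, 0.0, 1.5, 1.0, -2.0, 3.0]

# Both pairs of patches are separated by the same (row, col) offset of (2, 3),
# so the attention scores should match up to floating-point error.
near = dot(rope_2d(q, 3, 5), rope_2d(k, 1, 2))
far = dot(rope_2d(q, 13, 25), rope_2d(k, 11, 22))
print(abs(near - far) < 1e-9)
```

Because position enters as a rotation rather than a lookup into a learned table, a scheme of this kind does not tie the encoder to one fixed grid size, which is what allows variable image resolutions.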

4. Image processing

4.1 Resolution support

Item	Specification
Minimum resolution	16x16 (1 patch)
Maximum resolution	No fixed limit (bounded by token count)
Patch size	16x16
Token count	(width/16) x (height/16)

4.2 Token count by resolution

Resolution	Patches	Tokens
256x256	16x16	256
512x512	32x32	1,024
1024x1024	64x64	4,096
2048x2048	128x128	16,384
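
The rows above follow directly from the patch formula. The sketch below reproduces them; it counts patch tokens only (any special tokens a tokenizer may add around an image are not included), and rounding partial patches up for sizes that are not multiples of 16 is an assumption. The names `patch_grid` and `image_tokens` are hypothetical helpers.

```python
import math

PATCH_SIZE = 16  # patch edge length in pixels

def patch_grid(width, height, patch_size=PATCH_SIZE):
    """Return (columns, rows) of the patch grid, rounding partial patches up (assumption)."""
    return math.ceil(width / patch_size), math.ceil(height / patch_size)

def image_tokens(width, height, patch_size=PATCH_SIZE):
    """Number of patch tokens for an image of the given pixel size."""
    cols, rows = patch_grid(width, height, patch_size)
    return cols * rows

for side in (256, 512, 1024, 2048):
    print(f"{side}x{side}: {image_tokens(side, side)} tokens")
```

Doubling both sides of an image quadruples its token count, which is why large inputs consume context so quickly.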

4.3 Multiple images

Item	Support
Maximum number of images	Unlimited within the context limit
Image interleaving	Supported
Cross-image references	Supported
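
As an illustration of interleaving, the sketch below assembles a single user message in which text and images alternate, using the same `text` / `image_url` content format as the API examples in section 6. The helper `interleaved_message` and its `("text", ...)` / `("image", ...)` tuple convention are hypothetical, not part of any SDK, and the URLs are placeholders.

```python
def interleaved_message(parts):
    """Build one user message from an ordered list of text strings and image URLs.

    `parts` is a list of ("text", str) or ("image", url) tuples (hypothetical format).
    """
    content = []
    for kind, value in parts:
        if kind == "text":
            content.append({"type": "text", "text": value})
        elif kind == "image":
            content.append({"type": "image_url", "image_url": {"url": value}})
        else:
            raise ValueError(f"unknown part kind: {kind!r}")
    return {"role": "user", "content": content}

message = interleaved_message([
    ("text", "Compare the first chart"),
    ("image", "https://example.com/chart_a.png"),
    ("text", "with the second one."),
    ("image", "https://example.com/chart_b.png"),
])
print(len(message["content"]))
```

The resulting dictionary can be passed as one element of the `messages` list; the order of the parts is preserved, so the text can refer to "the first" and "the second" image.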

4.4 Supported formats

JPEG
PNG
WebP
GIF
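
When an image is sent inline as base64, the data URL should carry the MIME type that matches the file format. A minimal sketch, assuming the format can be inferred from the file extension; `to_data_url` and the `MIME_TYPES` table are hypothetical helpers covering only the four formats listed above.

```python
import base64
from pathlib import Path

# Explicit table for the four supported formats.
MIME_TYPES = {
    ".jpg": "image/jpeg",
    ".jpeg": "image/jpeg",
    ".png": "image/png",
    ".webp": "image/webp",
    ".gif": "image/gif",
}

def to_data_url(data, filename):
    """Encode raw image bytes as a data URL, choosing the MIME type from the extension."""
    suffix = Path(filename).suffix.lower()
    if suffix not in MIME_TYPES:
        raise ValueError(f"unsupported image format: {suffix or filename}")
    encoded = base64.b64encode(data).decode("ascii")
    return f"data:{MIME_TYPES[suffix]};base64,{encoded}"

# Placeholder bytes stand in for real image data.
print(to_data_url(b"abc", "photo.PNG"))  # data:image/png;base64,YWJj
```

Rejecting unknown extensions up front avoids sending a request that would fail only after the upload.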

5. Benchmark performance

5.1 Pixtral 12B

Benchmark	Score	Notes
MMMU (val)	52.5%	Among the best open-source results
MathVista	58.0%	-
ChartQA	81.8%	-
DocVQA	90.7%	Strong OCR
AI2D	79.8%	-
VQAv2	78.6%	-
TextVQA	75.2%	-
MM-MT-Bench	7.66	Multi-turn conversation

5.2 Pixtral Large (124B)

Benchmark	Score	Notes
MMMU (val)	65.4%	Close to GPT-4o
MathVista	69.4%	-
ChartQA	88.2%	-
DocVQA	94.1%	-

5.3 Text-only benchmarks (Pixtral 12B)

Text performance is preserved after adding vision:

Benchmark	Mistral NeMo	Pixtral 12B
MMLU	68.0%	68.3%
HumanEval	68.0%	70.1%
MATH	40.0%	41.2%

6. Usage

6.1 Mistral API

from mistralai import Mistral
import base64

client = Mistral(api_key="YOUR_API_KEY")

# Image referenced by URL
response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
            ]
        }
    ]
)

# Local image sent inline as a base64 data URL
with open("image.jpg", "rb") as f:
    image_data = base64.b64encode(f.read()).decode()

response = client.chat.complete(
    model="pixtral-12b-2409",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "What's in this image?"},
                {"type": "image_url", "image_url": {"url": f"data:image/jpeg;base64,{image_data}"}}
            ]
        }
    ]
)

6.2 vLLM serving (recommended)

vllm serve mistralai/Pixtral-12B-2409 \
    --tokenizer_mode mistral \
    --limit_mm_per_prompt 'image=4' \
    --max-model-len 16384

from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="dummy")

response = client.chat.completions.create(
    model="mistralai/Pixtral-12B-2409",
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/image.jpg"}}
            ]
        }
    ],
    max_tokens=512
)

6.3 Hugging Face Transformers

from transformers import AutoProcessor, LlavaForConditionalGeneration
from PIL import Image
import torch

model_id = "mistral-community/pixtral-12b"

processor = AutoProcessor.from_pretrained(model_id)
model = LlavaForConditionalGeneration.from_pretrained(
    model_id,
    torch_dtype=torch.bfloat16,
    device_map="auto"
)

image = Image.open("image.jpg")

messages = [
    {
        "role": "user",
        "content": [
            {"type": "image"},
            {"type": "text", "text": "Describe this image in detail."}
        ]
    }
]

prompt = processor.apply_chat_template(messages, add_generation_prompt=True)
inputs = processor(images=image, text=prompt, return_tensors="pt").to(model.device)

outputs = model.generate(**inputs, max_new_tokens=512)
result = processor.decode(outputs[0], skip_special_tokens=True)

6.4 mistral-inference (official)

from mistral_inference.transformer import Transformer
from mistral_inference.generate import generate
from mistral_common.tokens.tokenizers.mistral import MistralTokenizer
from mistral_common.protocol.instruct.messages import UserMessage, TextChunk, ImageURLChunk
from mistral_common.protocol.instruct.request import ChatCompletionRequest

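# Assumes the Pixtral-12B-2409 weights and tekken.json were downloaded into this folder.
tokenizer = MistralTokenizer.from_file("Pixtral-12B-2409/tekken.json")
model = Transformer.from_folder("Pixtral-12B-2409")

request = ChatCompletionRequest(
    messages=[
        UserMessage(content=[
            ImageURLChunk(image_url="https://example.com/image.jpg"),
            TextChunk(text="Describe this image.")
        ])
    ]
)

# The encoded request carries both the token ids and the preprocessed images.
encoded = tokenizer.encode_chat_completion(request)

# generate() returns (tokens, logprobs); the images must be passed alongside the tokens.
out_tokens, _ = generate(
    [encoded.tokens],
    model,
    images=[encoded.images],
    max_tokens=256,
    temperature=0.35,
    eos_id=tokenizer.instruct_tokenizer.tokenizer.eos_id,
)
result = tokenizer.decode(out_tokens[0])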

7. VRAM requirements

7.1 Pixtral 12B

Precision	VRAM	Notes
FP32	48GB	-
BF16	24GB	Recommended
INT8	14GB	-
INT4	8GB	-

7.2 Pixtral Large (124B)

Precision	VRAM	Configuration
BF16	248GB	4x A100 80GB
INT8	124GB	2x A100 80GB
INT4	64GB	1x A100 80GB
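
The figures in these tables can be sanity-checked with a weights-only estimate: parameter count times bytes per parameter. The sketch below is a rough lower bound under that assumption; it ignores activations, the KV cache, and framework overhead, which is presumably why some of the quantized rows above are somewhat higher. The nominal sizes of 12B and 124B parameters and the helper name `weight_memory_gb` are assumptions for illustration.

```python
BYTES_PER_PARAM = {"FP32": 4.0, "BF16": 2.0, "INT8": 1.0, "INT4": 0.5}

def weight_memory_gb(n_params, precision):
    """Memory for the weights alone, in GB (1 GB = 1e9 bytes)."""
    return n_params * BYTES_PER_PARAM[precision] / 1e9

for name, n_params in (("Pixtral 12B", 12e9), ("Pixtral Large", 124e9)):
    row = ", ".join(
        f"{precision}: {weight_memory_gb(n_params, precision):.0f} GB"
        for precision in BYTES_PER_PARAM
    )
    print(f"{name} -> {row}")
```

In practice, extra headroom is needed on top of this estimate, and it grows with image size and context length because every image token occupies KV-cache memory.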

7.3 Inference speed (A100 80GB)

Model	BF16	INT8
Pixtral 12B	48 tok/s	65 tok/s
Pixtral Large	8 tok/s	14 tok/s

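8. Advantages

Advantage	Description
Text performance preserved	LLM performance is essentially unchanged after adding vision
Arbitrary resolution	Handles images of any size
Strong OCR	Excellent document understanding
Apache 2.0 (Pixtral 12B)	Free commercial use
Multi-turn vision	Images can be referenced across a conversation
vLLM support	Easy production serving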

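9. Disadvantages

Disadvantage	Description
High-resolution token blow-up	Token count grows rapidly with image size
Large model requirements	The 124B model needs high-end hardware
No video support	Processes images only
Complex HF conversion	mistral-inference is the officially recommended path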
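
One common way to limit the token blow-up is to downscale large images before sending them. The following sketch computes a target size that keeps the aspect ratio; the cap `max_side=1024` is an arbitrary illustrative value rather than a documented limit, and `fit_within` and `patch_tokens` are hypothetical helpers.

```python
import math

def fit_within(width, height, max_side=1024):
    """Scale (width, height) down so the longer side is at most `max_side`."""
    longest = max(width, height)
    if longest <= max_side:
        return width, height  # already small enough, leave untouched
    scale = max_side / longest
    return max(1, round(width * scale)), max(1, round(height * scale))

def patch_tokens(width, height, patch_size=16):
    """Patch tokens for an image, rounding partial patches up (assumption)."""
    return math.ceil(width / patch_size) * math.ceil(height / patch_size)

w, h = fit_within(2048, 1536)
print((w, h), patch_tokens(2048, 1536), "->", patch_tokens(w, h))
```

The returned size can then be handed to an image library's resize routine (for example Pillow's `Image.resize`). Halving each side cuts the token count to a quarter, at the cost of fine detail such as small text.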

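10. Use cases

10.1 Good fits

Document analysis and OCR
Chart and graph understanding
Code screenshot analysis
Multi-turn vision conversations
Production deployment (vLLM)
Research and fine-tuning

10.2 Poor fits

Video analysis
High-volume processing of ultra-high-resolution images
Low-spec environments (Large model)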

11. API pricing (Mistral Platform)

Model	Input (per 1M tokens)	Output (per 1M tokens)
Pixtral 12B	$0.15	$0.15
Pixtral Large	$2.00	$6.00
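
A per-request cost can be estimated from these prices. The sketch below assumes that image tokens are billed as ordinary input tokens at the rates in the table, which should be checked against the current pricing page; `request_cost` is a hypothetical helper.

```python
# USD per 1M tokens, copied from the pricing table.
PRICES = {
    "Pixtral 12B": {"input": 0.15, "output": 0.15},
    "Pixtral Large": {"input": 2.00, "output": 6.00},
}

def request_cost(model, input_tokens, output_tokens):
    """Estimated cost in USD for one request."""
    price = PRICES[model]
    total = input_tokens * price["input"] + output_tokens * price["output"]
    return total / 1_000_000

# One 1024x1024 image (4,096 patch tokens) plus a 100-token prompt, 500 output tokens.
for model in PRICES:
    print(f"{model}: ${request_cost(model, 4096 + 100, 500):.6f}")
```

For image-heavy workloads the input side dominates, so reducing image resolution is usually the most direct way to lower cost.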
