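Transformers and Attention

Category: Layer 11 - AI Fundamentals & Machine Learning | Prerequisites: L11-10 (Mathematical Foundations of ML), L11-40 (Neural Networks and Backpropagation)

Transformers and Attention: Self-Attention, KV Cache, Long-Context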

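1. One-Line Definition

A transformer is an attention-based sequence model, and attention is a mechanism that dynamically computes relationships between tokens as a weighted sum over queries, keys, and values. LLMs, translation models, and image models (ViT) are all built on this same structure.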

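2. Why It Matters

The standard for modern LLM architectures: GPT, LLaMA, Claude, and Gemini are all decoder-only transformers
Intuition for KV cache and context-length cost: the answer to “why is long context expensive for LLMs?” is inside the attention formula
The foundation for newer techniques: GQA, MQA, sliding window, MoE, and FlashAttention are all variations on this structure
Operational cost modeling: the key variables behind input vs. output token pricing, KV cache memory, and batch sizing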

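3. Core Concepts

3.1 Limits of RNNs → The Arrival of Attention

Limits of the previous generation (RNN/LSTM):

Sequential processing: tokens are processed one at a time → hard to parallelize on GPUs
Weak long-range dependencies: information passed between distant tokens fades (the through-time version of vanishing gradients)
State bottleneck: the entire past is compressed into a single hidden state

“Attention Is All You Need” (Vaswani et al., 2017) dropped recurrence and showed that attention alone is sufficient: every token connects directly to every other token, and the computation parallelizes.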

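3.2 The Self-Attention Formula

Dynamically computes how much each token should attend to every other token.

Attention(Q, K, V) = softmax(Q K^T / √d_k) V

Q (query): n × d_k    ← "what each token is looking for"
K (key):   n × d_k    ← "what each token has to offer"
V (value): n × d_v    ← "the information each token carries"

Interpretation:

Q K^T: dot products of all token pairs → similarity matrix (n × n)
/ √d_k: variance normalization (see L11-10 §3.2). A dot product of d_k unit-variance components has variance d_k, so dividing by √d_k brings it back to 1 and keeps softmax from saturating
softmax(...): normalizes each row → attention weights (each row sums to 1)
× V: weighted sum of values using those weights → each token's “information blend”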
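
The four steps above can be sketched in a few lines of NumPy. This is a minimal illustration, not a production kernel: the sizes (n=4, d_k=d_v=8) and the random inputs are assumptions chosen only to keep the example small.

```python
import numpy as np

def softmax(x, axis=-1):
    # Subtracting the row max is for numerical stability; the result is unchanged.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def attention(Q, K, V):
    d_k = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d_k)      # (n, n) similarity matrix, scaled
    weights = softmax(scores, axis=-1)   # each row sums to 1
    return weights @ V, weights          # weighted sum of values

rng = np.random.default_rng(0)
n, d_k, d_v = 4, 8, 8
Q = rng.standard_normal((n, d_k))
K = rng.standard_normal((n, d_k))
V = rng.standard_normal((n, d_v))
out, weights = attention(Q, K, V)
print(out.shape)              # one d_v-dimensional output per token
print(weights.sum(axis=-1))   # row sums of the attention weights
```

Each output row is a convex combination of the rows of V, which is what “information blend” means in practice.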

3.3 Multi-Head Attention

Attention is split into h smaller heads that are computed in parallel. Each head can learn a different kind of relationship (syntax, semantics, distance, and so on).

MultiHead(Q,K,V) = Concat(head_1, ..., head_h) · W_O
head_i = Attention(Q · W_Q^i, K · W_K^i, V · W_V^i)

Operational intuition:

LLaMA-3-8B: num_heads=32, num_kv_heads=8 (GQA), d_head=128, hidden=4096 (= 32 × 128)
LLaMA-3-70B: num_heads=64, num_kv_heads=8 (GQA), d_head=128, hidden=8192
The number of query heads differs from the number of KV heads (GQA) → size the KV cache by num_kv_heads
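
To make the query-head vs. KV-head distinction concrete, here is a rough NumPy sketch of grouped-query attention using the LLaMA-3-8B head counts quoted above. The sequence length (n=5), the random tensors, and the omission of the W_Q/W_K/W_V/W_O projections and the causal mask are simplifying assumptions.

```python
import numpy as np

def softmax(x):
    x = x - x.max(axis=-1, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)

def gqa_attention(q, k, v):
    """q: (num_heads, n, d_head); k, v: (num_kv_heads, n, d_head)."""
    num_heads, _, d_head = q.shape
    group = num_heads // k.shape[0]      # query heads sharing one KV head
    k = np.repeat(k, group, axis=0)      # expanded only at compute time,
    v = np.repeat(v, group, axis=0)      # the cache keeps num_kv_heads copies
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)   # (num_heads, n, n)
    return softmax(scores) @ v           # (num_heads, n, d_head)

rng = np.random.default_rng(0)
num_heads, num_kv_heads, n, d_head = 32, 8, 5, 128
q = rng.standard_normal((num_heads, n, d_head))
k = rng.standard_normal((num_kv_heads, n, d_head))
v = rng.standard_normal((num_kv_heads, n, d_head))

out = gqa_attention(q, k, v)
# Concat(head_1, ..., head_h): back to (n, hidden) before the W_O projection.
concat = out.transpose(1, 0, 2).reshape(n, num_heads * d_head)
print(concat.shape)
print(q.nbytes // k.nbytes)   # how many times smaller the stored K is than a 32-head K
```

Setting num_kv_heads equal to num_heads recovers plain multi-head attention, and setting it to 1 gives MQA.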

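3.4 Positional Encoding

Self-attention has no notion of order (it treats its input like a set), so position information is injected separately.

Method	Description	Used in
Sinusoidal (absolute)	Maps each position to a unique vector with sin/cos	Original Transformer
Learned positional	Learns a position embedding	BERT, GPT-2
RoPE (Rotary)	Applies a rotation matrix to Q and K; expresses relative position naturally	Standard in LLaMA, Mistral, Qwen
ALiBi	Adds a distance-proportional penalty to attention scores	BLOOM, MPT
NoPE	Trains without explicit position information; explored in some long-context models	Some research

Why RoPE is the LLM standard: it adds no parameters, and it extends to longer contexts well once combined with scaling such as NTK-aware or YaRN (plain RoPE degrades beyond its trained length). Llama 3.1's 128K context is also a RoPE variant.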
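
The claim that RoPE “expresses relative position naturally” can be checked numerically. The sketch below rotates consecutive pairs of dimensions, which is one common layout (some implementations pair dimension i with i + d/2 instead); the base of 10000, the vector size, and the positions are assumptions for illustration.

```python
import numpy as np

def rope(x, pos, base=10000.0):
    """Rotate vector x (even length d) as if it sat at integer position pos."""
    d = x.shape[-1]
    freqs = base ** (-np.arange(0, d, 2) / d)   # one frequency per 2-D pair
    angle = pos * freqs
    cos, sin = np.cos(angle), np.sin(angle)
    x1, x2 = x[0::2], x[1::2]
    out = np.empty_like(x)
    out[0::2] = x1 * cos - x2 * sin
    out[1::2] = x1 * sin + x2 * cos
    return out

rng = np.random.default_rng(0)
q, k = rng.standard_normal(64), rng.standard_normal(64)

# Same offset (3) at two different absolute positions.
s_near = rope(q, 10) @ rope(k, 7)
s_far = rope(q, 110) @ rope(k, 107)
print(np.isclose(s_near, s_far))   # the score depends only on the offset
```

Because rotations preserve length, RoPE changes only the angle between q and k, and it does so without any learned parameters.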

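3.5 Transformer Block Structure

One LLM layer consists of two sub-layers:

x' = x + Attention(LayerNorm(x))           # attention sub-layer
y  = x' + FFN(LayerNorm(x'))                # FFN sub-layer

Pre-norm: normalization is applied to the input of each sub-layer, inside the residual branch, so the residual path itself stays unnormalized (L11-40 §3.8). The normalization function is LayerNorm (BERT, GPT-2/3) or RMSNorm (standard in LLaMA, Mistral, Qwen, DeepSeek; slightly faster than LayerNorm at nearly the same quality)
Residual connection: around both sub-layers
FFN: classically hidden_dim → 4×hidden_dim → hidden_dim with GELU (BERT, GPT). SwiGLU models (LLaMA, Mistral) add a gate projection and usually pick a different expansion ratio (LLaMA-3-8B: 4096 → 14336)
Number of layers: LLaMA-3-8B has 32 layers, 70B has 80. The same block is stacked that many times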
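
The pre-norm block can be written out as a short NumPy sketch. Several things here are simplifications assumed for brevity: RMSNorm has no learned gain, the attention sub-layer is replaced by a placeholder function, and the dimensions (d=16, d_ff=64) are toy values.

```python
import numpy as np

def rms_norm(x, eps=1e-6):
    # RMSNorm without the learned gain vector, for brevity.
    return x / np.sqrt((x ** 2).mean(axis=-1, keepdims=True) + eps)

def silu(x):
    return x / (1.0 + np.exp(-x))

def swiglu_ffn(x, w_gate, w_up, w_down):
    # SwiGLU: a gated up-projection followed by a down-projection.
    return (silu(x @ w_gate) * (x @ w_up)) @ w_down

def block(x, attn, ffn):
    x = x + attn(rms_norm(x))      # attention sub-layer, residual path untouched
    return x + ffn(rms_norm(x))    # FFN sub-layer

rng = np.random.default_rng(0)
n, d, d_ff = 4, 16, 64
w_gate, w_up = rng.standard_normal((2, d, d_ff)) * 0.1
w_down = rng.standard_normal((d_ff, d)) * 0.1
x = rng.standard_normal((n, d))

placeholder_attn = lambda h: h     # stands in for a real self-attention layer
y = block(x, placeholder_attn, lambda h: swiglu_ffn(h, w_gate, w_up, w_down))
print(y.shape)
```

If both sub-layers output zeros, the block returns its input unchanged, which is the property that makes deep stacks of pre-norm blocks easy to train.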

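3.6 Encoder-only / Decoder-only / Encoder-Decoder

Type	Examples	Attention direction	Use
Encoder-only	BERT, RoBERTa	Bidirectional (full attention)	Classification, embeddings, NER
Decoder-only	GPT, LLaMA, Claude	Unidirectional (causal mask)	Modern LLM standard
Encoder-Decoder	T5, BART, NLLB	Encoder: bidirectional; decoder: causal self-attention + cross-attention to the encoder	Translation, summarization

Why decoder-only became the LLM standard:

Trainable with plain next-token prediction alone (self-supervised, L11-30 §3.4)
In-context learning emerges naturally
No separate encoder is needed (every task can be expressed as a prompt)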

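3.7 Causal Mask

A decoder-only model must not see future tokens (next-token prediction would become cheating). The strict upper triangle of the attention score matrix is masked to -inf → it becomes 0 after softmax.

attention scores (n=4):
[A,  -inf, -inf, -inf]
[B,   C,   -inf, -inf]
[D,   E,    F,   -inf]
[G,   H,    I,    J  ]

Token i attends only to tokens 0 through i → during training all positions can be computed at once (teacher forcing).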
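
A small NumPy sketch of the mask, assuming random Q and K with n=4 and d_k=8 to mirror the 4 × 4 matrix above:

```python
import numpy as np

def causal_attention_weights(Q, K):
    n, d_k = Q.shape
    scores = Q @ K.T / np.sqrt(d_k)
    # True strictly above the diagonal: key j is in the future of query i.
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    scores = scores - scores.max(axis=-1, keepdims=True)
    w = np.exp(scores)                       # exp(-inf) is exactly 0
    return w / w.sum(axis=-1, keepdims=True)

rng = np.random.default_rng(0)
Q, K = rng.standard_normal((2, 4, 8))
W = causal_attention_weights(Q, K)
print(np.round(W, 2))   # lower-triangular; the first row puts all weight on token 0
```

All four rows are computed in one matrix operation, which is why training can score every position in parallel even though generation is sequential.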

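3.8 KV Cache: The Core of Inference

Training processes all tokens at once, but inference generates one token at a time. Recomputing K and V for the whole prefix at every step costs O(n²) per generated token; caching K and V means each step only projects the new token and attends over the cache, which is O(n) per token.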
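
The caching idea can be sketched for a single attention head. The projection matrices, the token embeddings, and the sizes below are hypothetical toy values; the point is only that the cached and uncached paths produce the same outputs while the cached path projects each token once.

```python
import numpy as np

def attend(q, K, V):
    s = K @ q / np.sqrt(q.shape[-1])     # scores against every cached key
    w = np.exp(s - s.max())
    return (w / w.sum()) @ V

rng = np.random.default_rng(0)
d, steps = 8, 6
W_q, W_k, W_v = rng.standard_normal((3, d, d))
xs = rng.standard_normal((steps, d))     # hypothetical token embeddings

K_cache, V_cache, cached_out = [], [], []
for x in xs:                             # decode one token at a time
    K_cache.append(x @ W_k)              # project only the new token
    V_cache.append(x @ W_v)
    cached_out.append(attend(x @ W_q, np.array(K_cache), np.array(V_cache)))

# Without a cache: re-project the whole prefix at every step.
full_out = [attend(xs[t] @ W_q, xs[: t + 1] @ W_k, xs[: t + 1] @ W_v)
            for t in range(steps)]
print(np.allclose(cached_out, full_out))
```

The price of skipping the recomputation is that K_cache and V_cache grow by one entry per token per layer, which is the memory cost discussed next.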

KV cache memory = 2 × num_layers × num_kv_heads × seq_len × d_head × bytes_per_element

(2 = K and V; num_kv_heads differs from num_heads under GQA/MQA. This is per sequence, so multiply by batch size.)

LLaMA-3-8B example (32 layers, num_kv_heads=8 (GQA), d_head=128, fp16):

2 × 32 × 8 × seq_len × 128 × 2 = 131072 × seq_len bytes ≈ seq_len × 0.125 MB

→ 2k context = 0.25 GB, 32k context = ~4 GB, 128k context = ~16 GB. The KV cache accounts for most of the memory in long-context inference, especially once batch size is above 1. (Ignoring GQA and using the 32 query heads overestimates the cache by 4×, a common mistake in capacity estimates.)
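
The arithmetic above is easy to wrap in a helper so the figures can be re-derived for other models. The function name and the batch parameter are assumptions added for illustration; the constants are the LLaMA-3-8B values from the text, with 1 GiB = 1024³ bytes.

```python
def kv_cache_bytes(num_layers, num_kv_heads, d_head, seq_len,
                   bytes_per_elem=2, batch=1):
    # The leading 2 covers one K tensor and one V tensor per layer.
    return 2 * num_layers * num_kv_heads * seq_len * d_head * bytes_per_elem * batch

GiB = 1024 ** 3
llama3_8b = dict(num_layers=32, num_kv_heads=8, d_head=128)

for ctx in (2 * 1024, 32 * 1024, 128 * 1024):
    print(ctx, kv_cache_bytes(seq_len=ctx, **llama3_8b) / GiB, "GiB")

# Using the 32 query heads instead of the 8 KV heads inflates the estimate.
wrong = kv_cache_bytes(32, 32, 128, 32 * 1024)
right = kv_cache_bytes(32, 8, 128, 32 * 1024)
print(wrong / right)   # overestimation factor
```

Passing bytes_per_elem=1 models an fp8 or int8 cache, and batch=100 shows why a hundred concurrent 32k requests cannot fit on one GPU.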

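Techniques for Shrinking the KV Cache

MQA (Multi-Query Attention) (Shazeer 2019): all heads share a single K/V. Shrinks the KV cache by a factor of num_heads (32→1), with some quality loss.
GQA (Grouped-Query Attention) (Ainslie et al. 2023): K/V are shared within groups of query heads. LLaMA-2-70B and all LLaMA-3 models use GQA. A compromise between MQA's quality loss and MHA's memory cost.
MLA (Multi-head Latent Attention) (DeepSeek-V2/V3): compresses K/V into a low-dimensional latent. Among the smallest KV caches reported for models of that scale.
Quantized KV cache: stores KV in fp8/int8/int4 for additional compression.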

3.9 FlashAttention: IO-Aware Attention

The bottleneck in attention is data movement between HBM and SRAM, not arithmetic. FlashAttention (Dao et al. 2022):

Streams the attention matrix in chunks (tiles) instead of materializing it in memory
Combines per-chunk softmax results exactly using the log-sum-exp trick (L11-10 §3.3)
Activation memory O(n²) → O(n) (FLOPs remain O(n²); attention stays exact and only IO is optimized)
2~4× faster; the de facto standard for LLM training and inference
FlashAttention-3 (2024, Dao): on H100, FP16 reaches ~740 TFLOPs/s (~75% of the H100 theoretical peak) and FP8 ~1.2 PFLOPs/s. Versus FA-2: forward 1.5~2.0×, backward 1.5~1.75×
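
The chunked softmax combination can be demonstrated in NumPy. This sketch shows only the rescaling math (a running max and a running denominator); it does not reproduce the fused GPU kernel, the SRAM tiling of queries, or the backward pass. The chunk size and tensor shapes are arbitrary assumptions.

```python
import numpy as np

def naive_attention(Q, K, V):
    s = Q @ K.T / np.sqrt(Q.shape[-1])           # full (n, n) matrix in memory
    w = np.exp(s - s.max(axis=-1, keepdims=True))
    return (w / w.sum(axis=-1, keepdims=True)) @ V

def chunked_attention(Q, K, V, chunk=4):
    n, d = Q.shape
    m = np.full((n, 1), -np.inf)         # running row max
    l = np.zeros((n, 1))                 # running softmax denominator
    acc = np.zeros((n, V.shape[-1]))     # running unnormalized output
    for start in range(0, K.shape[0], chunk):
        Kc, Vc = K[start:start + chunk], V[start:start + chunk]
        s = Q @ Kc.T / np.sqrt(d)        # only an (n, chunk) tile exists at once
        m_new = np.maximum(m, s.max(axis=-1, keepdims=True))
        scale = np.exp(m - m_new)        # rescale the earlier partial sums
        p = np.exp(s - m_new)
        l = l * scale + p.sum(axis=-1, keepdims=True)
        acc = acc * scale + p @ Vc
        m = m_new
    return acc / l

rng = np.random.default_rng(0)
Q, K, V = rng.standard_normal((3, 10, 8))
print(np.allclose(naive_attention(Q, K, V), chunked_attention(Q, K, V)))
```

The two functions agree to floating-point precision, which is the sense in which FlashAttention is exact rather than approximate.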

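3.10 Long-Context Techniques

Longer context makes attention more expensive at O(n²). Techniques used in practice:

Sliding Window (Mistral 7B): attends only to the nearest W tokens. Handles local context well; long-range information propagates only through stacked layers
Sparse Attention: computes only a subset of token pairs. Longformer, BigBird
Linear Attention: approximates attention in O(n), via kernel feature maps (Performer) or low-rank projection (Linformer), with a quality trade-off
State Space Models (Mamba, Mamba-2): an alternative to transformers. O(n) complexity; a revival of the RNN idea
Hybrid (Mamba+Transformer): Jamba 1.5 interleaves attention and Mamba layers at a 1:7 ratio. KV cache at 256k context is ~4 GB (a full-attention MoE such as Mixtral needs ~32 GB)
RoPE Scaling (NTK-aware, YaRN): extends a trained RoPE to longer contexts. The YaRN paper reports needing 10× fewer tokens and 2.5× fewer training steps than earlier methods (a ramp function interpolates each dimension by a different amount). Extends 32k → 128k with no fine-tuning or a short fine-tune
Ring Attention / Sequence Parallelism: shards the sequence across GPUs, making 1M-token contexts feasible (Gemini 1.5 offers a context of this size, though its implementation is not public)
PagedAttention (vLLM): manages the KV cache in pages, like OS virtual memory. Memory waste from fragmentation and over-reservation drops from 60-80% in earlier systems to <4% in vLLM; throughput 2~4×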
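
As one concrete example from the list, a sliding-window mask can be built in a few lines. The sizes (n=6, window=3) are toy assumptions; Mistral 7B uses a much larger window.

```python
import numpy as np

def sliding_window_mask(n, window):
    """True where query i may attend key j: causal and within `window` tokens."""
    i = np.arange(n)[:, None]
    j = np.arange(n)[None, :]
    return (j <= i) & (j > i - window)

mask = sliding_window_mask(n=6, window=3)
print(mask.astype(int))
print(mask.sum(axis=1))   # keys visible per query, capped at the window size
```

Because each query sees at most `window` keys, per-token cost and the KV entries that must be kept stay bounded as the sequence grows; information from further back reaches a token only indirectly, through earlier layers.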

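3.11 Inference Optimization: Speculative Decoding and Prefix Caching

The most direct cost-saving techniques for operators.

Speculative Decoding: a small draft model predicts K tokens ahead, and the large target model verifies and accepts them in a single forward pass. EAGLE-3 with vLLM has been reported to cut cost per output token by ~19% on code workloads. Improves both latency and cost for chatbots and agents.
Prefix Caching (Automatic Prefix Caching, APC): caches the KV of shared prefixes such as system prompts, few-shot examples, and RAG context. Later requests skip most of the prefill cost (lower TTFT). A standard feature in vLLM and SGLang.
Continuous batching: batches requests dynamically at token granularity. When one request finishes, a new one takes its slot immediately → higher GPU utilization
FlashDecoding / FlashDecoding++: optimizes the decode phase (small batch, query length 1) by parallelizing across the KV sequence length; helps long-context single-stream inference

Applied together, these four techniques can raise throughput on the same GPU by up to roughly 5~10× (vLLM figures; the actual gain depends on the workload).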
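
The draft-then-verify loop of speculative decoding can be sketched with toy stand-ins for the two models. Everything here is hypothetical: the "models" are simple integer functions, verification is greedy (exact match) rather than the probabilistic acceptance rule used with sampling, and the target is called in a loop where a real system would use one batched forward pass.

```python
def speculative_step(prefix, draft_next, target_next, k=4):
    """One greedy speculative-decoding step. Returns (new_tokens, accepted_count)."""
    # 1) The draft model proposes k tokens autoregressively (cheap).
    proposal, ctx = [], list(prefix)
    for _ in range(k):
        tok = draft_next(ctx)
        proposal.append(tok)
        ctx.append(tok)
    # 2) The target model checks each proposed position in order.
    new_tokens, ctx = [], list(prefix)
    for tok in proposal:
        expected = target_next(ctx)
        if tok != expected:
            new_tokens.append(expected)      # first mismatch: keep the target's token
            return new_tokens, len(new_tokens) - 1
        new_tokens.append(tok)
        ctx.append(tok)
    new_tokens.append(target_next(ctx))      # all accepted: one bonus token
    return new_tokens, k

# Toy models: the target counts up by 1; the draft agrees except that it
# skips a number whenever the last token is a multiple of 3.
target_next = lambda ctx: ctx[-1] + 1
draft_next = lambda ctx: ctx[-1] + (2 if ctx[-1] % 3 == 0 else 1)

tokens, accepted = speculative_step([1], draft_next, target_next, k=4)
print(tokens, accepted)   # → [2, 3, 4] 2
```

The output always equals what the target alone would have generated, so quality is unchanged; the gain comes from emitting several tokens per target pass when the acceptance rate is high.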

3.11.1 MoE (Mixture of Experts)

A router selects which experts to activate for each token, so active parameters < total parameters. This cuts inference compute.

Mixtral 8×7B: 8 experts, top-2 active. ~47B total, ~13B active at inference
DeepSeek-V3: 1 shared expert + 256 routed experts, top-8 routed active. 671B total, 37B active
Routing loss: an auxiliary loss for load balancing. DeepSeek-V3 introduces auxiliary-loss-free load balancing
Multi-Token Prediction (MTP): DeepSeek-V3 predicts several tokens at once → better training and inference efficiency
Operational impact: memory scales with total parameters, compute cost with active parameters → memory and cost are decoupled
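
Top-k routing can be sketched as follows. The experts are plain (d, d) matrices standing in for full FFNs, the gate is a softmax over the selected logits only, and the sizes are toy values; all of these are assumptions for illustration rather than any specific model's implementation.

```python
import numpy as np

def moe_layer(x, w_router, experts, top_k=2):
    """x: (n, d). experts: list of (d, d) matrices standing in for expert FFNs."""
    logits = x @ w_router                            # (n, num_experts)
    top = np.argsort(logits, axis=-1)[:, -top_k:]    # chosen expert ids per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        chosen = logits[t, top[t]]
        gate = np.exp(chosen - chosen.max())
        gate /= gate.sum()                           # softmax over chosen experts only
        for g, e in zip(gate, top[t]):
            out[t] += g * (x[t] @ experts[e])        # only top_k experts run per token
    return out, top

rng = np.random.default_rng(0)
n, d, num_experts = 6, 8, 8
x = rng.standard_normal((n, d))
w_router = rng.standard_normal((d, num_experts))
experts = [rng.standard_normal((d, d)) * 0.1 for _ in range(num_experts)]

out, top = moe_layer(x, w_router, experts)
print(out.shape, top.shape)
print(np.bincount(top.ravel(), minlength=num_experts))   # tokens routed to each expert
```

All eight expert matrices must be resident in memory, yet each token touches only two of them, which is the memory-vs-cost split described above. The per-expert counts in the last line are what a load-balancing loss tries to even out.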

3.12 Non-LLM Transformers (Brief)

ViT (Vision Transformer): splits an image into patches as tokens → transformer
CLIP: image-text dual encoder, contrastive learning
Whisper: audio → text encoder-decoder
AlphaFold 2: attention variants for protein structure prediction

→ The “transformer” is not an LLM-only architecture but a general tool for processing sequences and sets.

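3.13 Breaking Conditions, Quantified (for Operational Decisions)

Technique	Where it works	Breaking condition
GQA (group=8)	8B~70B, where KV cache matters most	Little benefit for <1B models; a single group (MQA) loses quality
MQA	Throughput-first inference	4~7% quality drop on long-context tasks (reported by Voyage)
Sliding Window (W=4k)	Local-context tasks	Accuracy collapses on tasks that need long-range dependencies
RoPE scaling 4×	4× extension	Factors of 8× or more collapse quality without fine-tuning
Speculative decoding	Many output tokens (>100)	Short outputs (<20 tokens) only add overhead
Prefix caching	Frequent reuse of the same prefix	Frequently changing prefixes cause cache misses and raise cost
MoE	Ample memory, throughput-first	At inference batch 1, expert utilization is low and inefficient
FlashAttention	seq_len > 1k	For short inputs (seq_len < 256) plain SDPA is enough (no added overhead)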

3.14 Silent Failure Scenarios and Recovery

Silent degradations that transformer operators run into often.

Symptom	Quantitative signal	Cause	Recovery
KV cache OOM	GPU memory >95%, batch cannot be admitted	seq_len × batch exceeds the limit	Lower batch, cap seq_len, paged attention (vLLM)
Quality collapse after RoPE scaling	Long-context perplexity up 50%+	Factor too large	Smoother interpolation with YaRN, short fine-tune
Rising speculative draft rejection	Acceptance rate <30%	Weak draft model	Switch to EAGLE-3, fine-tune the draft model
Prefix cache miss spike	Hit ratio <30%	System prompt changes often	Split the prompt structure (fixed prefix + dynamic suffix)
Unbalanced MoE utilization	Some experts' activation share >30%	Under-trained routing	Raise the auxiliary load-balancing loss, fine-tune
Long-context accuracy collapse	NIAH score down 50%+	Lost-in-the-middle or RoPE limits	Reorder documents, work around with RAG

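3.15 General Mapping of the Transformer (Transferable Pattern)

The core of attention (match queries to keys, then take a weighted sum of values) is a pattern that recurs in other systems.

Attention component	General-system mapping
Query (what is sought)	DB query, search query, cache lookup key
Key (index)	DB index, inverted search index, hash key
Value (actual data)	DB row, search document, cache value
`Q K^T` (similarity)	DB join condition, search BM25/cosine
Softmax (normalization)	Weighted vote, soft selection
Multi-head (parallel)	DB partitioning, distributed index, multiple perspectives
Causal mask	Append-only log, blocking the future in time series
KV cache	Session cache, streaming aggregation buffer

General formula: “targeted lookup + weighted combination” is the essence of attention, and the same pattern appears throughout data systems. When you meet a new attention variant (Linear, Sparse, Mamba, and so on), decompose it into the same four parts (Q, K, V, aggregation) to analyze it.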

4. Where It Is Used in Practice

LLM inference (every chat and completion API)
Embedding models (OpenAI, Cohere, BGE, gte, e5)
Translation (Marian, NLLB)
Code models (CodeLlama, DeepSeek-Coder)
Multimodal (CLIP, BLIP, LLaVA, GPT-4o vision)
Speech (Whisper)
Proteins (AlphaFold)

Operational Scenario: Choosing an LLM Inference Instance (Example)

Situation: internal chatbot, P95 TTFT < 1s required, 100 concurrent requests
Options (self-hosted LLaMA-3-8B):
  A. Single H100 80GB + vLLM:
     - KV cache at 32k context = 4GB × 100 batch = 400GB → OOM!
     - Cap context at 4k → KV cache ~50GB, fits
  B. PagedAttention + continuous batching:
     - KV memory waste 60% → <4%, throughput 2~4× ↑
     - 100 concurrent requests feasible, P95 TTFT ~800ms
  C. Add speculative decoding (EAGLE-3):
     - per-output-token latency 2~3× ↓ (TTFT is set by prefill, so it stays ~800ms)
     - draft model adds ~2GB memory

Choice: B + C. 4k context + PagedAttention + speculative decoding.
Rejected alternatives: A alone has weak throughput; C alone does nothing for memory.
Result (hypothetical): P95 TTFT ~800ms (within the 1s target), faster token streaming, throughput 80 req/s.

This applies §3.8 KV cache + §3.10 long-context + §3.11 inference optimization + §3.13 breaking conditions.
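
The memory side of this scenario can be checked with a short calculation. The function name, the 4 GiB allowance for activations and runtime overhead, and the rounding of the fp16 weights to 16 GiB are assumptions; the 131072 bytes per token is the LLaMA-3-8B figure from §3.8.

```python
def max_concurrent_requests(gpu_gib, weight_gib, ctx_len,
                            kv_bytes_per_token=131072, overhead_gib=4.0):
    """How many full-length requests fit in the memory left after weights."""
    free_bytes = (gpu_gib - weight_gib - overhead_gib) * 1024 ** 3
    return int(free_bytes // (kv_bytes_per_token * ctx_len))

# LLaMA-3-8B in fp16: roughly 16 GiB of weights (8B params x 2 bytes, rounded up).
for ctx in (4 * 1024, 32 * 1024):
    print(ctx, max_concurrent_requests(gpu_gib=80, weight_gib=16, ctx_len=ctx))
```

Under these assumptions a 4k cap leaves room for the 100 concurrent requests, while 32k contexts fit only about fifteen. This is a worst-case bound where every request uses its full context; paged allocation lets real workloads pack more, because most requests are shorter than the cap.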

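5. Connection to My Current Work

For a platform engineer operating LLMs, transformer intuition helps with the following.

API cost intuition: input is prefill (parallelizable), output is decode (sequential), which is why output tokens usually cost 4~10× more. Context length adds cost through the KV cache
Batch sizing: larger batch → higher throughput, but the KV cache is multiplied by the batch and memory balloons. Continuous batching and paged attention (vLLM) are the operational standard
Model size selection: at the same token throughput, going from 7B → 70B costs roughly 10× more with roughly 4× the latency (rule of thumb). GQA and MoE models are often cheaper at equal quality
Long-context cost estimates: a 128k context needs about 8-16 GB of KV cache per request (LLaMA-3-8B class, fp8 to fp16) → few requests fit on one GPU. Trade-off against RAG (L12-30)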

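6. Commonly Confused Concepts

Concept A	Concept B	Difference
Self-attention	Cross-attention	Within one sequence vs. across two sequences (encoder-decoder)
Multi-head	Single-head	Several patterns in parallel vs. a single representation
Encoder-only	Decoder-only	Bidirectional vs. unidirectional (causal). LLMs are the latter
Sinusoidal vs RoPE	Learned positional	Absolute and fixed vs. rotary and relative vs. learned weights
MHA / MQA / GQA		Separate KV per head / shared by all heads / shared per group. They differ in cache memory
Dense	MoE	All parameters active vs. only some experts active (active < total)
KV cache	activation memory	Inference-time cache vs. forward results stored for the training backward pass
FlashAttention	Sparse Attention	IO optimization of exact attention vs. skipping some attention
Prefill	Decode	Parallel processing of the input vs. sequential output, one token at a time
Transformer	Mamba / SSM	O(n²) attention vs. O(n) state space. A newer family of candidates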

7. Checklist

I can explain the self-attention formula softmax(QK^T/√d_k)V step by step as matrix products and a softmax
I can explain why we divide by √d_k in terms of variance (see L11-10 §3.2)
I can estimate KV cache memory to the right order of magnitude (e.g., LLaMA-3-8B at 32k context = ? GB)
I can separately explain why LLM cost grows as O(n²) with context length and why FlashAttention reduces only memory to O(n)
I can explain how GQA and MQA shrink the KV cache and what the quality trade-off is
I can explain why decoder-only became the LLM standard from the in-context learning angle
I can explain why RoPE suits long-context extension (NTK-aware/YaRN scaling)
I can explain how the “active vs total parameter” distinction in MoE models affects memory and cost

8. Keywords for Further Study

Attention variants: MHA, MQA, GQA, MLA, Linear Attention, Sparse, Sliding Window
Positional: RoPE, ALiBi, NoPE, NTK-aware, YaRN
Block structure: Pre-norm, RMSNorm, SwiGLU, GeGLU, residual
Long-context infrastructure: Ring Attention, Sequence Parallelism, paged attention (vLLM)
Inference optimization: continuous batching, speculative decoding, PagedAttention, FlashDecoding
MoE: top-k routing, load balancing, expert parallelism
Alternatives: Mamba/SSM, Hyena, RWKV, Jamba (hybrid)

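9. Things to Verify Myself

Implement attention directly

Implement self-attention for (n=4, d_k=8) in PyTorch at the numpy level. Check that softmax(QK^T/√d_k)V matches torch.nn.functional.scaled_dot_product_attention (nn.MultiheadAttention adds input and output projections, so it matches only if those are set to identity)
Apply a causal mask and check that the future positions in row i are 0

Measure the KV cache

In HuggingFace transformers, generate from the same prompt with use_cache=True/False and compare times. Expected: cache off is 5~10× slower
With LLaMA-3-8B, compare KV cache memory (torch.cuda.memory_allocated()) at context lengths 1k/4k/16k. Expected: linear growth (~1 GB per 8k tokens in fp16, per §3.8)

Compare models

Compare the KV cache memory of LLaMA-2-7B (MHA) and LLaMA-3-8B (GQA) at the same context. GQA should be ~4× smaller (32 vs. 8 KV heads, same layer count and d_head)
Compare long-context (8k+) latency of Mistral 7B and LLaMA-2-7B to get a feel for the sliding-window effect (LLaMA-2 was trained with a 4k context, so compare latency only, not output quality)

Long-context

In HuggingFace transformers, process a 32k input with a model that has RoPE scaling applied (rope_scaling={"type": "yarn", "factor": 4.0}; the key name may differ by library version)

When results differ from expectations

KV cache off looks faster → with batch 1 and a short prompt, cache setup overhead can outweigh the savings. The benefit shows with long prompts and outputs
GQA does not reduce memory → the KV cache shrinks but forward activations do not. The effect is small when seq_len is short
Quality collapses after RoPE scaling → factor too large. Try 4.0 → 2.0; additional fine-tuning may be needed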

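10. Five-Line Summary

A transformer is an attention-based sequence model; self-attention computes relationships between tokens dynamically as softmax(QK^T/√d_k)V.
Multi-head + RoPE + pre-norm + SwiGLU FFN + residual is the standard modern LLM block.
The KV cache takes up most of inference memory at long context, and GQA, MQA, MLA, and quantization are the standard ways to shrink it.
Attention is O(n²) in compute and memory, but FlashAttention brings memory down to O(n), which made long-context training feasible.
MoE cuts cost through active parameters < total parameters, and Mamba/SSM is a candidate alternative to the transformer's O(n²) limit.

11. Sources

Last modified: 2026-04-26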