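Tool and Function Calling

Category: Layer 12 - AI Systems & LLM Applications | Prerequisites: L12-10 (LLM API), L12-20 (Prompt)

Tool and Function Calling: Function Calling, MCP, Parallel Tools

1. One-Line Definition

Tool calling is the standard interface that lets an LLM call external functions, APIs, and tools safely. Its main forms are function calling (the OpenAI convention), tool use (Anthropic), and MCP (the open protocol Anthropic published in 2024). It is the foundation of agents.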

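2. Why It Matters

Starting point for agents: there is no agent without tools (L12-60)
Connection to the outside world: DBs, APIs, file systems, the web, email; compensates for the LLM's knowledge cutoff
Action security: a wrong tool call can cause data loss, runaway cost, or an external incident
Standardization: in 2024-2025 MCP emerged as a standard protocol between LLMs, tools, and data sources
Operational evaluation: benchmarks such as BFCL quantify tool calling accuracy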

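Background: the limits of the pre-function-calling era and the jump to N+M

Toolless era (before 2023-06): the prompt told the LLM to "print [CALL get_weather(Seoul)] when an external call is needed," and the client parsed the output with a regex or a BNF grammar. When the model drifted from the format, the call failed silently, and there was no JSON schema to validate against. In other cases the model simply answered in natural language ("You can send a GET to the Weather API"), which broke the automation. This was the fragile parsing covered in L12-20 (prompt engineering), taken to its extreme.

Stage 1, Function Calling (OpenAI, 2023-06-13): gpt-4-0613 and gpt-3.5-turbo-0613 began accepting a JSON schema specification and returning structured calls in the function_call field (source: OpenAI announcement). Regex parsing could be dropped, and a JSON schema validator could block silent failures. Limitation: each provider has its own call format (§3.2), so moving to Anthropic or Gemini meant writing a new adapter each time.

Stage 2, the N×M integration explosion (2023-2024): for M LLM apps to call N external services (Slack, GitHub, Postgres, Notion, ...), M×N adapters are needed. Every time a SaaS added support for a new LLM it wrote another in-house wrapper, so integration cost grew multiplicatively (source: Wikipedia, Model Context Protocol, N×M problem).

Stage 3, MCP (Anthropic, 2024-11-25): a JSON-RPC-based standard protocol that lowers integration cost to N+M. A tool provider builds one MCP server, and it works immediately in every MCP-compatible client (Claude Desktop, Cursor, Cline, Windsurf). Function calling gave each provider a structured call format; MCP standardizes, one layer above, how tools themselves are distributed, discovered, and permission-scoped. The two combine rather than replace each other: when the LLM calls a tool it still goes through each provider's function calling format, and only the server-side interface is standardized.

What breaks if this topic disappears: the agent automation loop (L12-60), file editing in code assistants, and the retrieval action in RAG all stand on tool calling. Reverting to regex parsing means there is no schema validator at all, so hallucinated calls flow silently into production.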

3. Core Concepts

3.1 Function Calling Basics

The LLM is given tool specifications and returns, as JSON, the function to call and its parameters.

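// Assumes the official `openai` npm package and an OPENAI_API_KEY environment variable.
import OpenAI from "openai";

const openai = new OpenAI();

// 1. Tool definition (OpenAI format)
const tools = [
  {
    type: "function",
    function: {
      name: "get_weather",
      description: "Get the current weather for a city",
      parameters: {
        type: "object",
        properties: {
          city: { type: "string", description: "City name" },
          unit: { type: "string", enum: ["celsius", "fahrenheit"] },
        },
        required: ["city"],
      },
    },
  },
];

// 2. Call the LLM and pass the tools
const resp = await openai.chat.completions.create({
  model: "gpt-4o",
  messages: [{ role: "user", content: "Weather in Seoul?" }],
  tools,
});

// 3. The LLM returns a tool call
// resp.choices[0].message.tool_calls[0]:
//   { id: "call_...", type: "function",
//     function: { name: "get_weather", arguments: '{"city":"Seoul","unit":"celsius"}' } }

// 4. Execute the function, then append the result to messages as a role: "tool" message
// 5. Call the LLM again; it reads the result and writes the final answer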
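
Steps 4 and 5 above can be sketched in Python without any network call. This is a minimal sketch, assuming the tool call has already been read from the response and is represented as a plain dict in the OpenAI shape; `TOOL_REGISTRY`, `run_tool_call`, and the canned `get_weather` are hypothetical names used only for illustration.

```python
import json

def get_weather(city: str, unit: str = "celsius") -> dict:
    """Stand-in for a real weather lookup; returns canned data."""
    return {"city": city, "temp": 22, "unit": unit, "condition": "clear"}

# Hypothetical registry mapping tool names to local Python callables.
TOOL_REGISTRY = {"get_weather": get_weather}

def run_tool_call(tool_call: dict) -> dict:
    """Execute one tool call and build the role="tool" message for the next request."""
    name = tool_call["function"]["name"]
    # The arguments field arrives as a JSON string, not as an object.
    args = json.loads(tool_call["function"]["arguments"])
    result = TOOL_REGISTRY[name](**args)
    return {
        "role": "tool",
        "tool_call_id": tool_call["id"],  # ties the result back to the call
        "content": json.dumps(result, ensure_ascii=False),
    }

# One entry of message.tool_calls, written out as a plain dict.
example_call = {
    "id": "call_abc123",
    "type": "function",
    "function": {
        "name": "get_weather",
        "arguments": '{"city": "Seoul", "unit": "celsius"}',
    },
}
tool_message = run_tool_call(example_call)
```

To finish the round trip, the assistant message that contained the tool call and then `tool_message` are appended to `messages`, and the model is called again to produce the final answer.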

3.2 Format Differences by Provider

Provider	Name	Structure
OpenAI	function calling	`tools[].function.parameters` (JSON schema)
Anthropic	tool use	`tools[].input_schema` (JSON schema)
Gemini	function calling	`tools[].functionDeclarations[].parameters`
LangChain	`@tool` decorator	schema generated automatically from Pydantic, dataclass, or Zod
LiteLLM	OpenAI-compatible	provider abstraction

Most are JSON schema based, so the parameter schema itself carries over almost unchanged; what differs is the wrapper field names and the response shape. A provider abstraction library (LiteLLM, LangChain) can unify them.
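
Because only the wrapper differs, a small adapter is often enough for simple cases. The sketch below converts one OpenAI-style tool definition into the Anthropic shape from the table; `openai_tool_to_anthropic` is a hypothetical helper, not part of either SDK, and it ignores provider-specific extras.

```python
def openai_tool_to_anthropic(tool: dict) -> dict:
    """Convert one OpenAI-style tool definition to the Anthropic tool-use shape."""
    fn = tool["function"]
    return {
        "name": fn["name"],
        "description": fn.get("description", ""),
        # The JSON schema is reused unchanged; only the key name differs.
        "input_schema": fn.get("parameters", {"type": "object", "properties": {}}),
    }

openai_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Get the current weather for a city",
        "parameters": {
            "type": "object",
            "properties": {"city": {"type": "string"}},
            "required": ["city"],
        },
    },
}
anthropic_tool = openai_tool_to_anthropic(openai_tool)
```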

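3.3 Tool Schema Design Principles

A good tool description largely determines how accurately the LLM calls the tool.

Write detailed descriptions: instead of "Gets city weather", write "Takes a city name and returns the current temperature, humidity, and precipitation probability as JSON. city is an English city name (Seoul, Tokyo, ...)"
Give parameter meaning and examples: {"city": "Seoul (must be English, not Korean)"}
State required vs optional explicitly
Constrain with enum: unit: ["celsius", "fahrenheit"]
Names should express the action: get_user_email_by_id is clearer than fetch
Number of tools: usually ≤20. More than that raises the risk of hallucinated tool calls. If an agent has too many tools, split it into sub-agents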
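
The principles above can be turned into a quick lint pass over tool definitions. This is a rough sketch: `lint_tool` and the 40-character description threshold are assumptions chosen for illustration, not rules from any provider.

```python
def lint_tool(tool: dict, min_description_chars: int = 40) -> list[str]:
    """Return warnings for an OpenAI-style tool definition (hypothetical checks)."""
    fn = tool["function"]
    params = fn.get("parameters", {})
    warnings = []
    if len(fn.get("description", "")) < min_description_chars:
        warnings.append("description is short; say what is returned and in what format")
    if "required" not in params:
        warnings.append("no 'required' list; required vs optional is implicit")
    for name, spec in params.get("properties", {}).items():
        if "description" not in spec and "enum" not in spec:
            warnings.append(f"parameter '{name}' has neither a description nor an enum")
    return warnings

vague_tool = {
    "type": "function",
    "function": {
        "name": "fetch",
        "description": "gets data",
        "parameters": {"type": "object", "properties": {"q": {"type": "string"}}},
    },
}
clear_tool = {
    "type": "function",
    "function": {
        "name": "get_weather",
        "description": "Return the current temperature, humidity, and "
                       "precipitation probability for a city as JSON.",
        "parameters": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "English city name, e.g. Seoul"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    },
}
vague_warnings = lint_tool(vague_tool)
clear_warnings = lint_tool(clear_tool)
```

Running such a check in CI keeps new tools from entering the registry with one-word descriptions, though it cannot judge whether a description is actually accurate.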

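3.4 Parallel Tool Calling

The LLM requests N tool calls in a single turn (supported by OpenAI, Anthropic, and Gemini).

// LLM response (simplified; in the OpenAI format, name and arguments sit under `function`):
tool_calls: [
  { name: "get_weather", arguments: '{"city": "Seoul"}' },
  { name: "get_news", arguments: '{"topic": "tech"}' },
];

// The client runs both in parallel and returns both results to the LLM

Pros: lower latency, gathers several pieces of information at once
Cons: calls that depend on each other must still be handled sequentially (sequential agent)

Silent failures and the flags that disable parallel calls

Duplicate calls to the same tool: a regression was reported in which gpt-4.1-nano-2025-04-14, with parallel calls enabled, calls the same tool 2-3 times for no reason (source: vercel/ai issue #7517). Cost and rate-limit usage both spike, but the model response looks normal, so the failure is silent.
Shared-state write race: the model decides on update_user(id=1, name=X) and delete_user(id=1) in parallel in the same turn, so the outcome depends on the client's execution order and is nondeterministic. If both tools return 200 OK, everything looks successful from the LLM's point of view.
Disable flags: passing parallel_tool_calls: false in an OpenAI request, or disable_parallel_tool_use: true (inside tool_choice) in an Anthropic request, limits the model to at most one tool call per turn (sources: OpenAI function calling guide, AI SDK Anthropic provider options). Operational default: allow parallel calls only for read-only tools; for agents that mix in mutate/write tools (delete_*, update_*, send_*), set the flag above and force sequential execution.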
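
The operational default above can also be enforced on the client side, independent of the provider flag. The following is a minimal sketch, assuming tools are async callables kept in a dict and that mutating tools can be recognized by name prefix; `execute_calls`, `is_mutating`, and the demo tools are hypothetical.

```python
import asyncio

MUTATING_PREFIXES = ("delete_", "update_", "send_")

def is_mutating(tool_name: str) -> bool:
    return tool_name.startswith(MUTATING_PREFIXES)

async def execute_calls(calls: list, registry: dict) -> list:
    """Run calls concurrently only when every call is read-only; otherwise in order."""
    async def run(call):
        return await registry[call["name"]](**call["arguments"])

    if any(is_mutating(call["name"]) for call in calls):
        return [await run(call) for call in calls]  # sequential, in the model's order
    return list(await asyncio.gather(*(run(call) for call in calls)))

events = []  # records start/end order so the scheduling is observable

def make_tool(name):
    async def tool(**kwargs):
        events.append(f"start {name}")
        await asyncio.sleep(0.01)  # stands in for I/O
        events.append(f"end {name}")
        return name
    return tool

registry = {n: make_tool(n) for n in ("get_weather", "get_news", "update_user", "delete_user")}

read_calls = [
    {"name": "get_weather", "arguments": {"city": "Seoul"}},
    {"name": "get_news", "arguments": {"topic": "tech"}},
]
write_calls = [
    {"name": "update_user", "arguments": {"id": 1, "name": "X"}},
    {"name": "delete_user", "arguments": {"id": 1}},
]

read_results = asyncio.run(execute_calls(read_calls, registry))
read_events = events.copy()
events.clear()
write_results = asyncio.run(execute_calls(write_calls, registry))
write_events = events.copy()
```

In the read-only batch both tools start before either finishes; in the batch with write tools each call finishes before the next begins, which removes the execution-order race. It does not decide whether update-then-delete is what the user wanted, so a confirmation gate is still needed for that.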

3.5 Tool Execution Flow

Single-turn

user → LLM → tool call → tool execution → LLM → answer

Multi-turn (agentic loop)

user → LLM → tool A → result → LLM → tool B → result → LLM → answer

Covered in depth in L12-60 (agents).

Combining with the ReAct pattern (L12-20 §3.5)

Thought: weather information is needed
Action: get_weather("Seoul")
Observation: 22℃, clear
Thought: enough information
Answer: ...
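
The multi-turn flow can be sketched as a loop with an iteration cap. This is a simplified sketch: `run_agent` is a hypothetical helper, `scripted_model` stands in for a real LLM call, and the message shapes are reduced to what the loop needs.

```python
import json

def run_agent(model, tools: dict, user_message: str, max_iter: int = 10):
    """Call the model until it answers without tool calls, or give up after max_iter."""
    messages = [{"role": "user", "content": user_message}]
    for _ in range(max_iter):
        reply = model(messages)
        messages.append(reply)
        if not reply.get("tool_calls"):
            return reply["content"], messages  # final answer
        for call in reply["tool_calls"]:
            result = tools[call["name"]](**call["arguments"])  # Action
            messages.append({                                   # Observation
                "role": "tool",
                "tool_call_id": call["id"],
                "content": json.dumps(result),
            })
    raise RuntimeError(f"agent did not finish within {max_iter} iterations")

def scripted_model(messages):
    """Fake model: asks for the weather once, then answers from the observation."""
    observations = [m for m in messages if m["role"] == "tool"]
    if not observations:
        return {"role": "assistant", "content": None, "tool_calls": [
            {"id": "call_1", "name": "get_weather", "arguments": {"city": "Seoul"}}]}
    data = json.loads(observations[-1]["content"])
    return {"role": "assistant", "content": f"Seoul: {data['temp']} C, {data['condition']}"}

def stuck_model(messages):
    """Fake model that never stops calling the same tool."""
    return {"role": "assistant", "content": None, "tool_calls": [
        {"id": "call_x", "name": "get_weather", "arguments": {"city": "Seoul"}}]}

TOOLS = {"get_weather": lambda city: {"temp": 22, "condition": "clear"}}

answer, history = run_agent(scripted_model, TOOLS, "Weather in Seoul?")
```

With `stuck_model` the loop raises after `max_iter` rounds instead of spinning forever, which is the guard §3.8 recommends against infinite tool calls.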

3.6 MCP: Model Context Protocol

An open protocol between LLMs, tools, and data sources, published by Anthropic in 2024-11. It is based on JSON-RPC.

[Client (LLM app)]  ←→  [MCP Server (tool provider)]
                       (DB, file system, GitHub, Slack, ...)

Core concepts

Resources: data to read (files, DB rows)
Tools: actions to execute (write file, send slack)
Prompts: reusable prompt templates

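Value

Standardization: no need to write a separate SDK per tool. One MCP-compatible server works with every MCP-compatible client
Ecosystem: 200+ official/community MCP servers for GitHub, Slack, Notion, Postgres, and others (2025)
Security: permissions can be separated per server, and servers can be sandboxed
Compatibility: Anthropic Claude, Cursor, Cline, Windsurf, and others support MCP

Decision criteria: adopt an MCP server vs keep an in-house SDK

If two or more of the MCP signals below apply, choose MCP; otherwise an in-house SDK is often the right answer.

✅ Signals for MCP
- An official MCP server already exists (GitHub, Slack, Postgres, Notion, ...), so the code you would have written disappears
- External clients (Claude Desktop, Cursor, in-house IDE plugins) may call your tools
- Five or more tools in a multi-tenant setting where permissions and sandboxing differ per client
❌ Signals for keeping an in-house SDK
- Only 1-2 tools, and LLM calls happen inside a single app → the three MCP layers (server, client, transport) are over-engineering
- In-house protobuf/gRPC, per-tenant JWT permission separation, and private schemas are core assets → migrating to JSON-RPC over stdio/SSE costs more than it saves
- If audit logging and rate limiting are enforced in your own middleware, a new MCP server proxy must be inserted to preserve them (operational example: adopting the GitHub MCP server shrinks your REST wrapper code, but the audit-log middleware has to be reimplemented on the server side)
For transport, check the stdio/SSE/Streamable HTTP options in the official MCP specification first. If the internal network forbids stdio child processes, adoption itself may become impossible.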

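3.7 Evaluating Tool Calling: BFCL

Berkeley Function Calling Leaderboard (BFCL v3, released 2024-09-19, last updated 2024-12-10): the standard benchmark for tool calling. Sources: gorilla-llm/gorilla GitHub, official site.

Evaluation areas

BFCL v3 consists of 4,441 evaluation items (the sum of the first three categories below).

Non-Live (Single-Turn), 1,390 items: accuracy on Simple, Multi-function, and Parallel calls (AST-based and execution-based evaluation)
Live (Single-Turn), 2,251 items: production-like real-world scenarios (weighted average)
Multi-Turn, 800 items: multi-step tool calls. Uses state-based evaluation instead of simple parameter matching, verifying the actual state changes in the API system
Hallucination, 240-900 items: whether the model calls a tool that does not exist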
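
For an in-house evaluation set, the idea behind AST-style matching can be approximated with a small checker. This is a simplified sketch and not BFCL's own code: `call_matches` and the record layout (a list of acceptable values per parameter plus an `optional` list) are assumptions made for illustration.

```python
def call_matches(predicted: dict, expected: dict) -> bool:
    """True when the predicted call has the right name and acceptable arguments."""
    if predicted["name"] != expected["name"]:
        return False
    allowed = expected["arguments"]              # parameter -> acceptable values
    optional = set(expected.get("optional", ()))
    args = predicted["arguments"]
    if set(args) - set(allowed):
        return False                             # parameter that does not exist
    for param, acceptable in allowed.items():
        if param not in args:
            if param in optional:
                continue
            return False                         # required parameter missing
        if args[param] not in acceptable:
            return False
    return True

def accuracy(pairs: list) -> float:
    """Fraction of (predicted, expected) pairs that match."""
    return sum(call_matches(p, e) for p, e in pairs) / len(pairs)

expected = {
    "name": "get_weather",
    "arguments": {"city": ["Seoul"], "unit": ["celsius", "fahrenheit"]},
    "optional": ["unit"],
}
predictions = [
    {"name": "get_weather", "arguments": {"city": "Seoul"}},                     # ok
    {"name": "get_weather", "arguments": {"city": "Seoul", "unit": "kelvin"}},   # bad value
    {"name": "get_forecast", "arguments": {"city": "Seoul"}},                    # wrong tool
    {"name": "get_weather", "arguments": {"city": "Seoul", "unit": "celsius"}},  # ok
]
score = accuracy([(p, expected) for p in predictions])
```

A checker like this covers only the single-turn categories; the multi-turn category needs the state-based verification described above.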

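Interpreting scores

A BFCL score is an accuracy value in the range 0-1. Key reference values as of April 2025 (source: llm-stats.com/benchmarks/bfcl):

Top-of-leaderboard models: about 0.77-0.80
Average across all models: about 0.70

The "≥85, <70" thresholds are empirical community estimates, not official criteria. The official BFCL v3 documentation specifies no pass/fail thresholds; for real operational decisions, use leaderboard rank together with your own domain evaluation.

Operational implications

High BFCL score = suited to tool agents (Claude 4.x, GPT-4o, o3, Llama 4)
Low BFCL score = weak function calling → a fallback is needed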

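3.8 Tool Calling Silent Failures

The pitfalls operators run into most often.

Hallucinated tool: the LLM calls a tool that does not exist → the client silently ignores it
Parameter type error: {"age": "twenty"} (should be a number) → caught by JSON schema validation
Missing tool call: the LLM does not call a tool it needed → strengthen the prompt ("answer using tools")
Infinite tool calls: the same tool repeats inside the agent loop → cap max iterations
Permission escalation: a tool is used for an action the user is not authorized for → check permissions at tool execution time
Stale data: the tool returns old cached data → set an explicit cache TTL
Cost explosion: an expensive LLM call inside a tool → per-tool cost budget

Debugging signals

Symptom	Cause	Response
LLM does not call a tool	weak description, weak prompt	improve the description; state "you must use tools"
Wrong tool called	ambiguous description, many similar names	split tools, differentiate descriptions
Frequent parameter errors	ambiguous schema	enum, stronger descriptions, strict JSON schema
Tool repeats endlessly	LLM stuck	max_iter limit, record call history in the scratchpad
Output leaks to the user	error message exposed	show tool errors only to the LLM; sanitize what the user sees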
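3.9 Security: Tool Permissions and Sandboxing

Principle of least privilege: each tool gets the minimum permissions it needs
Allowlist: state explicitly which users may call which tools
Sandboxing: code-execution tools run in an isolated container (E2B, Modal, Anthropic Code Execution sandbox)
Confirmation: irreversible actions (sending email, payments, DB deletes) require user approval
Audit log: record every tool call for post-incident analysis
Prompt injection defense: tool results can carry prompt injection (instructions hidden in external data) → separate trusted from untrusted content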
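
Allowlist, confirmation, and audit log fit naturally into one wrapper around tool execution. This is a minimal sketch under assumed names: `guarded_call`, `ToolDenied`, the role-based `ALLOWLIST`, and the `confirm` callback are all hypothetical, and the in-memory list stands in for a real audit store.

```python
import time

class ToolDenied(Exception):
    """Raised when a call is blocked by the allowlist or the confirmation gate."""

# Hypothetical policy: which roles may call which tool, and which tools are irreversible.
ALLOWLIST = {"search_email": {"member", "admin"}, "delete_email": {"admin"}}
IRREVERSIBLE = {"delete_email"}

def guarded_call(tool_name, args, user, registry, audit_log, confirm):
    """Check permission, ask for confirmation if needed, run the tool, always log."""
    entry = {"ts": time.time(), "user": user["id"], "tool": tool_name,
             "args": args, "outcome": "error"}  # overwritten below unless the tool raises
    try:
        if user["role"] not in ALLOWLIST.get(tool_name, set()):
            entry["outcome"] = "denied: not on allowlist"
            raise ToolDenied(f"{user['id']} may not call {tool_name}")
        if tool_name in IRREVERSIBLE and not confirm(tool_name, args):
            entry["outcome"] = "denied: not confirmed"
            raise ToolDenied(f"{tool_name} was not confirmed")
        result = registry[tool_name](**args)
        entry["outcome"] = "ok"
        return result
    finally:
        audit_log.append(entry)  # every attempt is recorded, including denials

registry = {
    "search_email": lambda query: ["policy update"],
    "delete_email": lambda query: 50,  # pretends to delete 50 messages
}
audit_log = []
admin = {"id": "u-admin", "role": "admin"}
member = {"id": "u-member", "role": "member"}

deleted = guarded_call("delete_email", {"query": "policy"}, admin,
                       registry, audit_log, confirm=lambda tool, args: True)
```

In a chat product the `confirm` callback would surface an approval prompt to the user; here it is a plain function so the control flow stays visible.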

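3.10 Anthropic Computer Use API

A new paradigm Anthropic released in 2024-10: the LLM looks at the screen and controls the mouse and keyboard.

The LLM looks at a screenshot → returns an action such as "click(x,y)" or "type('hello')"
→ the client performs the input on the OS or browser
→ next screenshot → ...

Strength: general GUI automation (any app a person can use)
Weaknesses: high latency (one LLM call per screen), limited accuracy, security risk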
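Examples: OpenAI Operator (2025) and the open-source Browser Use project (browser automation). Claude Code is a related Anthropic agent, though it acts mainly through terminal and file tools rather than screenshots.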

3.11 Code Execution as a Tool

The LLM writes code, runs it in a sandbox, and uses the result in its answer.

OpenAI Code Interpreter / Advanced Data Analysis
Anthropic Code Execution (2025)
E2B, Modal, Daytona: serverless code execution sandboxes
Strong at data analysis, computation, and visualization
Security: run only inside a fully isolated sandbox

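3.12 Fine-tuning for Tool Use

Fine-tune a model so that it calls the tools of a specific domain well.

Data: (prompt, expected tool calls) pairs
HuggingFace TRL SFTTrainer: SFT on tool-use data
Example: Hermes 3, a NousResearch model released with strengthened tool use
Operational implication: take an open-weight model with a weak BFCL score and raise it by fine-tuning on domain data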

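3.13 Quantified Breaking Conditions (for operational decisions)

Technique	Effective range	Breaking condition
Function calling	limited tool set (≤20)	30+ tools → hallucinated tools; split into sub-agents
Parallel tool calling	independent tasks	dependent calls → sequential
MCP	standard ecosystem, 200+ servers	cases where writing a domain-specific SDK is simpler
Computer Use	general GUI automation	latency and accuracy limits; prefer function calling
Code Execution	data analysis, computation	isolated sandbox required (E2B, Modal, Daytona)
Allowlist	per-user permission separation	too narrow → legitimate work is refused
Confirmation gate	irreversible actions (email, payment)	applied too often → worse UX
Max_iter	prevents stuck agents	too small → legitimate multi-step work is cut off
Top-tier BFCL score	suited to tool agents	lower-tier models need fine-tuning or a fallback (community rule of thumb; no official threshold)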

3.14 Recovery Procedures for Tool Calling Silent Failures

Symptom	Quantitative signal	Cause	Recovery
Hallucinated tool	tool calls the client ignored	ambiguous description	stronger descriptions, strict JSON schema
Parameter type error	5%+ of calls fail	ambiguous schema	enum, stricter required, retry with error message
Missing tool call	LLM gives only short direct answers	weak description, weak prompt	"must use tools" prompt + explicit description
Infinite calls	iter > 20	LLM stuck	max_iter limit, call history in the scratchpad
Permission escalation	calls to unauthorized actions	allowlist not applied	per-user tool allowlist
Stale data	tool returns old data	cache TTL ignored	explicit TTL, freshness guarantee
Cost runaway	cost per call up 10×	expensive LLM call inside a tool	per-tool budget cap, alert
Output exposed to user	error message leaked	missing sanitization	tool errors go to the LLM only; users get a generic message
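
Several of the quantitative signals in the table can be computed directly from a tool-call log. The sketch below assumes a hypothetical log record layout (`session`, `tool`, `error`, `cost`) and uses the table's thresholds (5% parameter failures, more than 20 iterations, 10× cost); `detect_alerts` is an illustrative name.

```python
from collections import Counter

def detect_alerts(calls: list, known_tools: set, baseline_cost: float,
                  max_iter: int = 20) -> set:
    """Map a batch of tool-call log records to the alert names they trigger."""
    alerts = set()
    if not calls:
        return alerts
    if any(c["tool"] not in known_tools for c in calls):
        alerts.add("hallucinated_tool")
    failures = sum(1 for c in calls if c["error"] == "invalid_arguments")
    if failures / len(calls) >= 0.05:
        alerts.add("parameter_errors")
    calls_per_session = Counter(c["session"] for c in calls)
    if any(count > max_iter for count in calls_per_session.values()):
        alerts.add("runaway_loop")
    mean_cost = sum(c["cost"] for c in calls) / len(calls)
    if mean_cost >= 10 * baseline_cost:
        alerts.add("cost_runaway")
    return alerts

# 25 calls in one session (a loop) plus one call to a misspelled tool.
log = [{"session": "s1", "tool": "get_weather", "error": None, "cost": 0.001}
       for _ in range(25)]
log.append({"session": "s2", "tool": "get_wether", "error": "invalid_arguments", "cost": 0.001})

alerts = detect_alerts(log, known_tools={"get_weather", "get_news"}, baseline_cost=0.001)
```

Here one invalid call out of 26 stays under the 5% threshold, so only the loop and the unknown tool name are flagged.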

3.15 General Mapping of Tool Calling (Transferable Pattern)

LLM tool calling = "schema-based RPC". It is the same pattern found in other distributed systems.

Tool calling component	General system mapping
Tool definition (schema)	gRPC service definition, OpenAPI spec, GraphQL
Function signature	RPC method signature, REST endpoint
Parameter validation	DTO validation, request body schema (Joi, Zod)
MCP (JSON-RPC)	LSP, DAP (Language Server and Debug Adapter protocols), gRPC, SOAP
Parallel tools	concurrent RPC, GraphQL DataLoader
Tool allowlist	RBAC, IAM policy, API gateway authz
Sandbox	container isolation, AWS Lambda, Cloudflare Workers
Confirmation gate	2FA, manual approval workflow

General formula: a four-step RPC pattern of "schema → call → result → validation". What makes LLM tool calling special is that the decision to call is made in natural language; that nondeterminism is tamed with schemas, allowlists, and sandboxes.

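Operational scenario: a tool permission incident with an in-house automation agent (example)

Situation: in-house Slack agent (8 tools for email, calendar, and DB lookup)
Incident: a user says "delete all of last week's policy emails" → the agent actually does it (50 emails deleted)

Root cause analysis (trace):
  1. No allowlist on the delete_email tool (any user could call it)
  2. No confirmation gate (an irreversible action ran immediately)
  3. No audit log, so after-the-fact tracing was hard

Recovery:
  1. Immediately put delete_email behind an allowlist (admin only)
  2. Add a confirmation gate (irreversible actions run only after user approval)
  3. Enforce audit logging (record every tool call)
  4. Restore the emails from backup

Alternatives not chosen:
  - Removing the tool itself: no (it has legitimate use cases)
  - Changing the LLM: no (this is a permission problem, not an alignment problem)

Silent failure check (§3.14):
  - hallucinated tool, parameter errors, permission escalation, cost runaway

This scenario applies §3.3 (schema), §3.9 (security), §3.13 (breaking conditions), and §3.14 (silent failures) together.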

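4. Where It Is Used in Practice

Chatbots calling external APIs (weather, search, DB)
Code assistants (reading, writing, and running files)
Data analysis agents (SQL generation, execution, visualization)
Customer support (ticket lookup and creation)
Scheduling (calendar, email)
In-house tool automation (Slack, Notion, Jira)
Browser and OS automation (Computer Use)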

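5. Connection to My Current Work

For a platform engineer operating tool calling, this helps with the following.

MCP adoption decision: write an in-house tool SDK vs use MCP servers. With 200+ official/community MCP servers available, a hybrid is reasonable
Tool schema design: description detail determines call accuracy; standardize naming conventions
Tool registry: tools are code assets, like prompts. Version control, A/B tests, regression evaluation
Permissions and sandboxing: risky tools such as code execution and file write get isolation plus confirmation
Monitoring: dashboard for tool call latency, failure rate, and cost
Model selection: pick models suited to tool-heavy work based on BFCL score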

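6. Commonly Confused Concepts

Concept A	Concept B	Difference
Function calling	Tool use	OpenAI's name vs Anthropic's name. Nearly the same concept
Single tool	Parallel tools	one call at a time vs N at once
Single-turn	Multi-turn (agentic)	one call vs a repeated loop
Function calling	MCP	per-API SDK vs a standard JSON-RPC protocol
Function calling	Computer Use	explicit schema calls vs looking at the screen and driving mouse and keyboard
Tool description	Tool name	what it does and when to use it vs an identifier
Allowlist	Confirmation	restricting which tools can be called vs user approval before execution
BFCL Simple	Multi-turn	accuracy of a single call vs a multi-step flow
Code Interpreter	shell/terminal tool	isolated sandbox vs direct execution on the host (dangerous)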

7. Checklist

I can explain how the function calling schemas of OpenAI, Anthropic, and Gemini differ
I can tell good tool descriptions from bad ones (level of detail, enum, required)
I can explain the mechanism by which parallel tool calling reduces latency
I can name MCP's three concepts (Resources, Tools, Prompts) and its value (a standard protocol)
I can distinguish the five BFCL areas (simple/multi-function/multi-turn/parallel/live)
I can identify and respond to the seven tool calling silent failures (hallucinated/parameter error/missing/infinite calls/permission/stale/cost)
I can explain the sandboxing, confirmation, and audit log security patterns for Code Execution and Computer Use

8. Keywords for Further Study

Function calling: OpenAI, Anthropic, Gemini, JSON schema, parallel tools, structured output strict
MCP: Model Context Protocol, JSON-RPC, MCP server, MCP client, Resources, Tools, Prompts
Code execution: E2B, Modal, Daytona, Anthropic Code Execution, OpenAI Code Interpreter
Computer Use: Anthropic Computer Use, OpenAI Operator, Browser Use, ChromeDevTools MCP
Operational tooling: LangChain @tool, LlamaIndex tools, Vercel AI SDK tools, LiteLLM, BAML
Evaluation: BFCL v3, ToolBench, τ-bench, NexusRaven, ToolEmu
Security: prompt injection in tool results, principle of least privilege, audit log, sandboxing

9. Things to Verify Hands-On

Basic call: OpenAI function calling

# pip install openai
import json
from openai import OpenAI

client = OpenAI()  # requires the OPENAI_API_KEY environment variable

tools = [
    {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Takes a city name and returns the current temperature and weather condition. city is an English city name (e.g. Seoul, Tokyo)",
            "parameters": {
                "type": "object",
                "properties": {
                    "city": {"type": "string", "description": "English city name, e.g. Seoul"},
                    "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
                },
                "required": ["city"],
            },
        },
    }
]

resp = client.chat.completions.create(
    model="gpt-4o",
    messages=[{"role": "user", "content": "What's the weather in Seoul?"}],
    tools=tools,
)

# ── Success criteria ────────────────────────────────────────
# 1. finish_reason == "tool_calls"  (otherwise no tool was called)
# 2. resp.choices[0].message.tool_calls is a non-empty list
# 3. tool_calls[0].function.name == "get_weather"
# 4. json.loads(tool_calls[0].function.arguments) contains "city"
# ───────────────────────────────────────────────────────────

choice = resp.choices[0]
print("finish_reason:", choice.finish_reason)       # expected: "tool_calls"
if not choice.message.tool_calls:
    # tool_calls is None when the model answers in plain text.
    raise SystemExit(f"no tool call; model said: {choice.message.content!r}")
tool_call = choice.message.tool_calls[0]
print("tool name   :", tool_call.function.name)     # expected: "get_weather"
args = json.loads(tool_call.function.arguments)
print("arguments   :", args)                        # expected: {"city": "Seoul", ...}

Expected response structure (source: OpenAI function calling guide):

{
  "finish_reason": "tool_calls",
  "message": {
    "role": "assistant",
    "content": null,
    "tool_calls": [
      {
        "id": "call_abc123",
        "type": "function",
        "function": {
          "name": "get_weather",
          "arguments": "{\"city\": \"Seoul\", \"unit\": \"celsius\"}"
        }
      }
    ]
  }
}

Success: finish_reason == "tool_calls" and the tool_calls field is present
Failure: finish_reason == "stop" → the model answered in text without calling a tool → strengthen the description

Basic call: Anthropic tool use

# pip install anthropic
import anthropic

client = anthropic.Anthropic()  # requires the ANTHROPIC_API_KEY environment variable

tools = [
    {
        "name": "get_weather",
        "description": "Takes a city name and returns the current temperature and weather condition. city is an English city name (e.g. Seoul, Tokyo).",
        "input_schema": {
            "type": "object",
            "properties": {
                "city": {"type": "string", "description": "English city name, e.g. Seoul"},
                "unit": {"type": "string", "enum": ["celsius", "fahrenheit"]},
            },
            "required": ["city"],
        },
    }
]

resp = client.messages.create(
    model="claude-opus-4-5",
    max_tokens=1024,
    tools=tools,
    messages=[{"role": "user", "content": "What's the weather in Seoul?"}],
)

# ── Success criteria ────────────────────────────────────────
# 1. resp.stop_reason == "tool_use"
# 2. resp.content contains a block with type == "tool_use"
# 3. that block's name == "get_weather"
# 4. that block's input contains "city"
# ───────────────────────────────────────────────────────────

print("stop_reason:", resp.stop_reason)             # expected: "tool_use"
tool_block = next((b for b in resp.content if b.type == "tool_use"), None)
if tool_block is None:
    # The model answered in text only; there is no tool_use block to inspect.
    raise SystemExit("no tool_use block in the response")
print("tool name  :", tool_block.name)              # expected: "get_weather"
print("input      :", tool_block.input)             # expected: {"city": "Seoul", ...}

Expected response structure (source: Anthropic tool use docs):

{
  "stop_reason": "tool_use",
  "content": [
    {
      "type": "text",
      "text": "Let me check the weather in Seoul."
    },
    {
      "type": "tool_use",
      "id": "toolu_01A09q90qw90lq9",
      "name": "get_weather",
      "input": { "city": "Seoul", "unit": "celsius" }
    }
  ]
}

Success: stop_reason == "tool_use" and content contains a block with type: "tool_use"
Failure: stop_reason == "end_turn" → text answer without a tool call → strengthen the description

Key differences between OpenAI and Anthropic:

Item	OpenAI	Anthropic
Success signal	`finish_reason: "tool_calls"`	`stop_reason: "tool_use"`
Tool schema key	`parameters`	`input_schema`
Where tool calls appear	`message.tool_calls[]`	`content[]` (tool_use blocks)

Parallel

Define three tools (get_weather, get_news, get_stock) and send "Tell me about Seoul today"; check whether the tool_calls array contains three entries. Compare latency (sequential vs parallel).
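
The client-side half of the latency comparison can be rehearsed without an API key. This is a rough sketch in which `fake_tool` simulates each tool with a fixed 50 ms sleep; the helper names are hypothetical, and real numbers depend on the actual tools and network.

```python
import asyncio
import time

async def fake_tool(name: str, delay: float = 0.05) -> str:
    """Simulated tool: waits for `delay` seconds, then returns a canned result."""
    await asyncio.sleep(delay)
    return f"{name}: ok"

async def run_sequential(names: list) -> list:
    return [await fake_tool(n) for n in names]  # one after another

async def run_parallel(names: list) -> list:
    return list(await asyncio.gather(*(fake_tool(n) for n in names)))  # all at once

def timed(coro):
    """Run a coroutine to completion and return (result, elapsed seconds)."""
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

NAMES = ["get_weather", "get_news", "get_stock"]
seq_result, seq_time = timed(run_sequential(NAMES))
par_result, par_time = timed(run_parallel(NAMES))
print(f"sequential: {seq_time:.3f}s, parallel: {par_time:.3f}s")
```

Sequential time grows with the sum of the tool delays, while parallel time tracks the slowest single tool; the results come back in the same order either way.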

MCP hands-on: a minimal server with the Python mcp SDK

# pip install "mcp[cli]"   (Python ≥3.10, mcp SDK ≥1.2.0)
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("my-demo-server")  # server name

@mcp.tool()
def add(a: int, b: int) -> int:
    """Add two integers and return the sum."""
    return a + b

@mcp.tool()
def get_greeting(name: str) -> str:
    """Take a name and return a greeting string."""
    return f"Hello, {name}!"

if __name__ == "__main__":
    mcp.run()  # default: STDIO transport (compatible with Claude Desktop)

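How to run and verify (source: official MCP build-server guide):

# 1. Run the server directly (STDIO mode: JSON-RPC over standard input/output)
python my_server.py

# 2. Test locally with MCP Inspector (browser UI)
npx @modelcontextprotocol/inspector python my_server.py
# → open the local URL the Inspector prints (e.g. http://localhost:5173; the port depends on the Inspector version) to list and call tools

# 3. Claude Desktop integration (macOS path): ~/Library/Application Support/Claude/claude_desktop_config.json
# {
#   "mcpServers": {
#     "my-demo": {
#       "command": "python",
#       "args": ["/absolute/path/my_server.py"]
#     }
#   }
# }
# After restarting Claude Desktop, asking Claude to "add two numbers"
# makes it call the add tool internally

Success/failure criteria:

Success: both tools, add and get_greeting, appear in the MCP Inspector list. Calling them returns an integer and a string respectively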
Failure modes:
- print() writing to stdout corrupts the JSON-RPC messages → use print(..., file=sys.stderr)
- Python 3.9 or lower → the SDK requires Python ≥3.10, so upgrade the interpreter
- SDK version < 1.2.0 → upgrade with pip install -U "mcp[cli]" or reinstall

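Silent failure diagnosis

Two tools with ambiguous descriptions ("get_data", "fetch_info") → the LLM is unsure which to call. Clarify the descriptions and compare
Build an infinite tool loop with no max_iter → then cap it with max_iter=10

Code execution

Run LLM-generated code in an E2B sandbox and confirm it is isolated from the host
Demonstrate data analysis with OpenAI Code Interpreter

When results differ from expectations

LLM does not call a tool → strengthen the description, add a "must use tools" prompt
Wrong tool selected → differentiate tool names and descriptions, use top-K tool retrieval
Parameter type errors → strict JSON schema, retry with the error message
Too many tool calls → max_iter limit, record call history in the scratchpad
Code execution security incident → check sandbox isolation, allow only allowlisted libraries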

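10. Five-Line Summary

Tool calling is the standard interface through which an LLM calls external functions and APIs; OpenAI function calling, Anthropic tool use, and MCP are its three main forms.
Tool schema quality (description, parameters, required, enum) largely determines the LLM's call accuracy.
Parallel tool calling cuts latency, MCP standardizes the ecosystem, and Computer Use and Code Execution have emerged as new paradigms.
BFCL score is a model selection criterion for tool-heavy work, and hallucinated tools, parameter errors, permission escalation, and cost runaway are common silent failures.
Sandboxing, confirmation, audit logs, and allowlists are the baseline for operational security, and prompt injection through tool results is a new attack surface that needs defending.

11. Sources

Last modified: 2026-04-27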