Zero dependencies · Node.js builtins only

Deterministic mock LLM server for testing

Real HTTP server. Real SSE streams. WebSocket APIs. Fixture-driven responses. A multi-provider mock for OpenAI, Claude, and Gemini that any process on the machine can reach.

$ npm install @copilotkit/llmock
fixtures/chat.json
{
  "fixtures": [
    {
      "match": {
        "userMessage": "capital of France"
      },
      "response": {
        "content": "The capital of France is Paris."
      }
    }
  ]
}

Stop paying for flaky tests

Tests that hit real LLM APIs — OpenAI, Gemini, Anthropic — cost money, time out, and produce non-deterministic results. llmock replaces those calls with immediate, deterministic responses from a real HTTP server any process on the machine can reach.

Real HTTP Server

Runs on an actual port. Any process on the machine can reach it — Next.js, Mastra, LangGraph, Agno, anything that speaks HTTP.

📡

Authentic SSE Streams

OpenAI, Claude, and Gemini APIs — authentic SSE format for each provider. Streaming and non-streaming modes.

📁

JSON Fixture Files

Define responses as JSON — one file per feature. Load a directory, load a file, or register fixtures programmatically.

🔧

Tool Call Support

Return tool calls with structured arguments. Match on tool names, tool result IDs, or write custom predicates.

💥

Error Injection

Queue one-shot errors — 429 rate limits, 503 outages, whatever. Fires once, then auto-removes itself.

📋

Request Journal

Every request recorded. Inspect messages, verify tool calls, assert on conversation history. HTTP and programmatic access.

🔌

WebSocket APIs

OpenAI Responses, OpenAI Realtime, and Gemini Live over WebSocket. Same fixtures, real RFC 6455 framing, zero dependencies. Text + tool calls.

🎛️

Streaming Physics

Simulate realistic streaming timing with TTFT, TPS, and jitter. Test loading states and streaming UX under real-world conditions.
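The streaming-physics knobs reduce to a simple emission schedule: first chunk after TTFT, then one chunk every 1/TPS seconds, each nudged by jitter. A minimal sketch with illustrative parameter names (not necessarily llmock's actual option names):

```typescript
// Sketch: compute chunk emission times (ms) from streaming-physics knobs.
// ttftMs, tps, and jitterMs are illustrative names, not llmock's API.
function emissionSchedule(
  tokenCount: number,
  ttftMs: number,                     // time to first token
  tps: number,                        // tokens per second after the first
  jitterMs = 0,                       // max random +/- wobble per token
  rand: () => number = Math.random,
): number[] {
  const interval = 1000 / tps;
  const times: number[] = [];
  for (let i = 0; i < tokenCount; i++) {
    const base = ttftMs + i * interval;
    const wobble = jitterMs === 0 ? 0 : (rand() * 2 - 1) * jitterMs;
    times.push(Math.max(0, base + wobble));
  }
  return times;
}
```

With jitter at zero the schedule is fully deterministic, which is what makes loading-state tests repeatable.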

Fixture-driven. Zero boilerplate.

Simple text responses

Match on the last user message — substring or regex. The fixture fires when it matches, streaming SSE chunks just like the real API.

  • First-match-wins routing
  • Substring and RegExp matching
  • Configurable chunk size and latency
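The routing above is plain first-match-wins over the last user message. A minimal sketch of the idea (the fixture shape mirrors the JSON files in this section; the matching logic is illustrative, not llmock's actual source):

```typescript
// Sketch: first-match-wins fixture routing on the last user message.
type Fixture = {
  match: { userMessage: string | RegExp };
  response: { content?: string };
};

function route(fixtures: Fixture[], lastUserMessage: string): Fixture | undefined {
  return fixtures.find(({ match }) =>
    typeof match.userMessage === "string"
      ? lastUserMessage.includes(match.userMessage)   // substring match
      : match.userMessage.test(lastUserMessage),      // RegExp match
  );
}
```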
fixtures/chat.json
{
  "fixtures": [
    {
      "match": { "userMessage": "stock price of AAPL" },
      "response": {
        "content": "The current stock price of Apple Inc. (AAPL) is $150.25."
      }
    },
    {
      "match": { "userMessage": "capital of France" },
      "response": {
        "content": "The capital of France is Paris."
      }
    }
  ]
}
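When a fixture matches, its content goes out as SSE chunks. A rough sketch of OpenAI-style chunk framing (a simplified field subset; the real wire format carries more metadata such as `id` and `model`):

```typescript
// Sketch: wrap a fixture's content in OpenAI-style chat.completion.chunk
// SSE events, ending with the [DONE] sentinel. Simplified field subset.
function toSSE(content: string, chunkSize = 8): string {
  const events: string[] = [];
  for (let i = 0; i < content.length; i += chunkSize) {
    const chunk = {
      choices: [{ index: 0, delta: { content: content.slice(i, i + chunkSize) } }],
    };
    events.push(`data: ${JSON.stringify(chunk)}\n\n`);
  }
  events.push("data: [DONE]\n\n");
  return events.join("");
}
```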
fixtures/tools.json
{
  "fixtures": [
    {
      "match": { "userMessage": "one step with eggs" },
      "response": {
        "toolCalls": [{
          "name": "generate_task_steps",
          "arguments": "{\"steps\":[{\"description\":\"Crack eggs\"},{\"description\":\"Preheat oven\"}]}"
        }]
      }
    },
    {
      "match": { "userMessage": "background color to blue" },
      "response": {
        "toolCalls": [{
          "name": "change_background",
          "arguments": "{\"background\":\"blue\"}"
        }]
      }
    }
  ]
}

Tool call responses

Return structured tool calls that agent frameworks execute directly. Used in production E2E tests for CopilotKit, Mastra, and LangGraph integrations.

  • Tool calls with JSON arguments
  • Match on tool name or tool result ID
  • Multi-tool-call responses
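Note that `arguments` in these fixtures is a JSON *string*, matching the real OpenAI wire format, so consumers parse it before executing the tool. A minimal sketch of the consumer side:

```typescript
// Sketch: an agent framework parsing the stringified `arguments` field
// of a mocked tool call before dispatching to the actual tool.
type ToolCall = { name: string; arguments: string };

function parseToolCall(call: ToolCall): { name: string; args: unknown } {
  return { name: call.name, args: JSON.parse(call.arguments) };
}
```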

Predicate-based routing

When substring matching isn't enough, use predicates. Inspect the full request — system prompt flags, message history, model name, anything.

  • Inspect system prompt state flags
  • Route supervisor agents by conversation state
  • Combine with substring matching (AND logic)
e2e/mock-setup.ts
// Supervisor sees the same user message every time,
// but system prompt contains state flags
mock.addFixture({
  match: {
    predicate: (req) => {
      const sys = req.messages
        .find(m => m.role === "system");
      return sys?.content
        ?.includes("Flights found: false");
    }
  },
  response: {
    toolCalls: [{
      name: "supervisor_response",
      arguments: '{"next_agent":"flights_agent"}'
    }]
  }
});
e2e/global-setup.ts
import { LLMock } from "@copilotkit/llmock";

const mock = new LLMock({ port: 5555 });

// Load JSON fixture files
mock.loadFixtureDir("./fixtures/openai");

// Catch-all for tool results
mock.addFixture({
  match: {
    predicate: (req) =>
      req.messages.at(-1)?.role === "tool"
  },
  response: { content: "Done!" }
});

const url = await mock.start();

// Every process on the machine can reach this
process.env.OPENAI_BASE_URL = `${url}/v1`;
process.env.OPENAI_API_KEY = "mock-key";

E2E global setup

Start the mock server once in Playwright's global setup. All child processes — Next.js, agent workers, CopilotKit runtime — inherit OPENAI_BASE_URL and hit the same server.

  • One server, many processes
  • JSON fixtures loaded from disk
  • Programmatic catch-alls for tool results
  • Universal fallback prevents 404 crashes

WebSocket APIs

Same fixtures work over WebSocket transport. OpenAI Responses, OpenAI Realtime, and Gemini Live — RFC 6455 framing with zero dependencies.

  • OpenAI Responses API over WebSocket
  • OpenAI Realtime API — text + tool calls
  • Gemini Live BidiGenerateContent (unverified — no text-capable model exists yet)
  • No audio/video — text and tool call paths only
OpenAI Realtime over WebSocket
// Connect to ws://localhost:5555/v1/realtime

// → Configure session:
{ "type": "session.update",
  "session": { "modalities": ["text"] } }

// → Add user message:
{ "type": "conversation.item.create",
  "item": { "type": "message",
    "role": "user",
    "content": [{ "type": "input_text",
      "text": "Hello" }] } }

// → Request response:
{ "type": "response.create" }

// ← Server streams back:
// {"type":"response.created", ...}
// {"type":"response.text.delta","delta":"Hi"}
// {"type":"response.text.delta","delta":" there!"}
// {"type":"response.text.done", ...}
// {"type":"response.done", ...}
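For a feel of what "real RFC 6455 framing, zero dependencies" involves, here is a sketch of encoding a server-to-client text frame with Node.js builtins only (7-bit and 16-bit length forms; an illustration, not llmock's actual implementation):

```typescript
// Sketch: encode a server-to-client WebSocket text frame per RFC 6455.
// No masking (servers must not mask), FIN bit set, opcode 0x1 (text).
function encodeTextFrame(text: string): Buffer {
  const payload = Buffer.from(text, "utf8");
  let header: Buffer;
  if (payload.length < 126) {
    header = Buffer.from([0x81, payload.length]);  // FIN|text, 7-bit length
  } else if (payload.length < 65536) {
    header = Buffer.alloc(4);
    header[0] = 0x81;
    header[1] = 126;                               // 16-bit length marker
    header.writeUInt16BE(payload.length, 2);
  } else {
    throw new Error("64-bit payload lengths omitted from this sketch");
  }
  return Buffer.concat([header, payload]);
}
```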

How llmock compares

llmock is purpose-built for LLM API testing. Here's how it stacks up against general-purpose and LLM-specific mocking tools.

// MSW: only intercepts in the process that calls server.listen()
// llmock: real server on a real port — any process can reach it

Playwright test runner
  └─ controls browser

Next.js app (separate process)
  └─ OPENAI_BASE_URL → llmock :5555
                         ├─ Mastra agent workers
                         ├─ LangGraph workers
                         └─ CopilotKit runtime
| Capability | llmock | MSW | VidaiMock | mock-llm | piyook/llm-mock |
| --- | --- | --- | --- | --- | --- |
| Cross-process interception | Real server | In-process only | Yes | Yes (Docker) | Yes |
| Chat Completions SSE | Built-in | Manual | Yes | Yes | No |
| Responses API SSE | Built-in | Manual | No | No | No |
| Claude Messages API | Built-in | Manual | Yes | No | No |
| Gemini streaming | Built-in | Manual | No | No | No |
| WebSocket APIs | Built-in | No | No | No | No |
| Multi-provider support | OpenAI + Claude + Gemini + compatible | Manual | OpenAI + Claude + Gemini + Bedrock | OpenAI only | OpenAI only |
| Embeddings API | Built-in | No | Yes | No | Yes |
| Structured output / JSON mode | Built-in | Manual | No | No | No |
| Sequential / stateful responses | Built-in | Manual | No | No | No |
| Fixture files | JSON | Code-only | Python config | YAML config | JSON templates |
| Programmatic API (test helpers) | Yes (TypeScript/JS) | Yes (TypeScript/JS) | Yes (Python) | No | No |
| Request journal | Yes | Manual | No | No | No |
| Error injection (one-shot) | Yes | Yes | Partial | No | No |
| Docker image | Yes | No | No | Yes | No |
| Helm chart | Yes | No | No | No | No |
| Drift detection | Yes | No | No | No | No |
| Azure OpenAI | Yes | Manual | Yes | No | No |
| AWS Bedrock | Yes (non-streaming) | Manual | Yes | No | No |
| CLI server | Yes | No | No | Yes | Yes |
| GET /v1/models | Yes | No | No | Yes | No |
| Dependencies | Zero | ~300KB | Python + deps | Docker required | Minimal |

Verified against real APIs. Every day.

A mock that doesn't match reality is worse than no mock — your tests pass, but production breaks. llmock runs three-way drift detection that compares SDK types, real API responses, and mock output to catch shape mismatches before you do.

Three pairwise checks: SDK = Real? · SDK = Mock? · Real = Mock?

SDK Types

What TypeScript types say the shape should be

Real API

What OpenAI, Claude, Gemini actually return

llmock

What the mock produces for the same request

Mock doesn't match real

llmock needs updating — test fails immediately. The SDK comparison tells us why it drifted.

Provider changed, SDK is behind

Early warning — the real API has new fields that neither the SDK nor llmock know about yet.

All three agree

No drift — the mock matches reality and the SDK types are current.

$ pnpm test:drift
[critical] LLMOCK DRIFT — field in SDK + real API but missing from mock
Path:    choices[].message.refusal
SDK:     null    Real: null    Mock: <absent>
[critical] TYPE MISMATCH — real API and mock disagree on type
Path:    content[].input
SDK:     object    Real: object    Mock: string
[warning] PROVIDER ADDED FIELD — in real API but not in SDK or mock
Path:    choices[].message.annotations
SDK:     <absent>    Real: array    Mock: <absent>
2 critical (test fails) · 1 warning (logged) · detected before any user reported it
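At its core, drift detection is a field-level diff across the three shapes. A simplified sketch of the idea for flat objects (llmock's real checker presumably walks nested paths; this is only an illustration):

```typescript
// Sketch: flag fields present in both the SDK types and the real API
// response but absent from the mock (critical drift), and fields only
// the real API has (provider-added, warning). Flat objects for brevity.
type Report = { critical: string[]; warning: string[] };

function diffShapes(
  sdk: Record<string, unknown>,
  real: Record<string, unknown>,
  mock: Record<string, unknown>,
): Report {
  const report: Report = { critical: [], warning: [] };
  for (const key of Object.keys(real)) {
    if (key in sdk && !(key in mock)) {
      report.critical.push(key);        // mock drifted behind SDK + real API
    } else if (!(key in sdk) && !(key in mock)) {
      report.warning.push(key);         // provider added a field; SDK is behind
    }
  }
  return report;
}
```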

Claude Code Integration

llmock ships with a Claude Code skill that teaches your AI assistant how to write fixtures correctly — match fields, response types, agent loop patterns, gotchas, and debugging techniques.

🔌

Plugin Install

/plugin marketplace add CopilotKit/llmock
/plugin install llmock@copilotkit-tools

Skill appears as /llmock:write-fixtures

📂

Local Plugin

claude --plugin-dir ./node_modules/@copilotkit/llmock

Same result, no marketplace needed

📁

Add Directory

claude --add-dir ./node_modules/@copilotkit/llmock

Skill appears as /write-fixtures for the session

📋

Copy to Project

cp node_modules/@copilotkit/llmock/.claude/commands/write-fixtures.md .claude/commands/

Permanent /write-fixtures — commit to share with team

Real-World Usage

CopilotKit uses llmock throughout its test suite to verify AI agent behavior across multiple LLM providers without hitting real APIs. The tests cover streaming text, tool calls, and multi-turn conversations on both the v1 and v2 runtimes. See the test suite and fixture files for real-world examples.