Real HTTP server. Real SSE streams. WebSocket APIs. Fixture-driven responses. Multi-provider mock — OpenAI, Claude, Gemini — any process on the machine can reach it.
```sh
npm install @copilotkit/llmock
```
```json
{
  "fixtures": [
    {
      "match": {
        "userMessage": "capital of France"
      },
      "response": {
        "content": "The capital of France is Paris."
      }
    }
  ]
}
```
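With that fixture loaded, a Chat Completions request whose last user message mentions "capital of France" gets back an OpenAI-shaped body along these lines (the `id` and `model` values here are illustrative, not llmock's actual output):

```json
{
  "id": "chatcmpl-mock-1",
  "object": "chat.completion",
  "model": "gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "The capital of France is Paris."
      },
      "finish_reason": "stop"
    }
  ]
}
```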
Tests that hit real LLM APIs — OpenAI, Gemini, Anthropic — cost money, time out, and produce non-deterministic results. llmock replaces those calls with immediate, deterministic responses from a real HTTP server any process on the machine can reach.
Runs on an actual port. Any process on the machine can reach it — Next.js, Mastra, LangGraph, Agno, anything that speaks HTTP.
OpenAI, Claude, and Gemini APIs — authentic SSE format for each provider. Streaming and non-streaming modes.
Define responses as JSON — one file per feature. Load a directory, load a file, or register fixtures programmatically.
Return tool calls with structured arguments. Match on tool names, tool result IDs, or write custom predicates.
Queue one-shot errors — 429 rate limits, 503 outages, whatever. Fires once, then auto-removes itself.
Every request recorded. Inspect messages, verify tool calls, assert on conversation history. HTTP and programmatic access.
OpenAI Responses, OpenAI Realtime, and Gemini Live over WebSocket. Same fixtures, real RFC 6455 framing, zero dependencies. Text + tool calls.
Simulate realistic streaming timing with TTFT, TPS, and jitter. Test loading states and streaming UX under real-world conditions.
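The timing knobs reduce to simple arithmetic. Here is a sketch of how TTFT, TPS, and a jitter fraction could translate into per-chunk delays (illustrative only, not llmock's internals):

```typescript
// Illustrative arithmetic, not llmock's implementation: derive per-chunk
// delays from time-to-first-token (TTFT), tokens per second (TPS), and a
// jitter fraction. `rand` is injectable so tests stay deterministic.
function chunkDelays(
  nChunks: number,
  ttftMs: number,
  tokensPerSec: number,
  jitter = 0,
  rand: () => number = Math.random
): number[] {
  const gapMs = 1000 / tokensPerSec;
  return Array.from({ length: nChunks }, (_, i) => {
    if (i === 0) return ttftMs; // first chunk waits for TTFT
    const wobble = (rand() * 2 - 1) * jitter * gapMs; // ± jitter share of the gap
    return gapMs + wobble;
  });
}
```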
Match on the last user message — substring or regex. The fixture fires when it matches, streaming SSE chunks just like the real API.
```json
{
  "fixtures": [
    {
      "match": { "userMessage": "stock price of AAPL" },
      "response": {
        "content": "The current stock price of Apple Inc. (AAPL) is $150.25."
      }
    },
    {
      "match": { "userMessage": "capital of France" },
      "response": {
        "content": "The capital of France is Paris."
      }
    }
  ]
}
```
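Regex matching can also be expressed through llmock's documented `predicate` hook. A pure helper like this one could back such a predicate (`ChatMessage` is an assumed request shape, not an llmock export):

```typescript
// Sketch: regex matching on the last message, only when it is a user turn.
// `ChatMessage` is an assumed shape for illustration, not llmock's type.
type ChatMessage = { role: string; content?: string };

function lastUserMessageMatches(
  messages: ChatMessage[],
  pattern: RegExp
): boolean {
  const last = messages[messages.length - 1];
  return last?.role === "user" && pattern.test(last.content ?? "");
}

// Usage inside a fixture (assuming `mock` is a running LLMock instance):
// mock.addFixture({
//   match: {
//     predicate: (req) =>
//       lastUserMessageMatches(req.messages, /capital of \w+/i)
//   },
//   response: { content: "The capital of France is Paris." }
// });
```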
```json
{
  "fixtures": [
    {
      "match": { "userMessage": "one step with eggs" },
      "response": {
        "toolCalls": [{
          "name": "generate_task_steps",
          "arguments": "{\"steps\":[{\"description\":\"Crack eggs\"},{\"description\":\"Preheat oven\"}]}"
        }]
      }
    },
    {
      "match": { "userMessage": "background color to blue" },
      "response": {
        "toolCalls": [{
          "name": "change_background",
          "arguments": "{\"background\":\"blue\"}"
        }]
      }
    }
  ]
}
```
Return structured tool calls that agent frameworks execute directly. Used in production E2E tests for CopilotKit, Mastra, and LangGraph integrations.
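Note that `arguments` is a JSON-encoded string, mirroring the OpenAI wire format, so a consuming framework or test decodes it before asserting. A minimal sketch (types and names here are illustrative):

```typescript
// `arguments` arrives as a JSON string, as on the OpenAI wire format,
// so consumers parse it before use. Types below are illustrative.
interface ToolCall {
  name: string;
  arguments: string;
}

function decodeToolCall<T>(call: ToolCall): { name: string; args: T } {
  return { name: call.name, args: JSON.parse(call.arguments) as T };
}
```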
When substring matching isn't enough, use predicates. Inspect the full request — system prompt flags, message history, model name, anything.
```typescript
// Supervisor sees the same user message every time,
// but system prompt contains state flags
mock.addFixture({
  match: {
    predicate: (req) => {
      const sys = req.messages.find(m => m.role === "system");
      return sys?.content?.includes("Flights found: false");
    }
  },
  response: {
    toolCalls: [{
      name: "supervisor_response",
      arguments: '{"next_agent":"flights_agent"}'
    }]
  }
});
```
```typescript
import { LLMock } from "@copilotkit/llmock";

const mock = new LLMock({ port: 5555 });

// Load JSON fixture files
mock.loadFixtureDir("./fixtures/openai");

// Catch-all for tool results
mock.addFixture({
  match: {
    predicate: (req) => req.messages.at(-1)?.role === "tool"
  },
  response: { content: "Done!" }
});

const url = await mock.start();

// Every process on the machine can reach this
process.env.OPENAI_BASE_URL = `${url}/v1`;
process.env.OPENAI_API_KEY = "mock-key";
```
Start the mock server once in Playwright's global setup. All child processes — Next.js, agent workers, CopilotKit runtime — inherit `OPENAI_BASE_URL` and hit the same server.
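Wired into Playwright, that amounts to a config fragment along these lines (the file path is an assumption for illustration):

```typescript
// playwright.config.ts (sketch; the setup path is hypothetical):
// point Playwright at a global-setup module that starts the mock once.
const config = {
  globalSetup: "./tests/global-setup.ts",
};

export default config;
```

The body of `./tests/global-setup.ts` is essentially the snippet above: construct `LLMock`, load fixtures, `await mock.start()`, and export `OPENAI_BASE_URL` and `OPENAI_API_KEY`.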
Same fixtures work over WebSocket transport. OpenAI Responses, OpenAI Realtime, and Gemini Live — RFC 6455 framing with zero dependencies.
```jsonc
// Connect to ws://localhost:5555/v1/realtime

// → Configure session:
{ "type": "session.update",
  "session": { "modalities": ["text"] } }

// → Add user message:
{ "type": "conversation.item.create",
  "item": { "type": "message",
    "role": "user",
    "content": [{ "type": "input_text", "text": "Hello" }] } }

// → Request response:
{ "type": "response.create" }

// ← Server streams back:
// {"type":"response.created", ...}
// {"type":"response.text.delta","delta":"Hi"}
// {"type":"response.text.delta","delta":" there!"}
// {"type":"response.text.done", ...}
// {"type":"response.done", ...}
```
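The client messages in that transcript are plain JSON, so a test can build them with small helpers. A sketch mirroring the event shapes above (these helpers are illustrative, not an llmock API):

```typescript
// Builders for the three client events shown in the transcript.
// Shapes mirror OpenAI's Realtime events; helper names are illustrative.
function sessionUpdate(modalities: string[]) {
  return { type: "session.update", session: { modalities } };
}

function userMessage(text: string) {
  return {
    type: "conversation.item.create",
    item: {
      type: "message",
      role: "user",
      content: [{ type: "input_text", text }]
    }
  };
}

function responseCreate() {
  return { type: "response.create" };
}
```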
llmock is purpose-built for LLM API testing. Here's how it stacks up against general-purpose and LLM-specific mocking tools.
| Capability | llmock | MSW | VidaiMock | mock-llm | piyook/llm-mock |
|---|---|---|---|---|---|
| Cross-process interception | Real server ✓ | In-process only | Yes | Yes (Docker) | Yes |
| Chat Completions SSE | Built-in ✓ | Manual | Yes | Yes | No |
| Responses API SSE | Built-in ✓ | Manual | No | No | No |
| Claude Messages API | Built-in ✓ | Manual | Yes | No | No |
| Gemini streaming | Built-in ✓ | Manual | No | No | No |
| WebSocket APIs | Built-in ✓ | No | No | No | No |
| Multi-provider support | OpenAI + Claude + Gemini + compatible ✓ | Manual | OpenAI + Claude + Gemini + Bedrock | OpenAI only | OpenAI only |
| Embeddings API | Built-in ✓ | No | Yes | No | Yes |
| Structured output / JSON mode | Built-in ✓ | Manual | No | No | No |
| Sequential / stateful responses | Built-in ✓ | Manual | No | No | No |
| Fixture files | JSON ✓ | Code-only | Python config | YAML config | JSON templates |
| Programmatic API (test helpers) | Yes (TypeScript/JS) ✓ | Yes (TypeScript/JS) | Yes (Python) | No | No |
| Request journal | Yes ✓ | Manual | No | No | No |
| Error injection (one-shot) | Yes ✓ | Yes | Partial | No | No |
| Docker image | Yes ✓ | No | No | Yes | No |
| Helm chart | Yes ✓ | No | No | No | No |
| Drift detection | Yes ✓ | No | No | No | No |
| Azure OpenAI | Yes ✓ | Manual | Yes | No | No |
| AWS Bedrock | Yes (non-streaming) ✓ | Manual | Yes | No | No |
| CLI server | Yes ✓ | No | No | Yes | Yes |
| GET /v1/models | Yes ✓ | No | No | Yes | No |
| Dependencies | Zero | ~300KB | Python + deps | Docker required | Minimal |
A mock that doesn't match reality is worse than no mock — your tests pass, but production breaks. llmock runs three-way drift detection that compares SDK types, real API responses, and mock output to catch shape mismatches before you do.
- What TypeScript types say the shape should be
- What OpenAI, Claude, and Gemini actually return
- What the mock produces for the same request
- Mock drift: llmock needs updating — the test fails immediately, and the SDK comparison tells us why it drifted.
- Early warning: the real API has new fields that neither the SDK nor llmock knows about yet.
- No drift: the mock matches reality and the SDK types are current.
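The decision between those outcomes amounts to comparing field sets from the three sources. An illustrative reduction (not llmock's actual implementation):

```typescript
// Illustration of the three-way idea, not llmock's implementation:
// compare top-level key sets from the SDK types, a real API sample,
// and the mock's output for the same request.
function driftReport(
  sdkKeys: string[],
  realKeys: string[],
  mockKeys: string[]
): "ok" | "mock-drift" | "early-warning" {
  // Fields the real API returns that the mock does not produce.
  const missingFromMock = realKeys.filter((k) => !mockKeys.includes(k));
  if (missingFromMock.length === 0) return "ok";
  // If the SDK doesn't know those fields either, the API moved first.
  return missingFromMock.every((k) => !sdkKeys.includes(k))
    ? "early-warning"
    : "mock-drift";
}
```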
llmock ships with a Claude Code skill that teaches your AI assistant how to write fixtures correctly — match fields, response types, agent loop patterns, gotchas, and debugging techniques.
- `/plugin marketplace add CopilotKit/llmock`, then `/plugin install llmock@copilotkit-tools`: the skill appears as `/llmock:write-fixtures`.
- `claude --plugin-dir ./node_modules/@copilotkit/llmock`: same result, no marketplace needed.
- `claude --add-dir ./node_modules/@copilotkit/llmock`: the skill appears as `/write-fixtures` for the session.
- `cp node_modules/@copilotkit/llmock/.claude/commands/write-fixtures.md .claude/commands/`: permanent `/write-fixtures` — commit to share with the team.
CopilotKit uses llmock across its test suite to verify AI agent behavior across multiple LLM providers without hitting real APIs. The tests cover streaming text, tool calls, and multi-turn conversations across both v1 and v2 runtimes. See the test suite and fixture files for real-world examples.