Ollama

llmock implements Ollama's native /api/chat, /api/generate, and /api/tags endpoints with NDJSON streaming, matching Ollama's wire format including its key differences from OpenAI.

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/chat | Chat completions (multi-turn, tool calls) |
| POST | /api/generate | Single-prompt text generation (no tool calls) |
| GET | /api/tags | List available models (derived from fixtures) |

Key Differences from OpenAI

- Streaming uses newline-delimited JSON (NDJSON) rather than SSE data: events, and there is no [DONE] sentinel — the final chunk sets "done": true with a done_reason.
- Tool call arguments are sent as a parsed JSON object, not a JSON-encoded string.
- Sampling options are nested under an options object (options.temperature, options.num_predict) instead of being top-level fields.
- Responses carry Ollama's timing and eval fields (total_duration, eval_count, and so on) instead of OpenAI's usage object.

Quick Start

ollama-quick-start.ts ts
import { LLMock } from "@copilotkit/llmock";

const mock = new LLMock();
mock.onMessage("hello", { content: "Hi from Ollama!" });
await mock.start();

// Point the Ollama SDK (or plain fetch) at llmock
const res = await fetch(`${mock.url}/api/chat`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3",
    messages: [{ role: "user", content: "hello" }],
    stream: false,
  }),
});
const data = await res.json();
console.log(data.message.content); // "Hi from Ollama!"

Streaming Response Format (NDJSON)

When stream is true (the default), the response body is newline-delimited JSON: each line is a complete JSON object, with the final line setting done to true:

/api/chat streaming output ndjson
{"model":"llama3","message":{"role":"assistant","content":"Hi"},"done":false}
{"model":"llama3","message":{"role":"assistant","content":" there"},"done":false}
{"model":"llama3","message":{"role":"assistant","content":""},"done":true,"done_reason":"stop","total_duration":0,"load_duration":0,"prompt_eval_count":0,"prompt_eval_duration":0,"eval_count":0,"eval_duration":0}

Non-Streaming Response

/api/chat non-streaming output json
{
  "model": "llama3",
  "message": { "role": "assistant", "content": "Hi there!" },
  "done": true,
  "done_reason": "stop",
  "total_duration": 0,
  "load_duration": 0,
  "prompt_eval_count": 0,
  "prompt_eval_duration": 0,
  "eval_count": 0,
  "eval_duration": 0
}

Tool Calls

Ollama sends tool-call arguments as a parsed JSON object, not the JSON-encoded string OpenAI uses. llmock automatically parses each fixture's arguments string into an object when emitting the Ollama wire format.

ollama-tool-call-fixture.json json
{
  "fixtures": [
    {
      "match": { "userMessage": "weather" },
      "response": {
        "toolCalls": [
          { "name": "get_weather", "arguments": "{\"city\":\"NYC\"}" }
        ]
      }
    }
  ]
}
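The string-to-object conversion described above amounts to a JSON.parse on the fixture's arguments field. A sketch of that step (types and helper name are illustrative, not llmock's actual internals):

```typescript
// Fixtures store tool-call arguments as a JSON string; Ollama's wire
// format expects them already parsed into an object.
interface FixtureToolCall {
  name: string;
  arguments: string; // JSON-encoded, e.g. '{"city":"NYC"}'
}

interface OllamaToolCall {
  function: { name: string; arguments: Record<string, unknown> };
}

function toOllamaToolCall(tc: FixtureToolCall): OllamaToolCall {
  // Parse the fixture's JSON string so the wire format carries an object.
  return { function: { name: tc.name, arguments: JSON.parse(tc.arguments) } };
}
```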

The Ollama streaming response wraps tool calls in a single chunk:

Tool call NDJSON output ndjson
{"model":"llama3","message":{"role":"assistant","content":"","tool_calls":[{"function":{"name":"get_weather","arguments":{"city":"NYC"}}}]},"done":false}
{"model":"llama3","message":{"role":"assistant","content":""},"done":true,"done_reason":"stop","total_duration":0,"load_duration":0,"prompt_eval_count":0,"prompt_eval_duration":0,"eval_count":0,"eval_duration":0}

/api/generate Endpoint

The /api/generate endpoint takes a prompt string instead of a messages array. The prompt is internally converted to a single user message for fixture matching. Only text responses are supported (no tool calls).

/api/generate request json
{
  "model": "llama3",
  "prompt": "Tell me a joke",
  "stream": false
}

/api/generate response json
{
  "model": "llama3",
  "created_at": "2025-01-01T00:00:00.000Z",
  "response": "Why did the chicken cross the road?",
  "done": true,
  "done_reason": "stop",
  "total_duration": 0,
  "load_duration": 0,
  "prompt_eval_count": 0,
  "prompt_eval_duration": 0,
  "eval_count": 0,
  "eval_duration": 0,
  "context": []
}
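The prompt-to-message conversion described above wraps the prompt string in a single-element messages array. A sketch under assumed type names (not llmock's actual internals):

```typescript
// /api/generate has no messages array; llmock matches fixtures against
// the prompt as if it were a single user message.
interface GenerateRequest {
  model: string;
  prompt: string;
  stream?: boolean;
}

interface ChatMessage {
  role: "user";
  content: string;
}

function promptToMessages(req: GenerateRequest): ChatMessage[] {
  return [{ role: "user", content: req.prompt }];
}
```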

/api/tags Endpoint

GET /api/tags returns a list of available models, derived from the model fields across all loaded fixtures. This lets Ollama clients discover which models the mock server supports.
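Deriving the model list can be sketched as deduplicating the model fields across fixtures (the fixture shape and helper name here are assumptions, not llmock's actual code):

```typescript
// Collect unique model names from loaded fixtures; fixtures without a
// model field are skipped. The response shape mirrors Ollama's /api/tags.
interface Fixture {
  model?: string;
}

function tagsFromFixtures(fixtures: Fixture[]): { models: { name: string }[] } {
  const names = new Set<string>();
  for (const f of fixtures) {
    if (f.model) names.add(f.model); // dedupe across fixtures
  }
  return { models: [...names].map((name) => ({ name })) };
}
```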

Request Translation

llmock internally translates Ollama requests to a unified ChatCompletionRequest format for fixture matching. The ollamaToCompletionRequest() function maps Ollama's options.temperature to temperature and options.num_predict to max_tokens, so the same fixtures work across all providers.
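The option mapping described above can be sketched as follows (the request types are simplified stand-ins for llmock's real ones):

```typescript
interface OllamaChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  options?: { temperature?: number; num_predict?: number };
}

interface ChatCompletionRequest {
  model: string;
  messages: { role: string; content: string }[];
  temperature?: number;
  max_tokens?: number;
}

function ollamaToCompletionRequest(req: OllamaChatRequest): ChatCompletionRequest {
  return {
    model: req.model,
    messages: req.messages,
    temperature: req.options?.temperature, // options.temperature -> temperature
    max_tokens: req.options?.num_predict,  // options.num_predict -> max_tokens
  };
}
```

Because matching happens on this unified shape, a fixture written once works whether the request arrived via the OpenAI or the Ollama endpoint.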