Ollama

llmock implements Ollama's native /api/chat, /api/generate, and /api/tags endpoints with NDJSON streaming, matching Ollama's wire format including its key differences from OpenAI.

Endpoints

| Method | Path | Description |
| --- | --- | --- |
| POST | /api/chat | Chat completions (multi-turn, tool calls) |
| POST | /api/generate | Single-prompt text generation (no tool calls) |
| GET | /api/tags | List available models (derived from fixtures) |

Key Differences from OpenAI

- Streaming uses newline-delimited JSON (NDJSON) rather than SSE data: events, and there is no [DONE] sentinel — the final chunk sets "done": true with a done_reason.
- Tool call arguments are sent as a parsed JSON object, not a JSON-encoded string.
- Sampling options are nested under an options object (options.temperature, options.num_predict) instead of being top-level fields.
- Responses carry Ollama's timing and eval fields (total_duration, eval_count, and so on) instead of OpenAI's usage object.

Quick Start

ollama-quick-start.ts ts
import { LLMock } from "@copilotkit/llmock";

const mock = new LLMock();
mock.onMessage("hello", { content: "Hi from Ollama!" });
await mock.start();

// Point the Ollama SDK (or plain fetch) at llmock
const res = await fetch(`${mock.url}/api/chat`, {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({
    model: "llama3",
    messages: [{ role: "user", content: "hello" }],
    stream: false,
  }),
});
const data = await res.json();
console.log(data.message.content); // "Hi from Ollama!"

Streaming Response Format (NDJSON)

When stream is true (the default), the response body is newline-delimited JSON: each line is a complete JSON object, with the final line setting done to true:

/api/chat streaming output ndjson
{"model":"llama3","message":{"role":"assistant","content":"Hi"},"done":false}
{"model":"llama3","message":{"role":"assistant","content":" there"},"done":false}
{"model":"llama3","message":{"role":"assistant","content":""},"done":true,"done_reason":"stop","total_duration":0,"load_duration":0,"prompt_eval_count":0,"prompt_eval_duration":0,"eval_count":0,"eval_duration":0}

Non-Streaming Response

/api/chat non-streaming output json
{
  "model": "llama3",
  "message": { "role": "assistant", "content": "Hi there!" },
  "done": true,
  "done_reason": "stop",
  "total_duration": 0,
  "load_duration": 0,
  "prompt_eval_count": 0,
  "prompt_eval_duration": 0,
  "eval_count": 0,
  "eval_duration": 0
}

Tool Calls

Ollama sends tool-call arguments as a parsed JSON object, not the JSON-encoded string OpenAI uses. llmock automatically parses each fixture's arguments string into an object when emitting the Ollama wire format.

ollama-tool-call-fixture.json json
{
  "fixtures": [
    {
      "match": { "userMessage": "weather" },
      "response": {
        "toolCalls": [
          { "name": "get_weather", "arguments": "{\"city\":\"NYC\"}" }
        ]
      }
    }
  ]
}
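The string-to-object conversion described above amounts to a JSON.parse on the fixture's arguments field. A sketch of that step (types and helper name are illustrative, not llmock's actual internals):

```typescript
// Fixtures store tool-call arguments as a JSON string; Ollama's wire
// format expects them already parsed into an object.
interface FixtureToolCall {
  name: string;
  arguments: string; // JSON-encoded, e.g. '{"city":"NYC"}'
}

interface OllamaToolCall {
  function: { name: string; arguments: Record<string, unknown> };
}

function toOllamaToolCall(tc: FixtureToolCall): OllamaToolCall {
  // Parse the fixture's JSON string so the wire format carries an object.
  return { function: { name: tc.name, arguments: JSON.parse(tc.arguments) } };
}
```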

The Ollama streaming response wraps tool calls in a single chunk:

Tool call NDJSON output ndjson
{"model":"llama3","message":{"role":"assistant","content":"","tool_calls":[{"function":{"name":"get_weather","arguments":{"city":"NYC"}}}]},"done":false}
{"model":"llama3","message":{"role":"assistant","content":""},"done":true,"done_reason":"stop","total_duration":0,"load_duration":0,"prompt_eval_count":0,"prompt_eval_duration":0,"eval_count":0,"eval_duration":0}

/api/generate Endpoint

The /api/generate endpoint takes a prompt string instead of a messages array. The prompt is internally converted to a single user message for fixture matching. Only text responses are supported (no tool calls).

/api/generate request json
{
  "model": "llama3",
  "prompt": "Tell me a joke",
  "stream": false
}

/api/generate response json
{
  "model": "llama3",
  "created_at": "2025-01-01T00:00:00.000Z",
  "response": "Why did the chicken cross the road?",
  "done": true,
  "done_reason": "stop",
  "total_duration": 0,
  "load_duration": 0,
  "prompt_eval_count": 0,
  "prompt_eval_duration": 0,
  "eval_count": 0,
  "eval_duration": 0,
  "context": []
}
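The prompt-to-message conversion described above wraps the prompt string in a single-element messages array. A sketch under assumed type names (not llmock's actual internals):

```typescript
// /api/generate has no messages array; llmock matches fixtures against
// the prompt as if it were a single user message.
interface GenerateRequest {
  model: string;
  prompt: string;
  stream?: boolean;
}

interface ChatMessage {
  role: "user";
  content: string;
}

function promptToMessages(req: GenerateRequest): ChatMessage[] {
  return [{ role: "user", content: req.prompt }];
}
```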

/api/tags Endpoint

GET /api/tags returns a list of available models, derived from the model fields across all loaded fixtures. This lets Ollama clients discover which models the mock server supports.
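Deriving the model list can be sketched as deduplicating the model fields across fixtures (the fixture shape and helper name here are assumptions, not llmock's actual code):

```typescript
// Collect unique model names from loaded fixtures; fixtures without a
// model field are skipped. The response shape mirrors Ollama's /api/tags.
interface Fixture {
  model?: string;
}

function tagsFromFixtures(fixtures: Fixture[]): { models: { name: string }[] } {
  const names = new Set<string>();
  for (const f of fixtures) {
    if (f.model) names.add(f.model); // dedupe across fixtures
  }
  return { models: [...names].map((name) => ({ name })) };
}
```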

Request Translation

llmock internally translates Ollama requests to a unified ChatCompletionRequest format for fixture matching. The ollamaToCompletionRequest() function maps Ollama's options.temperature to temperature and options.num_predict to max_tokens, so the same fixtures work across all providers.
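The option mapping described above can be sketched as follows (the request types are simplified stand-ins for llmock's real ones):

```typescript
interface OllamaChatRequest {
  model: string;
  messages: { role: string; content: string }[];
  options?: { temperature?: number; num_predict?: number };
}

interface ChatCompletionRequest {
  model: string;
  messages: { role: string; content: string }[];
  temperature?: number;
  max_tokens?: number;
}

function ollamaToCompletionRequest(req: OllamaChatRequest): ChatCompletionRequest {
  return {
    model: req.model,
    messages: req.messages,
    temperature: req.options?.temperature, // options.temperature -> temperature
    max_tokens: req.options?.num_predict,  // options.num_predict -> max_tokens
  };
}
```

Because matching happens on this unified shape, a fixture written once works whether the request arrived via the OpenAI or the Ollama endpoint.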