Gateway System

ExtendedLM Gateway is a Rust-based unified LLM gateway with an OpenAI-compatible API that supports multiple providers and local inference.

Gateway Overview
What is Gateway?

Gateway provides a single API endpoint for all LLM providers (OpenAI, Anthropic, Google, xAI, Ollama) plus local model inference via llama.cpp. It handles request routing, response normalization, and caching automatically.
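
A minimal client-side sketch of that single endpoint, assuming the Gateway is running locally on its default port 8080 (see Gateway Configuration below); the only provider-specific part of the request is the prefix on the model key:

// Sketch: one request shape for every provider (the base URL is an assumption).
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'anthropic:claude-sonnet-4', // swap the prefix to switch providers
    messages: [{ role: 'user', content: 'Hello!' }]
  })
})
console.log((await res.json()).choices[0].message.content)

Switching to a local model only changes the model key (for example lmbridge:phi-4-mini-instruct); the request and response shapes stay the same.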

Key Features

  • Unified API: OpenAI-compatible Responses API for all providers
  • Local Inference: Built-in llama.cpp with GPU acceleration
  • Streaming: Server-Sent Events (SSE) support
  • Structured Output: JSON Schema validation
  • Tool Calling: Function calling across providers
  • Vision Support: Image input for multimodal models
  • Realtime API: WebSocket-based realtime communication
  • Request Logging: SQLite-based analytics

Supported Providers

OpenAI

GPT-4o, GPT-5 series

Anthropic

Claude Opus 4, Sonnet 4

Google

Gemini 2.5 Flash/Pro

xAI

Grok-4 Fast

Ollama

Local models (Phi4, Qwen3, etc.)

llama.cpp

GGUF models with GPU

Architecture

Gateway Architecture

Components

  • Rust Core: High-performance HTTP server
  • llama.cpp: Statically linked for local inference
  • Provider Clients: API clients for each LLM provider
  • Request Router: Routes requests to the appropriate provider
  • Response Normalizer: Unifies response formats
  • SQLite Logger: Request/response analytics

Request Flow

  1. Client sends OpenAI-format request to Gateway
  2. Gateway parses the model key (e.g., "openai:gpt-4o"; see the sketch after this list)
  3. Routes to appropriate provider or llama.cpp
  4. Normalizes response to OpenAI format
  5. Streams back to client via SSE
  6. Logs request/response to SQLite
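
Step 2 above hinges on the "provider:model" prefix. The real router is part of the Rust core; the sketch below is purely illustrative TypeScript, with hypothetical names, to show the shape of the decision:

// Illustrative only: splitting a model key into a routing decision.
function routeModelKey(key: string): { provider: string; model: string } {
  const idx = key.indexOf(':')
  if (idx === -1) throw new Error(`expected "provider:model", got "${key}"`)
  return { provider: key.slice(0, idx), model: key.slice(idx + 1) }
}

routeModelKey('openai:gpt-4o')                // handled by the OpenAI client
routeModelKey('lmbridge:phi-4-mini-instruct') // handled by the built-in llama.cpp backend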

Directory Structure

apps/Gateway/
├── src/
│   ├── main.rs              # Entry point
│   ├── providers/           # Provider implementations
│   │   ├── openai.rs
│   │   ├── anthropic.rs
│   │   ├── google.rs
│   │   ├── xai.rs
│   │   └── ollama.rs
│   ├── llama/               # llama.cpp integration
│   ├── responses_api.rs     # Responses API handler
│   └── realtime_api.rs      # Realtime WebSocket API
├── config/
│   └── lmbridge.env          # Configuration
├── Cargo.toml               # Rust dependencies
└── build.rs                 # Build script (llama.cpp)

OpenAI Provider

Supported Models

  • GPT-5: gpt-5, gpt-5-mini, gpt-5-nano
  • GPT-4o: gpt-4o, gpt-4o-mini
  • O-series: o1, o3-mini

Features

  • Tools (function calling)
  • Vision (image input)
  • Streaming
  • Structured outputs (JSON mode)
  • Response format control

Configuration

# lmbridge.env
OPENAI_API_KEY=sk-proj-...

Model Key Format

openai:gpt-4o
openai:gpt-5-mini
openai:o3-mini

Example Request

{
  "model": "openai:gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": true
}

Anthropic (Claude) Provider

Supported Models

  • Claude Opus 4: claude-opus-4, claude-opus-4.1
  • Claude Sonnet 4: claude-sonnet-4
  • Claude 3.7: claude-3-7-sonnet-20250219
  • Claude 3.5: claude-3-5-sonnet-20241022, claude-3-5-haiku-20241022

Features

  • Tools (function calling)
  • Vision (image input)
  • Streaming
  • Extended thinking (thinking blocks)
  • System prompts with caching

Configuration

# lmbridge.env
ANTHROPIC_API_KEY=sk-ant-...

Model Key Format

anthropic:claude-opus-4
anthropic:claude-sonnet-4
anthropic:claude-3-7-sonnet-20250219

Extended Thinking

Claude models support extended thinking mode for complex reasoning:

{
  "model": "anthropic:claude-opus-4",
  "messages": [...],
  "thinking": {
    "type": "enabled",
    "budget_tokens": 10000
  }
}
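
Sent from a client, the same request looks like the sketch below. The thinking object is taken verbatim from the example above; the base URL and how thinking content surfaces in the normalized response are assumptions:

// Sketch: enabling extended thinking on a Claude request routed through the Gateway.
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'anthropic:claude-opus-4',
    messages: [{ role: 'user', content: 'Walk through the trade-offs step by step.' }],
    thinking: { type: 'enabled', budget_tokens: 10000 }
  })
})
console.log((await res.json()).choices[0].message.content)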

Google (Gemini) Provider

Supported Models

  • Gemini 2.5: gemini-2.5-flash, gemini-2.5-pro
  • Gemini 2.0: gemini-2.0-flash-exp
  • Gemini 1.5: gemini-1.5-pro, gemini-1.5-flash

Features

  • Tools (function calling)
  • Vision (image/video input)
  • Streaming
  • Structured outputs (JSON mode)
  • Large context windows (up to 2M tokens)

Configuration

# lmbridge.env
GOOGLE_API_KEY=AIza...

Model Key Format

google:gemini-2.5-flash
google:gemini-2.5-pro

Structured Output

{
  "model": "google:gemini-2.5-flash",
  "messages": [...],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "user_data",
      "schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "age": { "type": "number" }
        }
      }
    }
  }
}

xAI (Grok) Provider

Supported Models

  • Grok-4 Fast: grok-4-fast, grok-4-fast-no-reasoning

Features

  • High-speed inference
  • Streaming
  • Reasoning/no-reasoning modes

Configuration

# lmbridge.env
XAI_API_KEY=xai-...

Model Key Format

xai:grok-4-fast
xai:grok-4-fast-no-reasoning

Ollama Provider

Supported Models

Any model available in your Ollama instance:

  • Phi4: phi4
  • Qwen3: qwen3-30b, qwen3-32b, qwen3-235b
  • Gemma3: gemma3-27b
  • Qwen2.5-coder: qwen2.5-coder-32b
  • Deepseek-r1: deepseek-r1-671b
  • Llama4: llama4-128x17b
  • Many more...

Features

  • Self-hosted models
  • Full privacy
  • Customizable
  • Streaming
  • Vision models (llava, etc.)

Configuration

# lmbridge.env
OLLAMA_BASE_URL=http://localhost:11434

Model Key Format

ollama:phi4
ollama:qwen3-30b
ollama:deepseek-r1-671b

Setup Ollama

# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh

# Pull models
ollama pull phi4
ollama pull qwen3-30b

# Start server
ollama serve
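
A pulled model is then addressable through the Gateway like any other provider. A minimal sketch, assuming the Gateway is running on its default port 8080 and phi4 has been pulled as above:

// Sketch: calling a local Ollama model through the unified endpoint.
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'ollama:phi4',
    messages: [{ role: 'user', content: 'Summarize the benefits of local inference.' }]
  })
})
console.log((await res.json()).choices[0].message.content)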

Gateway llama.cpp Provider

Overview

Gateway includes built-in llama.cpp for local GGUF model inference with GPU acceleration.

Featured Models

  • Gemma-3-270m-it-Q4_K_M: Test model (270M params)
  • Gemma-3-4b-it-q4_0: Vision model (4B params)
  • Phi-4-mini-instruct-Q4_K_M: Instruction-tuned (3.8B params)
  • Qwen3-14B-Q8_0: Large context (14B params)
  • PLaMo-2-translate-Q4_0: EN/JP translation

GPU Acceleration

Gateway automatically detects CUDA and offloads layers to GPU:

# lmbridge.env
LLAMA_GPU_LAYERS=-1  # -1 = all layers to GPU
LLAMA_THREADS=8      # CPU threads for non-GPU parts

Configuration

# lmbridge.env
LLAMA_MODEL_DIR=/path/to/models
LLAMA_GPU_LAYERS=-1
LLAMA_THREADS=8
LLAMA_TOP_K=40
LLAMA_TOP_P=0.9
LLAMA_STRUCTURED_MODE=auto  # auto, strict, prompt, off
LLAMA_STRUCTURED_PLAN=auto  # auto, system, off

Model Key Format

lmbridge:phi-4-mini-instruct
lmbridge:qwen3-14b
lmbridge:plamo-2-translate

Vision Support

For vision models, place the mmproj file alongside the model:

models/llama/
├── Gemma-3-4b-it-q4_0.gguf
└── Gemma-3-4b-it-mmproj-q4_0.gguf

Vision Request Example

{
  "model": "lmbridge:gemma-3-4b-it",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What's in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,..."
          }
        }
      ]
    }
  ]
}
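
A hedged sketch of building that request from a local file in Node.js; the file name is illustrative and the base URL is an assumption:

// Sketch: embedding a local JPEG as a base64 data URL for a vision request.
import { readFileSync } from 'node:fs'

const imageB64 = readFileSync('photo.jpg').toString('base64')

const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'lmbridge:gemma-3-4b-it',
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: "What's in this image?" },
        { type: 'image_url', image_url: { url: `data:image/jpeg;base64,${imageB64}` } }
      ]
    }]
  })
})
console.log((await res.json()).choices[0].message.content)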

Responses API

Endpoint: POST /v1/responses

OpenAI-compatible text generation API.

Request Format

{
  "model": "openai:gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": true,
  "temperature": 0.7,
  "max_tokens": 1000,
  "tools": [...],
  "tool_choice": "auto"
}

Response Format (Non-Streaming)

{
  "id": "resp_123",
  "object": "response",
  "created": 1234567890,
  "model": "openai:gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}

Streaming Response (SSE)

When stream: true, Gateway returns Server-Sent Events:

data: {"type":"response.start","response":{"id":"resp_123"}}

data: {"type":"content.delta","delta":{"text":"Hello"}}

data: {"type":"content.delta","delta":{"text":"!"}}

data: {"type":"response.done","response":{...}}

Tool Calling

{
  "model": "openai:gpt-4o",
  "messages": [...],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": { "type": "string" }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

Response with tool call:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\":\"Tokyo\"}"
            }
          }
        ]
      }
    }
  ]
}

Realtime API (WebSocket)

Endpoint: ws://localhost:8080/v1/realtime

WebSocket-based realtime communication with LLMs.

Connection

const ws = new WebSocket('ws://localhost:8080/v1/realtime',
  ['openai-realtime-v1'])

ws.onopen = () => {
  console.log('Connected')
}

ws.onmessage = (event) => {
  const data = JSON.parse(event.data)
  console.log('Received:', data)
}

Send Message

ws.send(JSON.stringify({
  type: 'response.create',
  response: {
    modalities: ['text'],
    instructions: 'You are a helpful assistant.',
    input: [
      {
        type: 'message',
        role: 'user',
        content: [
          {
            type: 'input_text',
            text: 'Hello!'
          }
        ]
      }
    ]
  }
}))

Receive Events

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data)

  switch (msg.type) {
    case 'response.text.delta':
      console.log('Text:', msg.delta)
      break
    case 'response.done':
      console.log('Done')
      break
  }
}

Service Control API

Get Status

Endpoint: GET /v1/service/status

{
  "status": "ok",
  "cache": {
    "enabled": true,
    "entries": 42
  },
  "uptime_seconds": 3600
}

Stop Current Inference

Endpoint: POST /v1/service/stop

Cancels the currently running inference.

Shutdown Gateway

Endpoint: POST /v1/service/shutdown

Gracefully shuts down the Gateway server.
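
A combined sketch of the three service-control calls, assuming the default host and port from the configuration section below:

// Sketch: query status, cancel an in-flight inference, then shut down.
const base = 'http://localhost:8080'

const status = await (await fetch(`${base}/v1/service/status`)).json()
console.log(status.status, status.cache.entries, status.uptime_seconds)

await fetch(`${base}/v1/service/stop`, { method: 'POST' })      // cancel current inference
await fetch(`${base}/v1/service/shutdown`, { method: 'POST' })  // graceful shutdown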

Model Management (llama.cpp)

List Loaded Models

Endpoint: GET /v1/llama/models

{
  "models": [
    {
      "name": "phi-4-mini-instruct",
      "path": "/models/Phi-4-mini-instruct-Q4_K_M.gguf",
      "loaded": true,
      "gpu_layers": 40
    }
  ]
}

Preload Model

Endpoint: POST /v1/llama/models/load

{
  "model": "phi-4-mini-instruct",
  "gpu_layers": -1
}

Unload Model

Endpoint: POST /v1/llama/models/unload

{
  "model": "phi-4-mini-instruct"
}
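
A sketch tying the three model-management endpoints together; the base URL is an assumption and the model name matches the listing example above:

// Sketch: preload a GGUF model onto the GPU, list loaded models, then unload it.
const base = 'http://localhost:8080'

await fetch(`${base}/v1/llama/models/load`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'phi-4-mini-instruct', gpu_layers: -1 })
})

const { models } = await (await fetch(`${base}/v1/llama/models`)).json()
console.log(models.filter((m: any) => m.loaded).map((m: any) => m.name))

await fetch(`${base}/v1/llama/models/unload`, {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'phi-4-mini-instruct' })
})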

GPU Acceleration

CUDA Support

Gateway automatically detects CUDA and uses GPU for llama.cpp inference.

Configuration

# lmbridge.env
LLAMA_GPU_LAYERS=-1  # -1 = all layers, 0 = CPU only, N = N layers to GPU

Check GPU Usage

# Terminal 1: Start Gateway
cargo run --release --bin Gateway

# Terminal 2: Monitor GPU
watch -n 1 nvidia-smi

Performance

GPU acceleration provides a significant speedup:

  • CPU only: ~10-20 tokens/sec (14B model)
  • GPU (RTX 3090): ~80-100 tokens/sec (14B model)
  • GPU (RTX 4090): ~120-150 tokens/sec (14B model)

Structured Output

JSON Mode

Forces the model to output valid JSON:

{
  "model": "openai:gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": "Extract name and age from: John is 30 years old"
    }
  ],
  "response_format": {
    "type": "json_object"
  }
}

JSON Schema (Google/OpenAI)

{
  "model": "google:gemini-2.5-flash",
  "messages": [...],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "person",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "age": { "type": "number" }
        },
        "required": ["name", "age"],
        "additionalProperties": false
      }
    }
  }
}
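
Because the reply is constrained to the schema, the client can parse it directly. A sketch, assuming the normalized response carries the JSON text in choices[0].message.content as in the non-streaming example earlier:

// Sketch: request the "person" schema above and parse the structured reply.
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'google:gemini-2.5-flash',
    messages: [{ role: 'user', content: 'Extract name and age from: John is 30 years old' }],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'person',
        strict: true,
        schema: {
          type: 'object',
          properties: { name: { type: 'string' }, age: { type: 'number' } },
          required: ['name', 'age'],
          additionalProperties: false
        }
      }
    }
  })
})
const person = JSON.parse((await res.json()).choices[0].message.content)
console.log(person.name, person.age) // "John" 30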

llama.cpp Structured Mode

Gateway llama.cpp supports structured output via grammar:

# lmbridge.env
LLAMA_STRUCTURED_MODE=auto  # auto, strict, prompt, off
LLAMA_STRUCTURED_PLAN=auto  # auto, system, off

Tool Calling

Define Tools

{
  "model": "anthropic:claude-sonnet-4",
  "messages": [...],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "search_web",
        "description": "Search the web",
        "parameters": {
          "type": "object",
          "properties": {
            "query": {
              "type": "string",
              "description": "Search query"
            }
          },
          "required": ["query"]
        }
      }
    }
  ]
}

Tool Choice

  • "auto" - Model decides
  • "none" - Never use tools
  • {"type":"function","function":{"name":"search_web"}} - Force specific tool

Handle Tool Calls

// 1. Send request with tools
const response = await fetch('/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model, messages, tools })
})
const data = await response.json()

// 2. Check for tool calls
const message = data.choices[0].message
if (message.tool_calls) {
  const toolCall = message.tool_calls[0]

  // 3. Execute tool locally
  const result = await executeToolLocally(toolCall)

  // 4. Append the assistant tool call and the tool result to the conversation
  messages.push(message)
  messages.push({
    role: 'tool',
    tool_call_id: toolCall.id,
    content: JSON.stringify(result)
  })

  // 5. Get final response
  const finalResponse = await fetch('/v1/responses', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, messages, tools })
  })
  const finalAnswer = (await finalResponse.json()).choices[0].message.content
}

Gateway Configuration

File: apps/Gateway/config/lmbridge.env

Full Configuration

# Server
GATEWAY_HOST=0.0.0.0
GATEWAY_PORT=8080

# API Keys
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
XAI_API_KEY=xai-...

# Ollama
OLLAMA_BASE_URL=http://localhost:11434

# llama.cpp
LLAMA_MODEL_DIR=/path/to/models/llama
LLAMA_GPU_LAYERS=-1
LLAMA_THREADS=8
LLAMA_TOP_K=40
LLAMA_TOP_P=0.9
LLAMA_STRUCTURED_MODE=auto
LLAMA_STRUCTURED_PLAN=auto

# Logging
LOG_LEVEL=info
SQLITE_DB_PATH=./lmbridge.sqlite

# Cache
CACHE_ENABLED=true
CACHE_TTL_SECONDS=300

Build

Build & Run

# Clone repository (URL will be provided)
cd ExtendedLM/apps/Gateway

# Set model directory
export LLAMA_MODEL_DIR=$PWD/../../models/llama

# Build (downloads llama.cpp automatically)
cargo build --release

# Run
cargo run --release --bin Gateway

Docker (Alternative)

# Build Docker image
docker build -t lmbridge .

# Run container
docker run -p 8080:8080 \
  -v $(pwd)/models:/models \
  -e LLAMA_MODEL_DIR=/models \
  lmbridge