Gateway System
ExtendedLM Gateway is a Rust-based unified LLM gateway that exposes an OpenAI-compatible API across multiple providers and local inference backends.
Gateway provides a single API endpoint for all LLM providers (OpenAI, Anthropic, Google, xAI, Ollama) plus local model inference via llama.cpp. It handles request routing, response normalization, and caching automatically.
Key Features
- Unified API: OpenAI-compatible Responses API for all providers
- Local Inference: Built-in llama.cpp with GPU acceleration
- Streaming: Server-Sent Events (SSE) support
- Structured Output: JSON Schema validation
- Tool Calling: Function calling across providers
- Vision Support: Image input for multimodal models
- Realtime API: WebSocket-based realtime communication
- Request Logging: SQLite-based analytics
Supported Providers
- OpenAI: GPT-4o, GPT-5 series
- Anthropic: Claude Opus 4, Sonnet 4
- Google: Gemini 2.5 Flash/Pro
- xAI: Grok-4 Fast
- Ollama: Local models (Phi4, Qwen3, etc.)
- llama.cpp: GGUF models with GPU acceleration
Architecture
Components
- Rust Core: High-performance HTTP server
- llama.cpp: Statically linked for local inference
- Provider Clients: API clients for each LLM provider
- Request Router: Routes each request to the appropriate provider backend
- Response Normalizer: Normalizes provider responses to a single OpenAI-compatible format
- SQLite Logger: Request/response analytics
Request Flow
1. Client sends an OpenAI-format request to Gateway
2. Gateway parses the model key (e.g., "openai:gpt-4o") - see the sketch after this list
3. Routes the request to the appropriate provider or to llama.cpp
4. Normalizes the response to OpenAI format
5. Streams it back to the client via SSE
6. Logs the request/response to SQLite
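As an illustration of the model-key convention in step 2 (not the actual Rust router), a minimal TypeScript sketch:
type Provider = 'openai' | 'anthropic' | 'google' | 'xai' | 'ollama' | 'lmbridge'

// Illustrative only: mirrors the documented "provider:model" convention,
// not the Gateway's internal routing code.
function splitModelKey(key: string): { provider: Provider; model: string } {
  const idx = key.indexOf(':')
  if (idx < 0) throw new Error(`expected "provider:model", got "${key}"`)
  return { provider: key.slice(0, idx) as Provider, model: key.slice(idx + 1) }
}

// splitModelKey('openai:gpt-4o')      => { provider: 'openai', model: 'gpt-4o' }
// splitModelKey('lmbridge:qwen3-14b') => { provider: 'lmbridge', model: 'qwen3-14b' }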
Directory Structure
apps/Gateway/
├── src/
│ ├── main.rs # Entry point
│ ├── providers/ # Provider implementations
│ │ ├── openai.rs
│ │ ├── anthropic.rs
│ │ ├── google.rs
│ │ ├── xai.rs
│ │ └── ollama.rs
│ ├── llama/ # llama.cpp integration
│ ├── responses_api.rs # Responses API handler
│ └── realtime_api.rs # Realtime WebSocket API
├── config/
│ └── lmbridge.env # Configuration
├── Cargo.toml # Rust dependencies
└── build.rs # Build script (llama.cpp)
OpenAI Provider
Supported Models
- GPT-5: gpt-5, gpt-5-mini, gpt-5-nano
- GPT-4o: gpt-4o, gpt-4o-mini
- O-series: o1, o3-mini
Features
- Tools (function calling)
- Vision (image input)
- Streaming
- Structured outputs (JSON mode)
- Response format control
Configuration
# lmbridge.env
OPENAI_API_KEY=sk-proj-...
Model Key Format
openai:gpt-4o
openai:gpt-5-mini
openai:o3-mini
Example Request
{
"model": "openai:gpt-4o",
"messages": [
{
"role": "user",
"content": "Hello!"
}
],
"stream": true
}
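A minimal way to send the request above from a client, assuming Gateway is running locally on the default port 8080 (see Gateway Configuration); this variant sets stream to false and reads the JSON response:
// Requires a runtime with global fetch (e.g. Node 18+)
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'openai:gpt-4o',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
})
const body = await res.json()
console.log(body.choices[0].message.content)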
Anthropic (Claude) Provider
Supported Models
- Claude Opus 4: claude-opus-4, claude-opus-4.1
- Claude Sonnet 4: claude-sonnet-4
- Claude 3.7: claude-3-7-sonnet-20250219
- Claude 3.5: claude-3-5-sonnet-20241022, claude-3-5-haiku-20241022
Features
- Tools (function calling)
- Vision (image input)
- Streaming
- Extended thinking (thinking blocks)
- System prompts with caching
Configuration
# lmbridge.env
ANTHROPIC_API_KEY=sk-ant-...
Model Key Format
anthropic:claude-opus-4
anthropic:claude-sonnet-4
anthropic:claude-3-7-sonnet-20250219
Extended Thinking
Claude models support extended thinking mode for complex reasoning:
{
"model": "anthropic:claude-opus-4",
"messages": [...],
"thinking": {
"type": "enabled",
"budget_tokens": 10000
}
}
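A client sketch for the request above, assuming Gateway on localhost:8080; the shape in which thinking blocks appear in the normalized response is not documented here, so only the final message text is read:
// Assumes Gateway forwards the "thinking" field to Anthropic as shown above
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'anthropic:claude-opus-4',
    messages: [{ role: 'user', content: 'Prove that sqrt(2) is irrational.' }],
    thinking: { type: 'enabled', budget_tokens: 10000 }
  })
})
const body = await res.json()
console.log(body.choices[0].message.content)  // final answer; thinking blocks not shown here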
Google (Gemini) Provider
Supported Models
- Gemini 2.5: gemini-2.5-flash, gemini-2.5-pro
- Gemini 2.0: gemini-2.0-flash-exp
- Gemini 1.5: gemini-1.5-pro, gemini-1.5-flash
Features
- Tools (function calling)
- Vision (image/video input)
- Streaming
- Structured outputs (JSON mode)
- Large context windows (up to 2M tokens)
Configuration
# lmbridge.env
GOOGLE_API_KEY=AIza...
Model Key Format
google:gemini-2.5-flash
google:gemini-2.5-pro
Structured Output
{
"model": "google:gemini-2.5-flash",
"messages": [...],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "user_data",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"age": { "type": "number" }
}
}
}
}
}
xAI (Grok) Provider
Supported Models
- Grok-4 Fast: grok-4-fast, grok-4-fast-no-reasoning
Features
- High-speed inference
- Streaming
- Reasoning/no-reasoning modes
Configuration
# lmbridge.env
XAI_API_KEY=xai-...
Model Key Format
xai:grok-4-fast
xai:grok-4-fast-no-reasoning
Ollama Provider
Supported Models
Any model available in your Ollama instance:
- Phi4: phi4
- Qwen3: qwen3-30b, qwen3-32b, qwen3-235b
- Gemma3: gemma3-27b
- Qwen2.5-coder: qwen2.5-coder-32b
- Deepseek-r1: deepseek-r1-671b
- Llama4: llama4-128x17b
- Many more...
Features
- Self-hosted models
- Full privacy
- Customizable
- Streaming
- Vision models (llava, etc.)
Configuration
# lmbridge.env
OLLAMA_BASE_URL=http://localhost:11434
Model Key Format
ollama:phi4
ollama:qwen3-30b
ollama:deepseek-r1-671b
Set Up Ollama
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull models
ollama pull phi4
ollama pull qwen3-30b
# Start server
ollama serve
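Requests to Ollama models then go through the same Responses API using the ollama: prefix; a minimal sketch, assuming Gateway runs locally on port 8080 and phi4 has been pulled as above:
// Route a request to the local Ollama instance via Gateway
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'ollama:phi4',
    messages: [{ role: 'user', content: 'Summarize what an LLM gateway does in one sentence.' }],
    stream: false
  })
})
console.log((await res.json()).choices[0].message.content)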
Gateway llama.cpp Provider
Overview
Gateway includes built-in llama.cpp for local GGUF model inference with GPU acceleration.
Featured Models
- Gemma-3-270m-it-Q4_K_M: Test model (270M params)
- Gemma-3-4b-it-q4_0: Vision model (4B params)
- Phi-4-mini-instruct-Q4_K_M: Instruction-tuned (3.8B params)
- Qwen3-14B-Q8_0: Large context (14B params)
- PLaMo-2-translate-Q4_0: EN/JP translation
GPU Acceleration
Gateway automatically detects CUDA and offloads layers to GPU:
# lmbridge.env
LLAMA_GPU_LAYERS=-1 # -1 = all layers to GPU
LLAMA_THREADS=8 # CPU threads for non-GPU parts
Configuration
# lmbridge.env
LLAMA_MODEL_DIR=/path/to/models
LLAMA_GPU_LAYERS=-1
LLAMA_THREADS=8
LLAMA_TOP_K=40
LLAMA_TOP_P=0.9
LLAMA_STRUCTURED_MODE=auto # auto, strict, prompt, off
LLAMA_STRUCTURED_PLAN=auto # auto, system, off
Model Key Format
lmbridge:phi-4-mini-instruct
lmbridge:qwen3-14b
lmbridge:plamo-2-translate
Vision Support
For vision models, place the mmproj (multimodal projector) file alongside the model:
models/llama/
├── Gemma-3-4b-it-q4_0.gguf
└── Gemma-3-4b-it-mmproj-q4_0.gguf
Vision Request Example
{
"model": "lmbridge:gemma-3-4b-it",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,..."
}
}
]
}
]
}
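A sketch of building the base64 data URL in Node and sending the vision request above; the image path ./photo.jpg is only an example:
import { readFile } from 'node:fs/promises'

// Encode a local image as a base64 data URL for the image_url content part
const imageBytes = await readFile('./photo.jpg')
const dataUrl = `data:image/jpeg;base64,${imageBytes.toString('base64')}`

const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'lmbridge:gemma-3-4b-it',
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: "What's in this image?" },
        { type: 'image_url', image_url: { url: dataUrl } }
      ]
    }]
  })
})
console.log((await res.json()).choices[0].message.content)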
Responses API
Endpoint: POST /v1/responses
OpenAI-compatible text generation API.
Request Format
{
"model": "openai:gpt-4o",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
],
"stream": true,
"temperature": 0.7,
"max_tokens": 1000,
"tools": [...],
"tool_choice": "auto"
}
Response Format (Non-Streaming)
{
"id": "resp_123",
"object": "response",
"created": 1234567890,
"model": "openai:gpt-4o",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 10,
"completion_tokens": 8,
"total_tokens": 18
}
}
Streaming Response (SSE)
When stream: true, Gateway returns Server-Sent Events:
data: {"type":"response.start","response":{"id":"resp_123"}}
data: {"type":"content.delta","delta":{"text":"Hello"}}
data: {"type":"content.delta","delta":{"text":"!"}}
data: {"type":"response.done","response":{...}}
Tool Calling
{
"model": "openai:gpt-4o",
"messages": [...],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": { "type": "string" }
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}
Response with tool call:
{
"choices": [
{
"message": {
"role": "assistant",
"tool_calls": [
{
"id": "call_123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"Tokyo\"}"
}
}
]
}
}
]
}
Realtime API (WebSocket)
Endpoint: ws://localhost:8080/v1/realtime
WebSocket-based realtime communication with LLMs.
Connection
const ws = new WebSocket('ws://localhost:8080/v1/realtime',
['openai-realtime-v1'])
ws.onopen = () => {
console.log('Connected')
}
ws.onmessage = (event) => {
const data = JSON.parse(event.data)
console.log('Received:', data)
}
Send Message
ws.send(JSON.stringify({
type: 'response.create',
response: {
modalities: ['text'],
instructions: 'You are a helpful assistant.',
input: [
{
type: 'message',
role: 'user',
content: [
{
type: 'input_text',
text: 'Hello!'
}
]
}
]
}
}))
Receive Events
ws.onmessage = (event) => {
const msg = JSON.parse(event.data)
switch (msg.type) {
case 'response.text.delta':
console.log('Text:', msg.delta)
break
case 'response.done':
console.log('Done')
break
}
}
Service Control API
Get Status
Endpoint: GET /v1/service/status
{
"status": "ok",
"cache": {
"enabled": true,
"entries": 42
},
"uptime_seconds": 3600
}
Stop Current Inference
Endpoint: POST /v1/service/stop
Cancels the currently running inference.
Shutdown Gateway
Endpoint: POST /v1/service/shutdown
Gracefully shuts down the Gateway server.
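A small client sketch for the three endpoints above, assuming Gateway on localhost:8080; stop and shutdown are sent without request bodies since none are documented:
// Check status, then cancel any in-flight inference
const status = await (await fetch('http://localhost:8080/v1/service/status')).json()
console.log(status.status, status.uptime_seconds)

await fetch('http://localhost:8080/v1/service/stop', { method: 'POST' })

// Graceful shutdown (uncomment when you really mean it)
// await fetch('http://localhost:8080/v1/service/shutdown', { method: 'POST' })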
Model Management (llama.cpp)
List Loaded Models
Endpoint: GET /v1/llama/models
{
"models": [
{
"name": "phi-4-mini-instruct",
"path": "/models/Phi-4-mini-instruct-Q4_K_M.gguf",
"loaded": true,
"gpu_layers": 40
}
]
}
Preload Model
Endpoint: POST /v1/llama/models/load
{
"model": "phi-4-mini-instruct",
"gpu_layers": -1
}
Unload Model
Endpoint: POST /v1/llama/models/unload
{
"model": "phi-4-mini-instruct"
}
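A client sketch for the model-management endpoints above, assuming Gateway on localhost:8080:
// Preload a GGUF model with all layers on the GPU, list models, then unload it
await fetch('http://localhost:8080/v1/llama/models/load', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'phi-4-mini-instruct', gpu_layers: -1 })
})

const { models } = await (await fetch('http://localhost:8080/v1/llama/models')).json()
console.log(models)

await fetch('http://localhost:8080/v1/llama/models/unload', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'phi-4-mini-instruct' })
})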
GPU Acceleration
CUDA Support
Gateway automatically detects CUDA and uses GPU for llama.cpp inference.
Configuration
# lmbridge.env
LLAMA_GPU_LAYERS=-1 # -1 = all layers, 0 = CPU only, N = N layers to GPU
Check GPU Usage
# Terminal 1: Start Gateway
cargo run --release --bin Gateway
# Terminal 2: Monitor GPU
watch -n 1 nvidia-smi
Performance
GPU acceleration provides significant speedup:
- CPU only: ~10-20 tokens/sec (14B model)
- GPU (RTX 3090): ~80-100 tokens/sec (14B model)
- GPU (RTX 4090): ~120-150 tokens/sec (14B model)
Structured Output
JSON Mode
Force the model to output valid JSON:
{
"model": "openai:gpt-4o",
"messages": [
{
"role": "user",
"content": "Extract name and age from: John is 30 years old"
}
],
"response_format": {
"type": "json_object"
}
}
JSON Schema (Google/OpenAI)
{
"model": "google:gemini-2.5-flash",
"messages": [...],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "person",
"strict": true,
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"age": { "type": "number" }
},
"required": ["name", "age"],
"additionalProperties": false
}
}
}
}
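Because the schema constrains the output, the message content can be parsed directly as JSON; a sketch using the person schema above, assuming Gateway on localhost:8080:
// Send the json_schema request above and parse the structured result
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'google:gemini-2.5-flash',
    messages: [{ role: 'user', content: 'Extract name and age from: John is 30 years old' }],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'person',
        strict: true,
        schema: {
          type: 'object',
          properties: { name: { type: 'string' }, age: { type: 'number' } },
          required: ['name', 'age'],
          additionalProperties: false
        }
      }
    }
  })
})
const person = JSON.parse((await res.json()).choices[0].message.content)
console.log(person.name, person.age)  // e.g. "John 30"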
llama.cpp Structured Mode
Gateway's built-in llama.cpp supports structured output via grammar constraints:
# lmbridge.env
LLAMA_STRUCTURED_MODE=auto # auto, strict, prompt, off
LLAMA_STRUCTURED_PLAN=auto # auto, system, off
Tool Calling
Define Tools
{
"model": "anthropic:claude-sonnet-4",
"messages": [...],
"tools": [
{
"type": "function",
"function": {
"name": "search_web",
"description": "Search the web",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Search query"
}
},
"required": ["query"]
}
}
}
]
}
Tool Choice
"auto"- Model decides"none"- Never use tools{"type":"function","function":{"name":"search_web"}}- Force specific tool
Handle Tool Calls
// 1. Send request with tools
const response = await fetch('/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model, messages, tools })
})
const data = await response.json()

// 2. Check for tool calls
if (data.choices[0].message.tool_calls) {
  const toolCall = data.choices[0].message.tool_calls[0]

  // 3. Execute the tool (executeToolLocally is your own function)
  const result = await executeToolLocally(toolCall)

  // 4. Send the result back: append the assistant tool-call message, then the tool result
  messages.push(data.choices[0].message)
  messages.push({
    role: 'tool',
    tool_call_id: toolCall.id,
    content: JSON.stringify(result)
  })

  // 5. Get the final response
  const finalResponse = await fetch('/v1/responses', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, messages, tools })
  })
}
Gateway Configuration
File: apps/Gateway/config/lmbridge.env
Full Configuration
# Server
GATEWAY_HOST=0.0.0.0
GATEWAY_PORT=8080
# API Keys
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
XAI_API_KEY=xai-...
# Ollama
OLLAMA_BASE_URL=http://localhost:11434
# llama.cpp
LLAMA_MODEL_DIR=/path/to/models/llama
LLAMA_GPU_LAYERS=-1
LLAMA_THREADS=8
LLAMA_TOP_K=40
LLAMA_TOP_P=0.9
LLAMA_STRUCTURED_MODE=auto
LLAMA_STRUCTURED_PLAN=auto
# Logging
LOG_LEVEL=info
SQLITE_DB_PATH=./lmbridge.sqlite
# Cache
CACHE_ENABLED=true
CACHE_TTL_SECONDS=300
Build & Run
# Clone repository (URL will be provided)
cd ExtendedLM/apps/Gateway
# Set model directory
export LLAMA_MODEL_DIR=$PWD/../../models/llama
# Build (downloads llama.cpp automatically)
cargo build --release
# Run
cargo run --release --bin Gateway
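A quick smoke test once the server is up, assuming the default port 8080:
// Verify the Gateway is responding before pointing clients at it
const health = await (await fetch('http://localhost:8080/v1/service/status')).json()
console.log(health)  // expect { status: "ok", ... }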
Docker (Alternative)
# Build Docker image
docker build -t lmbridge .
# Run container
docker run -p 8080:8080 \
-v ./models:/models \
-e LLAMA_MODEL_DIR=/models \
lmbridge