Gateway System

ExtendedLM Gateway is a Rust-based unified LLM gateway that exposes an OpenAI-compatible API across multiple cloud providers and local inference.

Gateway Overview
What is Gateway?

Gateway provides a single API endpoint for all LLM providers (OpenAI, Anthropic, Google, xAI, OpenRouter) plus local model inference via llama.cpp and MLX. It handles request routing, response normalization, and logging automatically.

Key Features

  • Unified API: OpenAI-compatible Responses API for all providers
  • Local Inference: Built-in llama.cpp with GPU acceleration
  • Streaming: Server-Sent Events (SSE) support
  • Structured Output: JSON Schema validation
  • Tool Calling: Function calling across providers
  • Vision Support: Image input for multimodal models
  • Realtime API: WebSocket-based realtime communication
  • Request Logging: SQLite-based analytics

Supported Providers

  • OpenAI: GPT-4o, GPT-5 series
  • Anthropic: Claude Opus 4, Sonnet 4
  • Google: Gemini 2.5 Flash/Pro
  • xAI: Grok-4 Fast
  • OpenRouter: multiple models via the OpenRouter API
  • llama.cpp: GGUF models with GPU acceleration

Architecture

[Diagram: Gateway architecture]

Components

  • Rust Core: High-performance HTTP server
  • llama.cpp: Statically linked for local inference
  • Provider Clients: API clients for each LLM provider
  • Request Router: Routes requests to the appropriate provider
  • Response Normalizer: Unifies response formats
  • SQLite Logger: Request/response analytics

Request Flow

  1. Client sends OpenAI-format request to Gateway
  2. Gateway parses model key (e.g., "openai:gpt-4o")
  3. Routes to appropriate provider or llama.cpp
  4. Normalizes response to OpenAI format
  5. Streams back to client via SSE
  6. Logs request/response to SQLite
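
A minimal JavaScript sketch of this flow from the client side, assuming a gateway running locally on the default port (8080):

// Walk the flow above with one non-streaming call (Node 18+).
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'openai:gpt-4o', // step 2: the provider prefix selects the route
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false // step 5 uses SSE when true; false keeps this sketch short
  })
})
const data = await res.json() // step 4: response normalized to OpenAI format
console.log(data.choices[0].message.content)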

Directory Structure

apps/Gateway/
├── src/
│   ├── main.rs              # Entry point
│   ├── providers/           # Provider implementations
│   │   ├── openai.rs
│   │   ├── anthropic.rs
│   │   ├── google.rs
│   │   ├── xai.rs
│   │   └── openrouter.rs
│   ├── llama/               # llama.cpp integration
│   ├── responses_api.rs     # Responses API handler
│   └── realtime_api.rs      # Realtime WebSocket API
├── config/
│   └── lmbridge.env          # Configuration
├── Cargo.toml               # Rust dependencies
└── build.rs                 # Build script (llama.cpp)

OpenAI Provider

Supported Models

  • GPT-5: gpt-5, gpt-5-mini, gpt-5-nano
  • GPT-4o: gpt-4o, gpt-4o-mini
  • O-series: o1, o3-mini

Features

  • Tools (function calling)
  • Vision (image input)
  • Streaming
  • Structured outputs (JSON mode)
  • Response format control

Configuration

# lmbridge.env
OPENAI_API_KEY=sk-proj-...

Model Key Format

openai:gpt-4o
openai:gpt-5-mini
openai:o3-mini

Example Request

{
  "model": "openai:gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": true
}

Anthropic (Claude) Provider

Supported Models

  • Claude Opus 4: claude-opus-4, claude-opus-4.1
  • Claude Sonnet 4: claude-sonnet-4
  • Claude 3.7: claude-3-7-sonnet-20250219
  • Claude 3.5: claude-3-5-sonnet-20241022, claude-3-5-haiku-20241022

Features

  • Tools (function calling)
  • Vision (image input)
  • Streaming
  • Extended thinking (thinking blocks)
  • System prompts with caching

Configuration

# lmbridge.env
ANTHROPIC_API_KEY=sk-ant-...

Model Key Format

anthropic:claude-opus-4
anthropic:claude-sonnet-4
anthropic:claude-3-7-sonnet-20250219

Extended Thinking

Claude models support extended thinking mode for complex reasoning:

{
  "model": "anthropic:claude-opus-4",
  "messages": [...],
  "thinking": {
    "type": "enabled",
    "budget_tokens": 10000
  }
}

Google (Gemini) Provider

Supported Models

  • Gemini 2.5: gemini-2.5-flash, gemini-2.5-pro
  • Gemini 2.0: gemini-2.0-flash-exp
  • Gemini 1.5: gemini-1.5-pro, gemini-1.5-flash

Features

  • Tools (function calling)
  • Vision (image/video input)
  • Streaming
  • Structured outputs (JSON mode)
  • Large context windows (up to 2M tokens)

Configuration

# lmbridge.env
GOOGLE_API_KEY=AIza...

Model Key Format

google:gemini-2.5-flash
google:gemini-2.5-pro

Structured Output

{
  "model": "google:gemini-2.5-flash",
  "messages": [...],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "user_data",
      "schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "age": { "type": "number" }
        }
      }
    }
  }
}

xAI (Grok) Provider

Supported Models

  • Grok-4 Fast: grok-4-fast, grok-4-fast-no-reasoning

Features

  • High-speed inference
  • Streaming
  • Reasoning/no-reasoning modes

Configuration

# lmbridge.env
XAI_API_KEY=xai-...

Model Key Format

xai:grok-4-fast
xai:grok-4-fast-no-reasoning

OpenRouter Provider

Supported Models

Access hundreds of models through the OpenRouter API:

  • OpenAI: GPT-4o, GPT-5 series
  • Anthropic: Claude Opus, Sonnet
  • Google: Gemini models
  • Meta: Llama models
  • Mistral: Mistral, Mixtral
  • Many more via OpenRouter catalog

Features

  • Access to 100+ models through one API
  • Automatic model routing
  • Streaming support
  • OpenAI Responses API compatible

Configuration

# lmbridge.env
OPENROUTER_API_KEY=your-key-here
OPENROUTER_HTTP_REFERER=your-app-url  # Optional
OPENROUTER_X_TITLE=your-app-name      # Optional

Model Key Format

openrouter:openai/gpt-4o
openrouter:anthropic/claude-3.5-sonnet
openrouter:meta-llama/llama-3-70b

Gateway llama.cpp Provider

Overview

Gateway includes built-in llama.cpp for local GGUF model inference with GPU acceleration.

Featured Models

  • Gemma-3-270m-it-Q4_K_M: Test model (270M params)
  • Gemma-3-4b-it-q4_0: Vision model (4B params)
  • Phi-4-mini-instruct-Q4_K_M: Instruction-tuned (3.8B params)
  • Qwen3-14B-Q8_0: Large context (14B params)
  • PLaMo-2-translate-Q4_0: EN/JP translation

GPU Acceleration

Gateway automatically detects CUDA and offloads layers to GPU:

# lmbridge.env
GATEWAY_GPU_LAYERS=-1  # -1 = all layers to GPU
GATEWAY_THREADS=8      # CPU threads for non-GPU parts

Configuration

# lmbridge.env
GATEWAY_MODEL_DIR=/path/to/models
GATEWAY_GPU_LAYERS=-1
GATEWAY_THREADS=8
GATEWAY_TOP_K=40
GATEWAY_TOP_P=0.9
GATEWAY_STRUCTURED_MODE=auto  # auto, strict, prompt, off
GATEWAY_STRUCTURED_PLAN=auto  # auto, system, off

Model Key Format

lmbridge:phi-4-mini-instruct
lmbridge:qwen3-14b
lmbridge:plamo-2-translate

Vision Support

For vision models, place the mmproj (multimodal projector) file alongside the model:

models/llama/
├── Gemma-3-4b-it-q4_0.gguf
└── Gemma-3-4b-it-mmproj-q4_0.gguf

Vision Request Example

{
  "model": "lmbridge:gemma-3-4b-it",
  "messages": [
    {
      "role": "user",
      "content": [
        {
          "type": "text",
          "text": "What's in this image?"
        },
        {
          "type": "image_url",
          "image_url": {
            "url": "data:image/jpeg;base64,..."
          }
        }
      ]
    }
  ]
}
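
The base64 data URL can be built from a local file. A Node sketch (the image path and file type are illustrative):

// Build a vision request body from a local JPEG (Node).
import { readFileSync } from 'node:fs'

const b64 = readFileSync('photo.jpg').toString('base64')
const body = {
  model: 'lmbridge:gemma-3-4b-it',
  messages: [{
    role: 'user',
    content: [
      { type: 'text', text: "What's in this image?" },
      { type: 'image_url', image_url: { url: `data:image/jpeg;base64,${b64}` } }
    ]
  }]
}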

Responses API

Endpoint: POST /v1/responses

OpenAI-compatible text generation API.

Request Format

{
  "model": "openai:gpt-4o",
  "messages": [
    {
      "role": "system",
      "content": "You are a helpful assistant."
    },
    {
      "role": "user",
      "content": "Hello!"
    }
  ],
  "stream": true,
  "temperature": 0.7,
  "max_tokens": 1000,
  "tools": [...],
  "tool_choice": "auto"
}

Response Format (Non-Streaming)

{
  "id": "resp_123",
  "object": "response",
  "created": 1234567890,
  "model": "openai:gpt-4o",
  "choices": [
    {
      "index": 0,
      "message": {
        "role": "assistant",
        "content": "Hello! How can I help you?"
      },
      "finish_reason": "stop"
    }
  ],
  "usage": {
    "prompt_tokens": 10,
    "completion_tokens": 8,
    "total_tokens": 18
  }
}

Streaming Response (SSE)

When stream: true, Gateway returns Server-Sent Events:

data: {"type":"response.start","response":{"id":"resp_123"}}

data: {"type":"content.delta","delta":{"text":"Hello"}}

data: {"type":"content.delta","delta":{"text":"!"}}

data: {"type":"response.done","response":{...}}

Tool Calling

{
  "model": "openai:gpt-4o",
  "messages": [...],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "get_weather",
        "description": "Get weather for a location",
        "parameters": {
          "type": "object",
          "properties": {
            "location": { "type": "string" }
          },
          "required": ["location"]
        }
      }
    }
  ],
  "tool_choice": "auto"
}

Response with tool call:

{
  "choices": [
    {
      "message": {
        "role": "assistant",
        "tool_calls": [
          {
            "id": "call_123",
            "type": "function",
            "function": {
              "name": "get_weather",
              "arguments": "{\"location\":\"Tokyo\"}"
            }
          }
        ]
      }
    }
  ]
}

Realtime API (WebSocket)

Endpoint: ws://localhost:8080/v1/realtime

WebSocket-based realtime communication with LLMs.

Connection

const ws = new WebSocket('ws://localhost:8080/v1/realtime',
  ['openai-realtime-v1'])

ws.onopen = () => {
  console.log('Connected')
}

ws.onmessage = (event) => {
  const data = JSON.parse(event.data)
  console.log('Received:', data)
}

Send Message

ws.send(JSON.stringify({
  type: 'response.create',
  response: {
    modalities: ['text'],
    instructions: 'You are a helpful assistant.',
    input: [
      {
        type: 'message',
        role: 'user',
        content: [
          {
            type: 'input_text',
            text: 'Hello!'
          }
        ]
      }
    ]
  }
}))

Receive Events

ws.onmessage = (event) => {
  const msg = JSON.parse(event.data)

  switch (msg.type) {
    case 'response.text.delta':
      console.log('Text:', msg.delta)
      break
    case 'response.done':
      console.log('Done')
      break
  }
}

Service Control API

Get Status

Endpoint: GET /v1/service/status

{
  "status": "ok",
  "cache": {
    "enabled": true,
    "entries": 42
  },
  "uptime_seconds": 3600
}

Stop Current Inference

Endpoint: POST /v1/service/stop

Cancels the currently running inference.

Shutdown Gateway

Endpoint: POST /v1/service/shutdown

Gracefully shuts down the Gateway server.
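
All three endpoints can be driven with plain HTTP. A short sketch against a local gateway (default port assumed):

// Check status, then cancel whatever is currently generating.
const status = await fetch('http://localhost:8080/v1/service/status')
console.log(await status.json()) // { status: "ok", cache: {...}, uptime_seconds: ... }

await fetch('http://localhost:8080/v1/service/stop', { method: 'POST' })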

Model Management (llama.cpp)

List Loaded Models

Endpoint: GET /v1/local/models

{
  "models": [
    {
      "name": "phi-4-mini-instruct",
      "path": "/models/Phi-4-mini-instruct-Q4_K_M.gguf",
      "loaded": true,
      "gpu_layers": 40
    }
  ]
}

Preload Model

Endpoint: POST /v1/local/models/load

{
  "model": "phi-4-mini-instruct",
  "gpu_layers": -1
}

Unload Model

Endpoint: POST /v1/local/models/unload

{
  "model": "phi-4-mini-instruct"
}
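
Preloading at startup avoids paying the model-load cost on the first request. A sketch combining the endpoints above (local default port assumed):

// Preload a model onto the GPU, then confirm it shows up as loaded.
await fetch('http://localhost:8080/v1/local/models/load', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'phi-4-mini-instruct', gpu_layers: -1 })
})

const res = await fetch('http://localhost:8080/v1/local/models')
const { models } = await res.json()
console.log(models.filter(m => m.loaded).map(m => m.name))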

GPU Acceleration

CUDA Support

Gateway automatically detects CUDA and uses GPU for llama.cpp inference.

Configuration

# lmbridge.env
GATEWAY_GPU_LAYERS=-1  # -1 = all layers, 0 = CPU only, N = N layers to GPU

Check GPU Usage

# Terminal 1: Start Gateway
cargo run --release --bin Gateway

# Terminal 2: Monitor GPU
watch -n 1 nvidia-smi

Performance

GPU acceleration provides significant speedup:

  • CPU only: ~10-20 tokens/sec (14B model)
  • GPU (RTX 3090): ~80-100 tokens/sec (14B model)
  • GPU (RTX 4090): ~120-150 tokens/sec (14B model)

Structured Output

JSON Mode

Force the model to output valid JSON:

{
  "model": "openai:gpt-4o",
  "messages": [
    {
      "role": "user",
      "content": "Extract name and age from: John is 30 years old"
    }
  ],
  "response_format": {
    "type": "json_object"
  }
}

JSON Schema (Google/OpenAI)

{
  "model": "google:gemini-2.5-flash",
  "messages": [...],
  "response_format": {
    "type": "json_schema",
    "json_schema": {
      "name": "person",
      "strict": true,
      "schema": {
        "type": "object",
        "properties": {
          "name": { "type": "string" },
          "age": { "type": "number" }
        },
        "required": ["name", "age"],
        "additionalProperties": false
      }
    }
  }
}
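
With a strict schema the message content should parse directly. A sketch, assuming the structured payload comes back as the message content string in the usual OpenAI shape:

// Request schema-constrained output and parse the result.
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'google:gemini-2.5-flash',
    messages: [{ role: 'user', content: 'Extract: John is 30 years old' }],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'person',
        strict: true,
        schema: {
          type: 'object',
          properties: { name: { type: 'string' }, age: { type: 'number' } },
          required: ['name', 'age'],
          additionalProperties: false
        }
      }
    }
  })
})
const data = await res.json()
const person = JSON.parse(data.choices[0].message.content) // e.g. { name: "John", age: 30 }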

llama.cpp Structured Mode

Gateway llama.cpp supports structured output via grammar:

# lmbridge.env
GATEWAY_STRUCTURED_MODE=auto  # auto, strict, prompt, off
GATEWAY_STRUCTURED_PLAN=auto  # auto, system, off

Tool Calling

Define Tools

{
  "model": "anthropic:claude-sonnet-4",
  "messages": [...],
  "tools": [
    {
      "type": "function",
      "function": {
        "name": "search_web",
        "description": "Search the web",
        "parameters": {
          "type": "object",
          "properties": {
            "query": {
              "type": "string",
              "description": "Search query"
            }
          },
          "required": ["query"]
        }
      }
    }
  ]
}

Tool Choice

  • "auto" - Model decides
  • "none" - Never use tools
  • {"type":"function","function":{"name":"search_web"}} - Force specific tool

Handle Tool Calls

// 1. Send request with tools
const res = await fetch('/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model, messages, tools })
})
const response = await res.json()

// 2. Check for tool calls
const message = response.choices[0].message
if (message.tool_calls) {
  const toolCall = message.tool_calls[0]

  // 3. Execute tool
  const result = await executeToolLocally(toolCall)

  // 4. Send the assistant's tool call and the tool result back
  messages.push(message)
  messages.push({
    role: 'tool',
    tool_call_id: toolCall.id,
    content: JSON.stringify(result)
  })

  // 5. Get final response
  const finalRes = await fetch('/v1/responses', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, messages, tools })
  })
  const finalResponse = await finalRes.json()
}

Gateway Configuration

File: apps/Gateway/config/lmbridge.env

Full Configuration

# Server
GATEWAY_HOST=0.0.0.0
GATEWAY_PORT=8080

# API Keys
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
XAI_API_KEY=xai-...

# OpenRouter
OPENROUTER_API_KEY=your-key-here

# llama.cpp (local models)
GATEWAY_MODEL_DIR=/path/to/models
GATEWAY_GPU_LAYERS=-1
GATEWAY_TOP_K=40
GATEWAY_TOP_P=0.9
GATEWAY_STRUCTURED_MODE=auto
GATEWAY_STRUCTURED_PLAN=auto

# Logging
LOG_LEVEL=info
SQLITE_DB_PATH=./lmbridge.sqlite

# Cache
CACHE_ENABLED=true
CACHE_TTL_SECONDS=300

Build

Build & Run

# Clone repository (URL will be provided)
cd ExtendedLM/apps/Gateway

# Set model directory
export GATEWAY_MODEL_DIR=$PWD/../../models/llama

# Build (downloads llama.cpp automatically)
cargo build --release

# Run
cargo run --release --bin Gateway
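
Once the server is up, a quick smoke test against the status endpoint confirms it is serving (Node 18+ sketch):

// The gateway should answer on the default port.
const res = await fetch('http://localhost:8080/v1/service/status')
console.log(await res.json()) // { status: "ok", ... }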

Docker (Alternative)

# Build Docker image
docker build -t lmbridge .

# Run container
docker run -p 8080:8080 \
  -v "$(pwd)/models:/models" \
  -e GATEWAY_MODEL_DIR=/models \
  lmbridge