Gateway System
ExtendedLM Gateway is a Rust-based unified LLM gateway that exposes an OpenAI-compatible API across multiple providers and local inference backends.
Gateway provides a single API endpoint for all LLM providers (OpenAI, Anthropic, Google, xAI, Ollama) plus local model inference via llama.cpp. It handles request routing, response normalization, and caching automatically.
Key Features
- Unified API: OpenAI-compatible Responses API for all providers
- Local Inference: Built-in llama.cpp with GPU acceleration
- Streaming: Server-Sent Events (SSE) support
- Structured Output: JSON Schema validation
- Tool Calling: Function calling across providers
- Vision Support: Image input for multimodal models
- Realtime API: WebSocket-based realtime communication
- Request Logging: SQLite-based analytics
Supported Providers
- OpenAI: GPT-4o, GPT-5 series
- Anthropic: Claude Opus 4, Sonnet 4
- Google: Gemini 2.5 Flash/Pro
- xAI: Grok-4 Fast
- Ollama: Local models (Phi4, Qwen3, etc.)
- llama.cpp: GGUF models with GPU acceleration
Architecture
Components
- Rust Core: High-performance HTTP server
- llama.cpp: Statically linked for local inference
- Provider Clients: API clients for each LLM provider
- Request Router: Routes each request to the appropriate provider backend
- Response Normalizer: Normalizes provider responses to a single OpenAI-compatible format
- SQLite Logger: Request/response analytics
Request Flow
1. Client sends an OpenAI-format request to Gateway
2. Gateway parses the model key (e.g., "openai:gpt-4o") - see the sketch after this list
3. Routes the request to the appropriate provider or to llama.cpp
4. Normalizes the response to OpenAI format
5. Streams it back to the client via SSE
6. Logs the request/response to SQLite
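As an illustration of the model-key convention in step 2 (not the actual Rust router), a minimal TypeScript sketch:
type Provider = 'openai' | 'anthropic' | 'google' | 'xai' | 'ollama' | 'lmbridge'

// Illustrative only: mirrors the documented "provider:model" convention,
// not the Gateway's internal routing code.
function splitModelKey(key: string): { provider: Provider; model: string } {
  const idx = key.indexOf(':')
  if (idx < 0) throw new Error(`expected "provider:model", got "${key}"`)
  return { provider: key.slice(0, idx) as Provider, model: key.slice(idx + 1) }
}

// splitModelKey('openai:gpt-4o')      => { provider: 'openai', model: 'gpt-4o' }
// splitModelKey('lmbridge:qwen3-14b') => { provider: 'lmbridge', model: 'qwen3-14b' }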
Directory Structure
apps/Gateway/
├── src/
│ ├── main.rs # Entry point
│ ├── providers/ # Provider implementations
│ │ ├── openai.rs
│ │ ├── anthropic.rs
│ │ ├── google.rs
│ │ ├── xai.rs
│ │ └── ollama.rs
│ ├── llama/ # llama.cpp integration
│ ├── responses_api.rs # Responses API handler
│ └── realtime_api.rs # Realtime WebSocket API
├── config/
│ └── lmbridge.env # Configuration
├── Cargo.toml # Rust dependencies
└── build.rs # Build script (llama.cpp)
OpenAI Provider
Supported Models
- GPT-5: gpt-5, gpt-5-mini, gpt-5-nano
- GPT-4o: gpt-4o, gpt-4o-mini
- O-series: o1, o3-mini
Features
- Tools (function calling)
- Vision (image input)
- Streaming
- Structured outputs (JSON mode)
- Response format control
Configuration
# lmbridge.env
OPENAI_API_KEY=sk-proj-...
Model Key Format
openai:gpt-4o
openai:gpt-5-mini
openai:o3-mini
Example Request
{
"model": "openai:gpt-4o",
"messages": [
{
"role": "user",
"content": "Hello!"
}
],
"stream": true
}
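A minimal way to send the request above from a client, assuming Gateway is running locally on the default port 8080 (see Gateway Configuration); this variant sets stream to false and reads the JSON response:
// Requires a runtime with global fetch (e.g. Node 18+)
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'openai:gpt-4o',
    messages: [{ role: 'user', content: 'Hello!' }],
    stream: false
  })
})
const body = await res.json()
console.log(body.choices[0].message.content)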
Anthropic (Claude) Provider
Supported Models
- Claude Opus 4: claude-opus-4, claude-opus-4.1
- Claude Sonnet 4: claude-sonnet-4
- Claude 3.7: claude-3-7-sonnet-20250219
- Claude 3.5: claude-3-5-sonnet-20241022, claude-3-5-haiku-20241022
Features
- Tools (function calling)
- Vision (image input)
- Streaming
- Extended thinking (thinking blocks)
- System prompts with caching
Configuration
# lmbridge.env
ANTHROPIC_API_KEY=sk-ant-...
Model Key Format
anthropic:claude-opus-4
anthropic:claude-sonnet-4
anthropic:claude-3-7-sonnet-20250219
Extended Thinking
Claude models support extended thinking mode for complex reasoning:
{
"model": "anthropic:claude-opus-4",
"messages": [...],
"thinking": {
"type": "enabled",
"budget_tokens": 10000
}
}
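A client sketch for the request above, assuming Gateway on localhost:8080; the shape in which thinking blocks appear in the normalized response is not documented here, so only the final message text is read:
// Assumes Gateway forwards the "thinking" field to Anthropic as shown above
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'anthropic:claude-opus-4',
    messages: [{ role: 'user', content: 'Prove that sqrt(2) is irrational.' }],
    thinking: { type: 'enabled', budget_tokens: 10000 }
  })
})
const body = await res.json()
console.log(body.choices[0].message.content)  // final answer; thinking blocks not shown here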
Google (Gemini) Provider
Supported Models
- Gemini 2.5: gemini-2.5-flash, gemini-2.5-pro
- Gemini 2.0: gemini-2.0-flash-exp
- Gemini 1.5: gemini-1.5-pro, gemini-1.5-flash
Features
- Tools (function calling)
- Vision (image/video input)
- Streaming
- Structured outputs (JSON mode)
- Large context windows (up to 2M tokens)
Configuration
# lmbridge.env
GOOGLE_API_KEY=AIza...
Model Key Format
google:gemini-2.5-flash
google:gemini-2.5-pro
Structured Output
{
"model": "google:gemini-2.5-flash",
"messages": [...],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "user_data",
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"age": { "type": "number" }
}
}
}
}
}
xAI (Grok) Provider
Supported Models
- Grok-4 Fast: grok-4-fast, grok-4-fast-no-reasoning
Features
- High-speed inference
- Streaming
- Reasoning/no-reasoning modes
Configuration
# lmbridge.env
XAI_API_KEY=xai-...
Model Key Format
xai:grok-4-fast
xai:grok-4-fast-no-reasoning
Ollama Provider
Supported Models
Any model available in your Ollama instance:
- Phi4: phi4
- Qwen3: qwen3-30b, qwen3-32b, qwen3-235b
- Gemma3: gemma3-27b
- Qwen2.5-coder: qwen2.5-coder-32b
- Deepseek-r1: deepseek-r1-671b
- Llama4: llama4-128x17b
- Many more...
Features
- Self-hosted models
- Full privacy
- Customizable
- Streaming
- Vision models (llava, etc.)
Configuration
# lmbridge.env
OLLAMA_BASE_URL=http://localhost:11434
Model Key Format
ollama:phi4
ollama:qwen3-30b
ollama:deepseek-r1-671b
Set Up Ollama
# Install Ollama
curl -fsSL https://ollama.com/install.sh | sh
# Pull models
ollama pull phi4
ollama pull qwen3-30b
# Start server
ollama serve
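Requests to Ollama models then go through the same Responses API using the ollama: prefix; a minimal sketch, assuming Gateway runs locally on port 8080 and phi4 has been pulled as above:
// Route a request to the local Ollama instance via Gateway
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'ollama:phi4',
    messages: [{ role: 'user', content: 'Summarize what an LLM gateway does in one sentence.' }],
    stream: false
  })
})
console.log((await res.json()).choices[0].message.content)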
Gateway llama.cpp Provider
Overview
Gateway includes built-in llama.cpp for local GGUF model inference with GPU acceleration.
Featured Models
- Gemma-3-270m-it-Q4_K_M: Test model (270M params)
- Gemma-3-4b-it-q4_0: Vision model (4B params)
- Phi-4-mini-instruct-Q4_K_M: Instruction-tuned (3.8B params)
- Qwen3-14B-Q8_0: Large context (14B params)
- PLaMo-2-translate-Q4_0: EN/JP translation
GPU Acceleration
Gateway automatically detects CUDA and offloads layers to GPU:
# lmbridge.env
LLAMA_GPU_LAYERS=-1 # -1 = all layers to GPU
LLAMA_THREADS=8 # CPU threads for non-GPU parts
Configuration
# lmbridge.env
LLAMA_MODEL_DIR=/path/to/models
LLAMA_GPU_LAYERS=-1
LLAMA_THREADS=8
LLAMA_TOP_K=40
LLAMA_TOP_P=0.9
LLAMA_STRUCTURED_MODE=auto # auto, strict, prompt, off
LLAMA_STRUCTURED_PLAN=auto # auto, system, off
Model Key Format
lmbridge:phi-4-mini-instruct
lmbridge:qwen3-14b
lmbridge:plamo-2-translate
Vision Support
For vision models, place the mmproj (multimodal projector) file alongside the model:
models/llama/
├── Gemma-3-4b-it-q4_0.gguf
└── Gemma-3-4b-it-mmproj-q4_0.gguf
Vision Request Example
{
"model": "lmbridge:gemma-3-4b-it",
"messages": [
{
"role": "user",
"content": [
{
"type": "text",
"text": "What's in this image?"
},
{
"type": "image_url",
"image_url": {
"url": "data:image/jpeg;base64,..."
}
}
]
}
]
}
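A sketch of building the base64 data URL in Node and sending the vision request above; the image path ./photo.jpg is only an example:
import { readFile } from 'node:fs/promises'

// Encode a local image as a base64 data URL for the image_url content part
const imageBytes = await readFile('./photo.jpg')
const dataUrl = `data:image/jpeg;base64,${imageBytes.toString('base64')}`

const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'lmbridge:gemma-3-4b-it',
    messages: [{
      role: 'user',
      content: [
        { type: 'text', text: "What's in this image?" },
        { type: 'image_url', image_url: { url: dataUrl } }
      ]
    }]
  })
})
console.log((await res.json()).choices[0].message.content)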
Responses API
Endpoint: POST /v1/responses
OpenAI-compatible text generation API.
Request Format
{
"model": "openai:gpt-4o",
"messages": [
{
"role": "system",
"content": "You are a helpful assistant."
},
{
"role": "user",
"content": "Hello!"
}
],
"stream": true,
"temperature": 0.7,
"max_tokens": 1000,
"tools": [...],
"tool_choice": "auto"
}
Response Format (Non-Streaming)
{
"id": "resp_123",
"object": "response",
"created": 1234567890,
"model": "openai:gpt-4o",
"choices": [
{
"index": 0,
"message": {
"role": "assistant",
"content": "Hello! How can I help you?"
},
"finish_reason": "stop"
}
],
"usage": {
"prompt_tokens": 10,
"completion_tokens": 8,
"total_tokens": 18
}
}
Streaming Response (SSE)
When stream: true, Gateway returns Server-Sent Events:
data: {"type":"response.start","response":{"id":"resp_123"}}
data: {"type":"content.delta","delta":{"text":"Hello"}}
data: {"type":"content.delta","delta":{"text":"!"}}
data: {"type":"response.done","response":{...}}
Tool Calling
{
"model": "openai:gpt-4o",
"messages": [...],
"tools": [
{
"type": "function",
"function": {
"name": "get_weather",
"description": "Get weather for a location",
"parameters": {
"type": "object",
"properties": {
"location": { "type": "string" }
},
"required": ["location"]
}
}
}
],
"tool_choice": "auto"
}
Response with tool call:
{
"choices": [
{
"message": {
"role": "assistant",
"tool_calls": [
{
"id": "call_123",
"type": "function",
"function": {
"name": "get_weather",
"arguments": "{\"location\":\"Tokyo\"}"
}
}
]
}
}
]
}
Realtime API (WebSocket)
Endpoint: ws://localhost:8080/v1/realtime
WebSocket-based realtime communication with LLMs.
Connection
const ws = new WebSocket('ws://localhost:8080/v1/realtime',
['openai-realtime-v1'])
ws.onopen = () => {
console.log('Connected')
}
ws.onmessage = (event) => {
const data = JSON.parse(event.data)
console.log('Received:', data)
}
Send Message
ws.send(JSON.stringify({
type: 'response.create',
response: {
modalities: ['text'],
instructions: 'You are a helpful assistant.',
input: [
{
type: 'message',
role: 'user',
content: [
{
type: 'input_text',
text: 'Hello!'
}
]
}
]
}
}))
Receive Events
ws.onmessage = (event) => {
const msg = JSON.parse(event.data)
switch (msg.type) {
case 'response.text.delta':
console.log('Text:', msg.delta)
break
case 'response.done':
console.log('Done')
break
}
}
Service Control API
Get Status
Endpoint: GET /v1/service/status
{
"status": "ok",
"cache": {
"enabled": true,
"entries": 42
},
"uptime_seconds": 3600
}
Stop Current Inference
Endpoint: POST /v1/service/stop
Cancels the currently running inference.
Shutdown Gateway
Endpoint: POST /v1/service/shutdown
Gracefully shuts down the Gateway server.
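A small client sketch for the three endpoints above, assuming Gateway on localhost:8080; stop and shutdown are sent without request bodies since none are documented:
// Check status, then cancel any in-flight inference
const status = await (await fetch('http://localhost:8080/v1/service/status')).json()
console.log(status.status, status.uptime_seconds)

await fetch('http://localhost:8080/v1/service/stop', { method: 'POST' })

// Graceful shutdown (uncomment when you really mean it)
// await fetch('http://localhost:8080/v1/service/shutdown', { method: 'POST' })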
Model Management (llama.cpp)
List Loaded Models
Endpoint: GET /v1/llama/models
{
"models": [
{
"name": "phi-4-mini-instruct",
"path": "/models/Phi-4-mini-instruct-Q4_K_M.gguf",
"loaded": true,
"gpu_layers": 40
}
]
}
Preload Model
Endpoint: POST /v1/llama/models/load
{
"model": "phi-4-mini-instruct",
"gpu_layers": -1
}
Unload Model
Endpoint: POST /v1/llama/models/unload
{
"model": "phi-4-mini-instruct"
}
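A client sketch for the model-management endpoints above, assuming Gateway on localhost:8080:
// Preload a GGUF model with all layers on the GPU, list models, then unload it
await fetch('http://localhost:8080/v1/llama/models/load', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'phi-4-mini-instruct', gpu_layers: -1 })
})

const { models } = await (await fetch('http://localhost:8080/v1/llama/models')).json()
console.log(models)

await fetch('http://localhost:8080/v1/llama/models/unload', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model: 'phi-4-mini-instruct' })
})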
GPU Acceleration
CUDA Support
Gateway automatically detects CUDA and uses GPU for llama.cpp inference.
Configuration
# lmbridge.env
LLAMA_GPU_LAYERS=-1 # -1 = all layers, 0 = CPU only, N = N layers to GPU
Check GPU Usage
# Terminal 1: Start Gateway
cargo run --release --bin Gateway
# Terminal 2: Monitor GPU
watch -n 1 nvidia-smi
Performance
GPU acceleration provides significant speedup:
- CPU only: ~10-20 tokens/sec (14B model)
- GPU (RTX 3090): ~80-100 tokens/sec (14B model)
- GPU (RTX 4090): ~120-150 tokens/sec (14B model)
Structured Output
JSON Mode
Force the model to output valid JSON:
{
"model": "openai:gpt-4o",
"messages": [
{
"role": "user",
"content": "Extract name and age from: John is 30 years old"
}
],
"response_format": {
"type": "json_object"
}
}
JSON Schema (Google/OpenAI)
{
"model": "google:gemini-2.5-flash",
"messages": [...],
"response_format": {
"type": "json_schema",
"json_schema": {
"name": "person",
"strict": true,
"schema": {
"type": "object",
"properties": {
"name": { "type": "string" },
"age": { "type": "number" }
},
"required": ["name", "age"],
"additionalProperties": false
}
}
}
}
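Because the schema constrains the output, the message content can be parsed directly as JSON; a sketch using the person schema above, assuming Gateway on localhost:8080:
// Send the json_schema request above and parse the structured result
const res = await fetch('http://localhost:8080/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({
    model: 'google:gemini-2.5-flash',
    messages: [{ role: 'user', content: 'Extract name and age from: John is 30 years old' }],
    response_format: {
      type: 'json_schema',
      json_schema: {
        name: 'person',
        strict: true,
        schema: {
          type: 'object',
          properties: { name: { type: 'string' }, age: { type: 'number' } },
          required: ['name', 'age'],
          additionalProperties: false
        }
      }
    }
  })
})
const person = JSON.parse((await res.json()).choices[0].message.content)
console.log(person.name, person.age)  // e.g. "John 30"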
llama.cpp Structured Mode
Gateway's built-in llama.cpp supports structured output via grammar constraints:
# lmbridge.env
LLAMA_STRUCTURED_MODE=auto # auto, strict, prompt, off
LLAMA_STRUCTURED_PLAN=auto # auto, system, off
Tool Calling
Define Tools
{
"model": "anthropic:claude-sonnet-4",
"messages": [...],
"tools": [
{
"type": "function",
"function": {
"name": "search_web",
"description": "Search the web",
"parameters": {
"type": "object",
"properties": {
"query": {
"type": "string",
"description": "Search query"
}
},
"required": ["query"]
}
}
}
]
}
Tool Choice
"auto"- Model decides"none"- Never use tools{"type":"function","function":{"name":"search_web"}}- Force specific tool
Handle Tool Calls
// 1. Send request with tools
const response = await fetch('/v1/responses', {
  method: 'POST',
  headers: { 'Content-Type': 'application/json' },
  body: JSON.stringify({ model, messages, tools })
})
const data = await response.json()

// 2. Check for tool calls
if (data.choices[0].message.tool_calls) {
  const toolCall = data.choices[0].message.tool_calls[0]

  // 3. Execute the tool (executeToolLocally is your own function)
  const result = await executeToolLocally(toolCall)

  // 4. Send the result back: append the assistant tool-call message, then the tool result
  messages.push(data.choices[0].message)
  messages.push({
    role: 'tool',
    tool_call_id: toolCall.id,
    content: JSON.stringify(result)
  })

  // 5. Get the final response
  const finalResponse = await fetch('/v1/responses', {
    method: 'POST',
    headers: { 'Content-Type': 'application/json' },
    body: JSON.stringify({ model, messages, tools })
  })
}
Gateway Configuration
File: apps/Gateway/config/lmbridge.env
Full Configuration
# Server
GATEWAY_HOST=0.0.0.0
GATEWAY_PORT=8080
# API Keys
OPENAI_API_KEY=sk-proj-...
ANTHROPIC_API_KEY=sk-ant-...
GOOGLE_API_KEY=AIza...
XAI_API_KEY=xai-...
# Ollama
OLLAMA_BASE_URL=http://localhost:11434
# llama.cpp
LLAMA_MODEL_DIR=/path/to/models/llama
LLAMA_GPU_LAYERS=-1
LLAMA_THREADS=8
LLAMA_TOP_K=40
LLAMA_TOP_P=0.9
LLAMA_STRUCTURED_MODE=auto
LLAMA_STRUCTURED_PLAN=auto
# Logging
LOG_LEVEL=info
SQLITE_DB_PATH=./lmbridge.sqlite
# Cache
CACHE_ENABLED=true
CACHE_TTL_SECONDS=300
Build & Run
# Clone repository (URL will be provided)
cd ExtendedLM/apps/Gateway
# Set model directory
export LLAMA_MODEL_DIR=$PWD/../../models/llama
# Build (downloads llama.cpp automatically)
cargo build --release
# Run
cargo run --release --bin Gateway
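A quick smoke test once the server is up, assuming the default port 8080:
// Verify the Gateway is responding before pointing clients at it
const health = await (await fetch('http://localhost:8080/v1/service/status')).json()
console.log(health)  // expect { status: "ok", ... }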
Docker (Alternative)
# Build Docker image
docker build -t lmbridge .
# Run container
docker run -p 8080:8080 \
-v ./models:/models \
-e LLAMA_MODEL_DIR=/models \
lmbridge