# oMLX
LLM inference, optimized for your Mac
Continuous batching and tiered KV caching, managed directly from your menu bar.
junkim.dot@gmail.com · https://omlx.ai/me
Install · Quickstart · Features · Models · CLI Configuration · Benchmarks · oMLX.ai
Every LLM server I tried made me choose between convenience and control. I wanted to pin everyday models in memory, auto-swap heavier ones on demand, set context limits, and manage it all from a menu bar. That's why I built oMLX.

oMLX persists the KV cache across a hot in-memory tier and a cold SSD tier. Even when context changes mid-conversation, all past context stays cached and reusable across requests, making local LLMs practical for real coding work with tools like Claude Code.
## Install

### macOS App
Download the .dmg from Releases, drag to Applications, done. The app includes in-app auto-update, so future upgrades are just one click.
### Homebrew

```sh
# Install
brew tap jundot/omlx https://github.com/jundot/omlx
brew install omlx

# Upgrade
brew update && brew upgrade omlx

# Run as a background service
brew services start omlx

# Optional: add MCP support to the formula's bundled environment
/opt/homebrew/opt/omlx/libexec/bin/pip install mcp
```
### From Source

```sh
git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e .          # Core only
pip install -e ".[mcp]"   # With MCP (Model Context Protocol) support
```
Requires macOS 15.0+ (Sequoia), Python 3.10+, and Apple Silicon (M1/M2/M3/M4).
## Quickstart

### macOS App
Launch oMLX from your Applications folder. The Welcome screen guides you through three steps - model directory, server start, and first model download. That's it. To connect OpenClaw, OpenCode, or Codex, see Integrations.
### CLI

```sh
omlx serve --model-dir ~/models
```
The server discovers LLMs, VLMs, embedding models, and rerankers from subdirectories automatically. Any OpenAI-compatible client can connect to http://localhost:8000/v1. A built-in chat UI is also available at http://localhost:8000/admin/chat.
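As a quick smoke test, any HTTP client works. Here is a minimal sketch using only the Python standard library; the model name in the commented call is a placeholder for whatever your server lists under /v1/models:

```python
import json
import urllib.request

BASE_URL = "http://localhost:8000/v1"

def build_chat_payload(prompt: str, model: str) -> dict:
    """Build an OpenAI-style chat completion request body."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }

def chat(prompt: str, model: str) -> str:
    """Send a non-streaming chat completion to the local oMLX server."""
    req = urllib.request.Request(
        f"{BASE_URL}/chat/completions",
        data=json.dumps(build_chat_payload(prompt, model)).encode(),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]

# With the server running:
# print(chat("Hello!", model="Qwen3-Coder-Next-8bit"))
```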
### Homebrew Service
If you installed via Homebrew, you can run oMLX as a managed background service:
```sh
brew services start omlx     # Start (auto-restarts on crash)
brew services stop omlx      # Stop
brew services restart omlx   # Restart
brew services info omlx      # Check status
```
The service runs omlx serve with zero-config defaults (~/.omlx/models, port 8000). To customize, either set environment variables (OMLX_MODEL_DIR, OMLX_PORT, etc.) or run omlx serve --model-dir /your/path once to persist settings to ~/.omlx/settings.json.
Logs are written to two locations:

- `$(brew --prefix)/var/log/omlx.log` (stdout/stderr)
- `~/.omlx/logs/server.log` (structured application log)

## Features
Supports text LLMs, vision-language models (VLM), OCR models, embeddings, and rerankers on Apple Silicon.
### Admin Dashboard
Web UI at /admin for real-time monitoring, model management, chat, benchmark, and per-model settings. Supports English, Korean, Japanese, and Chinese. All CDN dependencies are vendored for fully offline operation.
### Vision-Language Models
Run VLMs with the same continuous batching and tiered KV cache stack as text LLMs. Supports multi-image chat, base64/URL/file image inputs, and tool calling with vision context. OCR models (DeepSeek-OCR, DOTS-OCR, GLM-OCR) are auto-detected with optimized prompts.
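For image inputs over the OpenAI-compatible endpoint, a base64 data URL is one of the accepted forms. A small helper sketch (the message shape follows the standard OpenAI vision format; the MIME type parameter is an assumption for illustration):

```python
import base64

def image_message(path: str, prompt: str, mime: str = "image/png") -> dict:
    """Build a multimodal user message embedding a local image as a data URL."""
    with open(path, "rb") as f:
        b64 = base64.b64encode(f.read()).decode()
    return {
        "role": "user",
        "content": [
            {"type": "text", "text": prompt},
            {
                "type": "image_url",
                "image_url": {"url": f"data:{mime};base64,{b64}"},
            },
        ],
    }

# messages = [image_message("receipt.png", "Extract all line items.")]
# ...then POST to /v1/chat/completions with a VLM or OCR model selected.
```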
### Tiered KV Cache (Hot + Cold)

Block-based KV cache management inspired by vLLM, with prefix sharing and copy-on-write. The cache operates across two tiers: a hot in-memory tier for fast reuse and a cold SSD tier that persists cache blocks to disk.
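The block-level prefix matching can be pictured with a toy sketch (illustrative only; the class names and block size are not oMLX's actual internals): the prompt is split into fixed-size blocks, each full block is keyed by the token prefix it completes, and a new request reuses every leading block that matches before computing the rest.

```python
from dataclasses import dataclass

BLOCK = 4  # tokens per cache block (real systems use larger blocks)

@dataclass
class Block:
    tokens: tuple       # token ids covered by this block
    ref_count: int = 0  # shared via prefix matching; copy-on-write on divergence

class PrefixCache:
    """Toy hot-tier cache: full blocks are keyed by the prefix they complete."""

    def __init__(self):
        self.blocks = {}  # prefix tuple -> Block

    def insert(self, tokens):
        """Register every full block of this sequence for later reuse."""
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            prefix = tuple(tokens[: i + BLOCK])
            self.blocks.setdefault(prefix, Block(tokens=prefix))

    def match(self, tokens):
        """Return how many leading tokens are already cached (prefill savings)."""
        hit = 0
        for i in range(0, len(tokens) - len(tokens) % BLOCK, BLOCK):
            blk = self.blocks.get(tuple(tokens[: i + BLOCK]))
            if blk is None:
                break
            blk.ref_count += 1
            hit = i + BLOCK
        return hit
```

A request that shares only its first block with a cached conversation still skips prefill for that block; only the diverging tail is recomputed.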
### Continuous Batching
Handles concurrent requests through mlx-lm's BatchGenerator. Prefill and completion batch sizes are configurable.
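Conceptually, continuous batching means requests join and leave the decode loop between steps instead of waiting for an entire batch to finish. A toy sketch of that scheduling idea (not oMLX's actual scheduler):

```python
from collections import deque

class ContinuousBatcher:
    """Toy FCFS scheduler: new requests join the running batch between steps."""

    def __init__(self, max_batch: int):
        self.max_batch = max_batch
        self.waiting = deque()
        self.running = []

    def submit(self, request: dict):
        self.waiting.append(request)

    def step(self):
        # Admit waiting requests up to the completion batch size.
        while self.waiting and len(self.running) < self.max_batch:
            self.running.append(self.waiting.popleft())
        # One decode step for every active request; finished requests leave
        # immediately, freeing slots for the next step instead of stalling
        # the whole batch until its slowest member completes.
        done = []
        for req in self.running:
            req["remaining"] -= 1
            if req["remaining"] == 0:
                done.append(req)
        self.running = [r for r in self.running if r["remaining"] > 0]
        return done
```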
### Claude Code Optimization

Context scaling support for running smaller-context models with Claude Code. Reported token counts are scaled so that auto-compact triggers at the right time, and SSE keep-alives prevent read timeouts during long prefill.
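The token-count scaling idea can be sketched as follows: the client decides when to auto-compact from the usage numbers the server reports, assuming a large context window, so multiplying the reported counts by the ratio between that assumed window and the model's real one makes compaction fire at the right fraction of the real window. The 200k client window here is an assumption for illustration, not a value taken from oMLX:

```python
def scale_usage(usage: dict, model_ctx: int, client_ctx: int = 200_000) -> dict:
    """Scale reported token counts so a client that assumes a `client_ctx`
    window triggers auto-compaction at the right fraction of `model_ctx`."""
    factor = client_ctx / model_ctx
    return {k: int(v * factor) for k, v in usage.items()}
```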
### Multi-Model Serving

Load LLMs, VLMs, embedding models, and rerankers within the same server. Models are managed through a combination of automatic and manual controls: LRU eviction and per-model TTLs unload idle models automatically, and models can be loaded or unloaded manually from the admin panel.
### Per-Model Settings
Configure sampling parameters, chat template kwargs, TTL, model alias, model type override, and more per model directly from the admin panel. Changes apply immediately without server restart.
`/v1/models` returns the alias, and requests accept both the alias and the directory name.

### Built-in Chat
Chat directly with any loaded model from the admin panel. Supports conversation history, model switching, dark mode, reasoning model output, and image upload for VLM/OCR models.
### Model Downloader
Search and download MLX models from HuggingFace directly in the admin dashboard. Browse model cards, check file sizes, and download with one click.
### Integrations
Set up OpenClaw, OpenCode, and Codex directly from the admin dashboard with a single click. No manual config editing required.
### Performance Benchmark
One-click benchmarking from the admin panel. Measures prefill (PP) and text generation (TG) tokens per second, with partial prefix cache hit testing for realistic performance numbers.
### macOS Menubar App
Native PyObjC menubar app (not Electron). Start, stop, and monitor the server without opening a terminal. Includes persistent serving stats (survives restarts), auto-restart on crash, and in-app auto-update.
### API Compatibility
Drop-in replacement for OpenAI and Anthropic APIs. Supports streaming usage stats (stream_options.include_usage), Anthropic adaptive thinking, and vision inputs (base64, URL).
| Endpoint | Description |
| --- | --- |
| `POST /v1/chat/completions` | Chat completions (streaming) |
| `POST /v1/completions` | Text completions (streaming) |
| `POST /v1/messages` | Anthropic Messages API |
| `POST /v1/embeddings` | Text embeddings |
| `POST /v1/rerank` | Document reranking |
| `GET /v1/models` | List available models |
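For streaming with usage stats, the request sets `stream_options.include_usage` and the response arrives as SSE `data:` lines, with comment lines used as keep-alives. A minimal sketch of the request body and line parsing, using only the standard library:

```python
import json

def build_streaming_payload(model: str, messages: list) -> dict:
    """Chat completion request that streams tokens and ends with a usage chunk."""
    return {
        "model": model,
        "messages": messages,
        "stream": True,
        # Ask for a final chunk carrying prompt/completion token counts.
        "stream_options": {"include_usage": True},
    }

def parse_sse_line(line: str):
    """Decode one line of an SSE stream.

    Returns None for keep-alive/comment lines, the string "[DONE]" for the
    terminator, or the decoded JSON chunk otherwise."""
    line = line.strip()
    if not line.startswith("data:"):
        return None  # comment/keep-alive lines begin with ':'
    data = line[len("data:"):].strip()
    if data == "[DONE]":
        return "[DONE]"
    return json.loads(data)
```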
### Tool Calling & Structured Output
Supports all function calling formats available in mlx-lm, JSON schema validation, and MCP tool integration. Tool calling requires the model's chat template to support the tools parameter. The following model families are auto-detected via mlx-lm's built-in tool parsers:
| Model Family | Format |
| --- | --- |
| Llama, Qwen, DeepSeek, etc. | JSON `<tool_call>` |
| Qwen3.5 Series | XML `<function=...>` |
| Gemma | `<start_function_call>` |
| GLM (4.7, 5) | `<arg_key>`/`<arg_value>` XML |
| MiniMax | Namespaced `<minimax:tool_call>` |
| Mistral | `[TOOL_CALLS]` |
| Kimi K2 | `<\|tool_calls_section_begin\|>` |
| Longcat | `<longcat_tool_call>` |
Models not listed above may still work if their chat template accepts `tools` and their output uses a recognized `<tool_call>` XML format. During tool-enabled streaming, assistant text is emitted incrementally while known tool-call control markup is suppressed from the visible content; structured tool calls are emitted once the completed turn has been parsed.
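A tool-calling request looks the same as against OpenAI's API: function schemas go in `tools`, and parsed calls come back in the response's `tool_calls`. A sketch with a hypothetical `get_weather` tool:

```python
def weather_tool() -> dict:
    """OpenAI-style function tool definition (hypothetical example tool)."""
    return {
        "type": "function",
        "function": {
            "name": "get_weather",
            "description": "Get the current weather for a city.",
            "parameters": {
                "type": "object",
                "properties": {"city": {"type": "string"}},
                "required": ["city"],
            },
        },
    }

def build_tool_request(model: str, prompt: str) -> dict:
    """Chat completion request offering the tool; the model decides whether
    to call it, and any call appears in choices[0].message.tool_calls."""
    return {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "tools": [weather_tool()],
        "tool_choice": "auto",
    }
```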
## Models
Point --model-dir at a directory containing MLX-format model subdirectories. Two-level organization folders (e.g., mlx-community/model-name/) are also supported.
```
~/models/
├── Step-3.5-Flash-8bit/
├── Qwen3-Coder-Next-8bit/
├── gpt-oss-120b-MXFP4-Q8/
├── Qwen3.5-122B-A10B-4bit/
└── bge-m3/
```
Models are auto-detected by type. You can also download models directly from the admin dashboard.
| Type | Models |
| --- | --- |
| LLM | Any model supported by mlx-lm |
| VLM | Qwen3.5 Series, GLM-4V, Pixtral, and other mlx-vlm models |
| OCR | DeepSeek-OCR, DOTS-OCR, GLM-OCR |
| Embedding | BERT, BGE-M3, ModernBERT |
| Reranker | ModernBERT, XLM-RoBERTa |
## CLI Configuration

```sh
omlx serve --model-dir ~/models --max-model-memory 32GB               # cap memory for loaded models
omlx serve --model-dir ~/models --max-process-memory 80%              # cap total process memory
omlx serve --model-dir ~/models --paged-ssd-cache-dir ~/.omlx/cache   # enable the SSD cache tier
omlx serve --model-dir ~/models --hot-cache-max-size 20%              # size the in-memory cache tier
omlx serve --model-dir ~/models --prefill-batch-size 8 --completion-batch-size 32
omlx serve --model-dir ~/models --mcp-config mcp.json                 # MCP tool integration
omlx serve --model-dir ~/models --hf-endpoint https://hf-mirror.com   # HuggingFace mirror
omlx serve --model-dir ~/models --api-key your-secret-key             # require an API key
```
All settings can also be configured from the web admin panel at /admin. Settings are persisted to ~/.omlx/settings.json, and CLI flags take precedence.
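The precedence just described (defaults, then persisted settings, then CLI flags) amounts to a layered merge. A sketch of the rule, not oMLX's actual loading code, and the keys here are hypothetical:

```python
import json
from pathlib import Path

def effective_settings(cli_args: dict, settings_path: str, defaults: dict) -> dict:
    """Layer settings: defaults < persisted settings.json < explicit CLI flags.

    CLI flags that were not passed (None) do not override persisted values."""
    merged = dict(defaults)
    path = Path(settings_path).expanduser()
    if path.exists():
        merged.update(json.loads(path.read_text()))
    merged.update({k: v for k, v in cli_args.items() if v is not None})
    return merged
```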
## Architecture

```
FastAPI Server (OpenAI / Anthropic API)
│
├── EnginePool (multi-model, LRU eviction, TTL, manual load/unload)
│   ├── BatchedEngine (LLMs, continuous batching)
│   ├── VLMEngine (vision-language models)
│   ├── EmbeddingEngine
│   └── RerankerEngine
│
├── ProcessMemoryEnforcer (total memory limit, TTL checks)
│
├── Scheduler (FCFS, configurable batch sizes)
│   └── mlx-lm BatchGenerator
│
└── Cache Stack
    ├── PagedCacheManager (GPU, block-based, CoW, prefix sharing)
    ├── Hot Cache (in-memory tier, write-back)
    └── PagedSSDCacheManager (SSD cold tier, safetensors format)
```
## Development

### CLI Server
```sh
git clone https://github.com/jundot/omlx.git
cd omlx
pip install -e ".[dev]"
pytest -m "not slow"
```
### macOS App
Requires Python 3.11+ and venvstacks (pip install venvstacks).
```sh
cd packaging
python build.py              # full build
python build.py --skip-venv  # skip rebuilding the Python environment layers
python build.py --dmg-only   # repackage the .dmg only
```
See packaging/README.md for details on the app bundle structure and layer configuration.
## Contributing
Contributions are welcome! See Contributing Guide for details.
## License

## Acknowledgments