llmfit: Find the Perfect LLM for Your Hardware
A terminal tool that right-sizes LLMs to your system's RAM, CPU, and GPU with intelligent scoring and runtime provider integration.
Introduction
With the explosion of Large Language Models (LLMs), developers and enthusiasts face a common challenge: finding the right model for their hardware. llmfit solves this problem by providing an intelligent terminal tool that detects your system specifications and scores hundreds of models across multiple dimensions.
Whether you're running on a high-end GPU server, a modest laptop with integrated graphics, or an Apple Silicon Mac, llmfit helps you identify which models will actually run well on your machine—before you download them.
What Makes llmfit Special
- Hundreds of Models: Covers models from Meta Llama, Mistral, Qwen, Google Gemma, Microsoft Phi, DeepSeek, and many more
- Multi-Dimensional Scoring: Evaluates models across quality, speed, fit, and context dimensions
- Smart Hardware Detection: Automatically detects CPU, RAM, and GPU (including VRAM) across multiple platforms
- Runtime Integration: Works with Ollama, llama.cpp, MLX, Docker Model Runner, and LM Studio
Key Features
Interactive TUI & CLI
llmfit ships with a beautiful terminal UI (TUI) by default, making it easy to navigate, search, and filter models. For automation workflows, a classic CLI mode is also available with JSON output options.
TUI Keyboard Shortcuts
| Key | Action |
|---|---|
| Up/Down | Navigate models |
| / | Enter search mode |
| f | Cycle fit filter |
| a | Cycle availability filter |
| s | Cycle sort column |
| v | Enter Visual mode |
| p | Open Plan mode |
| t | Cycle color themes |
| Enter | Toggle detail view |
| d | Download selected model |
| q | Quit |
Vim-like Modes
For power users, llmfit implements Vim-inspired modes:
- Normal mode: Default navigation and filtering
- Visual mode: Select multiple models for comparison
- Select mode: Column-based filtering with arrow keys
- Plan mode: Hardware planning for specific model configurations
Installation
Windows
scoop install llmfit
macOS / Linux
brew install llmfit
curl -fsSL https://llmfit.axjns.dev/install.sh | sh
Install without sudo:
Use this command to install to ~/.local/bin instead of system directories:
curl -fsSL https://llmfit.axjns.dev/install.sh | sh -s -- --local
Docker / Podman
docker run ghcr.io/alexsjones/llmfit
podman run ghcr.io/alexsjones/llmfit recommend --use-case coding | jq '.models[].name'
Build from Source
git clone https://github.com/AlexsJones/llmfit.git
cd llmfit
cargo build --release
# Binary at target/release/llmfit
Usage
Basic Commands
llmfit
llmfit --cli
llmfit fit --perfect -n 5
llmfit system
llmfit search "llama 8b"
llmfit info "Mistral-7B"
llmfit recommend --json --limit 5
llmfit recommend --json --use-case coding --limit 3
Hardware Planning
The Plan mode inverts the typical workflow—instead of asking "what fits my hardware?", it estimates "what hardware is needed for this model config?"
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --quant mlx-4bit
llmfit plan "Qwen/Qwen3-4B-MLX-4bit" --context 8192 --target-tps 25 --json
GPU Memory Override
When GPU VRAM auto-detection fails (broken drivers, VMs, passthrough), manually specify your GPU's VRAM:
# 32 GB VRAM
llmfit --memory=32G
# 24 GB VRAM
llmfit --memory=24G --cli
llmfit --memory=24G fit --perfect -n 5
REST API Server
llmfit can run as an HTTP API server, exposing the same scoring and recommendation data via REST endpoints:
llmfit serve --host 0.0.0.0 --port 8787
| Endpoint | Description |
|---|---|
| GET /health | Liveness check |
| GET /api/v1/system | Node hardware info |
| GET /api/v1/models | Full model list with filters |
| GET /api/v1/models/top | Top runnable models for scheduling |
| GET /api/v1/models/{model} | Search by model name |
How It Works
Hardware Detection
llmfit automatically detects your system specifications across multiple platforms:
- NVIDIA: Multi-GPU support via nvidia-smi; aggregates VRAM across all detected GPUs
- AMD: Detected via rocm-smi
- Intel Arc: Discrete VRAM via sysfs, integrated via lspci
- Apple Silicon: Unified memory via system_profiler, VRAM = system RAM
- Ascend: Detected via npu-smi
Multi-Dimensional Scoring
Each model is scored across four dimensions (0–100 each):
| Dimension | What It Measures |
|---|---|
| Quality | Parameter count, model family reputation, quantization penalty, task alignment |
| Speed | Estimated tokens/sec based on backend, params, and quantization |
| Fit | Memory utilization efficiency (sweet spot: 50–80% of available memory) |
| Context | Context window capability vs target for use case |
Dimensions are combined into a weighted composite score. Weights vary by use-case category (General, Coding, Reasoning, Chat, Multimodal, Embedding).
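As a sketch, the composite can be computed like this (the weight values below are illustrative placeholders, not llmfit's actual per-category weights):

```python
# Hypothetical use-case weight profiles; each sums to 1.0.
WEIGHTS = {
    "general": {"quality": 0.375, "speed": 0.25, "fit": 0.25, "context": 0.125},
    "coding":  {"quality": 0.5,   "speed": 0.125, "fit": 0.125, "context": 0.25},
}

def composite(scores: dict[str, float], use_case: str = "general") -> float:
    """Weighted sum of the four 0-100 dimension scores."""
    w = WEIGHTS[use_case]
    return sum(scores[dim] * w[dim] for dim in w)

composite({"quality": 80, "speed": 60, "fit": 90, "context": 70})  # 76.25
```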
Dynamic Quantization Selection
Instead of assuming a fixed quantization, llmfit walks a hierarchy from Q8_0 (best quality) down to Q2_K (most compressed), picking the highest quality that fits in available memory.
If nothing fits at full context, it automatically tries again at half context length.
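The walk can be sketched as follows (the bytes-per-parameter figures and KV-cache formula are rough approximations for illustration, not llmfit's exact model):

```python
# Quantization hierarchy, best quality first; values are approximate
# GiB of weights per billion parameters.
QUANTS = [("Q8_0", 1.0), ("Q6_K", 0.82), ("Q5_K_M", 0.72),
          ("Q4_K_M", 0.59), ("Q3_K_M", 0.48), ("Q2_K", 0.35)]

def kv_cache_gb(context: int, layers: int = 32, kv_dim: int = 4096) -> float:
    # fp16 K and V per token per layer (illustrative sizing)
    return context * layers * kv_dim * 2 * 2 / 1e9

def pick_quant(params_b: float, context: int, mem_gb: float):
    for ctx in (context, context // 2):      # retry at half context if needed
        for name, bpp in QUANTS:             # walk best quality -> smallest
            if params_b * bpp + kv_cache_gb(ctx) <= mem_gb:
                return name, ctx
    return None  # nothing fits even at half context

pick_quant(8, 8192, 12.0)  # an 8B model in 12 GB: Q8_0 misses, Q6_K fits
```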
MoE Support
Models with Mixture-of-Experts architectures (Mixtral, DeepSeek-V2/V3) are detected automatically. Only a subset of experts is active per token, so effective VRAM requirement is much lower than total parameter count suggests.
For example, Mixtral 8x7B has 46.7B total parameters but only activates ~12.9B per token, reducing VRAM from 23.9 GB to ~6.6 GB with expert offloading.
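The arithmetic behind those figures can be checked back-of-envelope, assuming a ~4-bit quantization (~0.512 bytes per parameter; an illustrative figure, not llmfit's internal constant):

```python
BYTES_PER_PARAM = 0.512  # ~4-bit quantization, illustrative

def vram_gb(params_billions: float) -> float:
    """Approximate weight memory for a given active-parameter count."""
    return params_billions * BYTES_PER_PARAM

round(vram_gb(46.7), 1)  # all 8 experts resident: 23.9 GB
round(vram_gb(12.9), 1)  # only the ~12.9B active params: 6.6 GB
```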
Speed Estimation
Token generation in LLM inference is memory-bandwidth-bound. llmfit uses actual GPU memory bandwidth to estimate throughput:
(bandwidth_GB/s / model_size_GB) × efficiency_factor
The efficiency factor (0.55) accounts for kernel overhead, KV-cache reads, and memory controller effects. The bandwidth lookup table covers ~80 GPUs across NVIDIA, AMD, and Apple Silicon families.
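The formula is simple enough to state as a function; the 1008 GB/s figure below is the published memory bandwidth of an RTX 4090, used here as an example input:

```python
EFFICIENCY = 0.55  # kernel overhead, KV-cache reads, memory controller effects

def tokens_per_sec(bandwidth_gbs: float, model_size_gb: float) -> float:
    """Memory-bandwidth-bound throughput estimate for token generation."""
    return bandwidth_gbs / model_size_gb * EFFICIENCY

round(tokens_per_sec(1008, 8.0), 1)  # RTX 4090, 8 GB quantized model: ~69.3 tok/s
```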
Runtime Provider Integration
llmfit integrates with multiple local runtime providers for seamless model management:
Ollama Integration
Automatically detects installed Ollama models, displays download status, and enables one-click pulls directly from the TUI.
- Queries GET /api/tags to list installed models
- Sends POST /api/pull to download new models
- Shows green checkmark (✓) for installed models in the Inst column
- Connects to http://localhost:11434 by default
Remote Ollama Support
Connect to Ollama on different machines using environment variables:
OLLAMA_HOST="http://192.168.1.100:11434" llmfit
llama.cpp Integration
Direct GGUF downloads from Hugging Face with local cache detection. Maps HF model names to llama.cpp tag format and marks models as installed when matching GGUF files are present locally.
Docker Model Runner Integration
Queries Docker Desktop's built-in model serving, matches models using Ollama-style tag mapping, and pulls via docker model pull.
LM Studio Integration
Connects to LM Studio's local server with built-in model download capabilities. Accepts HuggingFace model names directly and tracks download progress via polling.
Model Name Mapping
llmfit maintains an accurate mapping between HuggingFace names (e.g., Qwen/Qwen2.5-Coder-14B-Instruct) and runtime-specific naming schemes:
- Ollama: qwen2.5-coder:14b
- Docker Model Runner: ai/qwen2.5-coder:14b
- LM Studio: Direct HuggingFace name
Platform Support
| Platform | Status |
|---|---|
| Linux | Full support. GPU detection via nvidia-smi (NVIDIA), rocm-smi (AMD), sysfs/lspci (Intel Arc), and npu-smi (Ascend) |
| macOS (Apple Silicon) | Full support. Detects unified memory via system_profiler. VRAM = system RAM. Models run via Metal GPU acceleration. |
| macOS (Intel) | RAM and CPU detection works. Discrete GPU detection if nvidia-smi available. |
| Windows | RAM and CPU detection works. NVIDIA GPU detection via nvidia-smi if installed. |
| Android / Termux | CPU and RAM detection usually work. Mobile GPU autodetection not currently supported. |
GPU Detection Table
| Vendor | Detection Method | VRAM Reporting |
|---|---|---|
| NVIDIA | nvidia-smi | Exact dedicated VRAM |
| AMD | rocm-smi | Detected (VRAM may be unknown) |
| Intel Arc (discrete) | sysfs (mem_info_vram_total) | Exact dedicated VRAM |
| Intel Arc (integrated) | lspci | Shared system memory |
| Apple Silicon | system_profiler | Unified memory (= system RAM) |
| Ascend | npu-smi | Detected (VRAM may be unknown) |
Built-in Themes
Press t to cycle through 10 beautiful color themes. Your selection is automatically saved and restored on next launch:
| Theme | Description |
|---|---|
| Default | Original llmfit colors |
| Dracula | Dark purple background with pastel accents |
| Solarized | Ethan Schoonover's Solarized Dark palette |
| Nord | Arctic, cool blue-gray tones |
| Monokai | Monokai Pro warm syntax colors |
| Gruvbox | Retro groove palette with warm earth tones |
| Catppuccin Latte | 🌻 Light theme — harmonious pastel inversion |
| Catppuccin Frappé | 🪴 Low-contrast dark — muted, subdued aesthetic |
| Catppuccin Macchiato | 🌺 Medium-contrast dark — gentle, soothing tones |
| Catppuccin Mocha | 🌿 Darkest variant — cozy with color-rich accents |
Conclusion
llmfit is an essential tool for anyone working with local LLMs. By combining intelligent hardware detection, multi-dimensional model scoring, and seamless runtime provider integration, it takes the guesswork out of model selection.
Whether you're setting up a personal development environment, deploying models in production, or just exploring the capabilities of local AI, llmfit helps you make informed decisions about which models will actually perform well on your hardware.
The open-source nature, cross-platform support, and active development make llmfit a valuable addition to any LLM enthusiast's toolkit.
Get Started
Install llmfit today and discover the perfect models for your system:
# macOS/Linux (Homebrew)
brew install llmfit
# Windows (Scoop)
scoop install llmfit
# Or use the quick install script
curl -fsSL https://llmfit.axjns.dev/install.sh | sh