100% success rate across 10 cognitive categories. 15.9 req/s sustained throughput. 35ms median time-to-first-token. Running sovereign, on local silicon.
Stratified random sample of 100 responses (10 per category) scored on completeness, relevance, coherence, repetition, and category-specific format adherence.
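The stratified-sampling step can be sketched as below; `responses`, the field names, and the fixed seed are illustrative assumptions — the actual audit script and its scoring rubric aren't shown here:

```python
import random
from collections import defaultdict

def stratified_sample(responses, per_category=10, seed=42):
    """Draw a fixed-size random sample from each category stratum."""
    rng = random.Random(seed)  # fixed seed keeps the audit reproducible
    by_cat = defaultdict(list)
    for r in responses:
        by_cat[r["category"]].append(r)
    sample = []
    for cat, items in sorted(by_cat.items()):
        sample.extend(rng.sample(items, min(per_category, len(items))))
    return sample
```

Stratifying by category (rather than sampling 100 responses uniformly) guarantees every category gets equal scrutiny regardless of its share of the 10,000 queries.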
94.9% average quality score across 10 categories. Code and math responses scored perfectly. Zero incoherent, repetitive, or off-topic answers detected.
End-to-end completion latency and time-to-first-token across all 10,000 queries. Latency scales linearly with output length — no processing bottlenecks or hanging requests.
35ms median TTFT — users see tokens start flowing almost instantly. Sub-50ms at P95 indicates stable, low-variance first-token delivery. No cloud round-trips. No API throttling. Pure local silicon speed.
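Percentile figures like the P50 and P95 above can be computed with a simple nearest-rank method; a minimal sketch (the benchmark script's own implementation may differ):

```python
import math

def percentile(samples, p):
    """Nearest-rank percentile: the value at rank ceil(p/100 * n)."""
    ordered = sorted(samples)
    rank = max(1, math.ceil(p / 100 * len(ordered)))
    return ordered[rank - 1]
```

With 10,000 TTFT samples per run, the P95 sits on the 9,500th-ranked value, so a handful of slow outliers can't move it — which is why a sub-50ms P95 indicates genuinely low-variance delivery.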
| Category | Avg Tokens | p50 (ms) | p95 (ms) | tok/s |
|---|---|---|---|---|
| Translation | 25 | 329 | 399 | 76.0 |
| Creative | 44 | 558 | 786 | 78.9 |
| Classification | 61 | 807 | 966 | 75.6 |
| Extraction | 87 | 766 | 2,627 | 113.6 |
| Reasoning | 149 | 1,765 | 3,093 | 84.4 |
| Code | 178 | 2,239 | 3,178 | 79.5 |
| Summarization | 212 | 2,826 | 3,369 | 75.0 |
| Math | 214 | 2,837 | 3,391 | 75.4 |
| Factual | 229 | 3,018 | 3,453 | 75.9 |
| Roleplay | 234 | 3,079 | 3,465 | 76.0 |
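The tok/s column is consistent with average output tokens divided by median end-to-end latency — the usual way to derive effective decode speed; a quick consistency check against two table rows:

```python
def decode_rate(avg_tokens, p50_ms):
    """Tokens per second implied by median end-to-end latency."""
    return avg_tokens / (p50_ms / 1000.0)

# Spot-check against the table above
assert round(decode_rate(25, 329), 1) == 76.0    # Translation
assert round(decode_rate(149, 1765), 1) == 84.4  # Reasoning
```

Note the Extraction row's 113.6 tok/s outlier: its P95 (2,627ms) is far above its P50 (766ms), so the median-based rate overstates typical decode speed for that skewed category.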
Extracted Python code from 100 code-category responses, wrapped each snippet with a test harness, and executed it in a sandboxed subprocess. The two failures are token-limit truncations (responses cut off mid-expression at max_tokens=256), not logic errors.

Every number on this page was generated by a single script against a live vLLM instance. No cherry-picked results. Full JSONL output preserved.
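The extract-and-execute loop can be sketched as below; `extract_code` and `run_sandboxed` are illustrative names, and the real harness additionally wraps each snippet with tests before running it:

```python
import re
import subprocess
import sys

FENCE = re.compile(r"```(?:python)?\n(.*?)```", re.DOTALL)

def extract_code(response: str):
    """Pull the first fenced code block out of a model response."""
    m = FENCE.search(response)
    return m.group(1) if m else None

def run_sandboxed(code: str, timeout: float = 10.0) -> bool:
    # Execute in a fresh subprocess so a crash or infinite loop in the
    # snippet cannot take down the harness; a timeout counts as a failure.
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            timeout=timeout,
        )
        return proc.returncode == 0
    except subprocess.TimeoutExpired:
        return False
```

Subprocess isolation is what makes a 100% automated pass/fail verdict trustworthy: a snippet that hangs or segfaults is simply recorded as a failure rather than stalling the run.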
```
# 10K query stress test: 10 categories × 1,000 queries
$ python scripts/vllm_stress_test.py \
    --url http://vllm-council:8002/v1 \
    --model council --total 10000 \
    --concurrency 32 --max-tokens 256

Total queries:     10,000
Successful:        10,000 (100.0%)
Errors:            0 (0.0%)
Wall time:         628.3s
Throughput:        15.9 req/s
Total tokens:      1,432,521
Token throughput:  2,280.1 tok/s

# Quality analysis: 100 stratified samples
Overall quality: 0.949 / 1.000
Grade distribution: 94 A · 6 B · 0 C/D/F

# Code execution: 100 Python snippets verified
Passed: 98 / 100 (98.0%)
```
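The aggregate throughput figures follow directly from the reported totals; reproducing the arithmetic (the last digit of token throughput can differ slightly because the printed wall time is truncated to 0.1s):

```python
total_queries = 10_000
total_tokens = 1_432_521
wall_s = 628.3

req_per_s = total_queries / wall_s   # request throughput
tok_per_s = total_tokens / wall_s    # token throughput

assert round(req_per_s, 1) == 15.9
assert abs(tok_per_s - 2280.1) < 1.0  # matches the report within rounding
```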
All benchmarks executed on production-grade workstation hardware. Sovereign inference — zero cloud dependency, zero data leakage.