A compound AI layer that orchestrates multiple LLMs in parallel, runs recursive self‑improvement cycles, detects contradictions, and converges to a robust consensus. It delivers 95–98% task accuracy at up to 150× lower cost per decision.
What breaks in production — and how MACO fixes it
A single LLM typically delivers 70–85% accuracy on complex reasoning tasks. Hallucinations, missed edge cases, and the absence of any self‑check loop are common failure modes (see arXiv:2509.23537).
Parallel consensus across 4–5 models with recursive refinement and explicit contradiction handling yields a +20–25% accuracy uplift over single‑model baselines (arXiv:2512.20184).
Direct GPT‑4 calls cost around $0.03 per 1K tokens. At 1M requests per month (≈1K tokens each), this easily becomes a $30K+/month line item on your infra bill.
Smart routing: cheap models handle filtering and bulk work; expensive models are activated only for the final consensus. Effective cost drops to ≈$0.0002 per request in many workloads.
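The arithmetic behind those two figures, as a minimal sketch (the ≈1K tokens per request is our assumption; prices are the ones quoted above):

```python
# Back-of-the-envelope cost model for the figures above.
# Assumption: ~1K tokens per request; prices as quoted on this page.
REQUESTS_PER_MONTH = 1_000_000
TOKENS_PER_REQUEST = 1_000
GPT4_USD_PER_1K_TOKENS = 0.03
ROUTED_USD_PER_REQUEST = 0.0002  # effective cost after tiered routing

direct = REQUESTS_PER_MONTH * TOKENS_PER_REQUEST / 1_000 * GPT4_USD_PER_1K_TOKENS
routed = REQUESTS_PER_MONTH * ROUTED_USD_PER_REQUEST

print(f"direct: ${direct:,.0f}/mo  routed: ${routed:,.0f}/mo  "
      f"savings: {direct / routed:.0f}x")
# direct: $30,000/mo  routed: $200/mo  savings: 150x
```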
An evolutionary step beyond “just an LLM”
Iterative cross‑pollination: each model sees the other models' answers and improves its own. Convergence typically within 2–3 iterations (capped at k_max), with iteration‑over‑iteration cosine similarity above 0.95.
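A minimal sketch of the convergence check; `refine` (one cross‑pollination round) and `embed` (any sentence‑embedding model) are stand‑ins, not fixed APIs:

```python
import numpy as np

K_MAX = 3             # refinement cap (k_max)
SIM_THRESHOLD = 0.95  # iteration-over-iteration cosine similarity cutoff

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def refine_until_converged(answers, refine, embed):
    """Each round, every model revises its answer given its peers' answers;
    stop once all answers are stable between rounds or K_MAX is reached."""
    for _ in range(K_MAX):
        revised = refine(answers)  # one cross-pollination round, all models
        stable = all(cosine(embed(old), embed(new)) >= SIM_THRESHOLD
                     for old, new in zip(answers, revised))
        answers = revised
        if stable:
            break
    return answers
```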
Automatic detection of logical conflicts between candidate solutions via a contradiction metric δ(sᵢ, sⱼ) > θ, plus resolution using specialized judge models or consensus voting.
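The pairwise conflict scan, sketched; `delta` is a stand‑in for the contradiction metric (for example, an NLI model's contradiction probability), and the θ value is illustrative:

```python
from itertools import combinations

THETA = 0.6  # contradiction threshold θ (illustrative; tuned per domain)

def find_conflicts(solutions, delta):
    """Return index pairs (i, j) whose disagreement δ(s_i, s_j) exceeds θ;
    these pairs are escalated to judge models or consensus voting."""
    return [(i, j)
            for (i, s_i), (j, s_j) in combinations(enumerate(solutions), 2)
            if delta(s_i, s_j) > THETA]
```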
Context‑aware model weights wᵢ = f(domain, history, complexity). For example, Qwen gets +20% weight on math tasks; Claude gets +15% on risk and legal analysis.
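A minimal sketch of wᵢ = f(domain, history, complexity); the domain boosts mirror the examples above, while the base term and complexity scaling are illustrative assumptions:

```python
# Domain boosts from the examples above; everything else is illustrative.
DOMAIN_BOOST = {
    ("qwen", "math"): 0.20,
    ("claude", "legal"): 0.15,
}

def model_weight(model: str, domain: str,
                 history_accuracy: float, complexity: float) -> float:
    """history_accuracy: rolling per-domain accuracy in [0, 1];
    complexity: task difficulty estimate in [0, 1]."""
    boost = DOMAIN_BOOST.get((model, domain), 0.0)
    # Harder tasks lean more heavily on historically strong models.
    return history_accuracy * (1 + boost) * (1 + 0.1 * complexity)
```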
For every final answer MACO stores the full reasoning trace: decomposition → iterations → criteria → evaluations → conflicts → final justification. Built‑in auditability.
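The persisted trace as a data structure, sketched; field names are ours for illustration, not a fixed schema:

```python
from dataclasses import dataclass, field

@dataclass
class ReasoningTrace:
    """One audit record per final answer: decomposition -> iterations ->
    criteria -> evaluations -> conflicts -> final justification."""
    query: str
    subtasks: list[str]                   # decomposition T
    iterations: list[list[str]]           # S(0), S(1), ... per refinement round
    criteria: list[str]                   # consolidated criteria C*
    scores: dict[str, dict[str, float]]   # model -> criterion -> score
    conflicts: list[tuple[int, int]] = field(default_factory=list)
    final_answer: str = ""
    justification: str = ""
```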
Tiered stack: Tier‑1 cheap models for screening, Tier‑2 balanced models for refinement, Tier‑3 premium models only for final consensus. Early stopping when confidence is high.
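The tier ladder with early stopping, sketched; tier contents and the confidence cutoff are illustrative assumptions:

```python
TIERS = [
    ["cheap-a", "cheap-b"],   # Tier-1: screening and bulk filtering
    ["balanced-a"],           # Tier-2: refinement
    ["premium-a"],            # Tier-3: final consensus only
]
CONFIDENCE_CUTOFF = 0.9       # illustrative early-stopping threshold

def route(task, run_tier):
    """`run_tier(models, task) -> (answer, confidence)` is a stand-in for
    fanning the task out to one tier and scoring the result."""
    answer, confidence = None, 0.0
    for models in TIERS:
        answer, confidence = run_tier(models, task)
        if confidence >= CONFIDENCE_CUTOFF:  # confident enough: stop early
            break
    return answer, confidence
```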
Factor analysis and clustering of evaluation criteria to remove duplicates and expose independent dimensions like Accuracy, Reasoning Depth, and Risk Coverage.
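One way to implement this step, sketched with scikit-learn (PCA then k-means over criterion embeddings; the embedding model itself is a stand‑in):

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA

def consolidate_criteria(criterion_embeddings: np.ndarray,
                         n_axes: int = 3) -> np.ndarray:
    """Collapse overlapping criteria into independent axes (C*).
    Input: one embedding row per proposed criterion. Output: a cluster
    label per criterion; each cluster becomes one consolidated axis,
    e.g. Accuracy, Reasoning Depth, or Risk Coverage."""
    reduced = PCA(n_components=n_axes).fit_transform(criterion_embeddings)
    return KMeans(n_clusters=n_axes, n_init=10).fit_predict(reduced)
```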
Backed by academic research and production‑grade systems
95–98% accuracy vs 70–85% for a single LLM
150× cost reduction: $0.0002 vs $0.03 per request
4–5 models in parallel fan‑out + consensus
Convergence within 2–3 refinement cycles
Multi‑agent orchestration outperforms single‑LLM setups by 15–27% (2026).
“Reaching Agreement Among LLM Agents” reports +22% accuracy with structured consensus.
Unsupervised cycle & contradiction detection achieves F1≈0.72 on agentic workflows.
“Compound AI Systems” defines the architecture pattern MACO builds upon.
Nine stages from raw query to audited answer
Q → T = {t₁, t₂, ..., tₙ} — break down a complex query into smaller, mostly independent subtasks.
S⁽⁰⁾ = {Mᵢ(tⱼ)} — all models process all subtasks in parallel to produce the initial solution set.
S⁽ᵏ⁺¹⁾ = refine(S⁽ᵏ⁾, {answers from peers}) until ∥S⁽ᵏ⁺¹⁾ − S⁽ᵏ⁾∥ < ε or k ≥ k_max.
If δ(sᵢ, sⱼ) > θ, the pair is flagged as a conflict and resolved via additional judging rounds.
C* = PCA/cluster(∪ᵢ Cᵢ), where Cᵢ is the criteria set proposed by model i — independent quality axes instead of ad‑hoc criteria lists.
Borda‑style aggregation over ranked criteria from all models to get a shared importance ordering.
Per‑model weights wᵢ = f(domain, history, complexity) applied when scoring each candidate solution.
s* = argmaxⱼ Σᵢ wᵢ · Eᵢⱼ — weighted consensus vote across all models and criteria (see the sketch after this list).
Full Trace: T → S⁽⁰..ᵏ⁾ → C* → conflicts → scores → s* — all persisted for audit and debugging.
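A minimal sketch of stages 6–8 (Borda aggregation plus the weighted vote), assuming per‑model scores are already collected into a matrix; all names are illustrative:

```python
import numpy as np

def borda_order(rankings: list[list[str]]) -> list[str]:
    """Stage 6: Borda aggregation of each model's ranked criteria into one
    shared importance ordering (top rank earns the most points)."""
    points: dict[str, int] = {}
    for ranking in rankings:
        n = len(ranking)
        for pos, criterion in enumerate(ranking):
            points[criterion] = points.get(criterion, 0) + (n - 1 - pos)
    return sorted(points, key=points.get, reverse=True)

def consensus_pick(E: np.ndarray, w: np.ndarray) -> int:
    """Stages 7-8: s* = argmax_j sum_i w_i * E_ij, where E[i, j] is model i's
    score for candidate j and w[i] is its context-aware weight."""
    return int(np.argmax(w @ E))

# Example: three models, two candidates -> candidate 0 wins the weighted vote.
E = np.array([[0.9, 0.4],
              [0.7, 0.8],
              [0.95, 0.5]])
w = np.array([0.5, 0.3, 0.2])
assert consensus_pick(E, w) == 0
```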
From research prototype to enterprise‑grade platform
High‑impact domains for multi‑model consensus
Earnings analysis, valuation, portfolio recommendations, and risk modeling where every mistake is expensive.
Code review, test planning, migration strategies, and architecture decisions driven by multiple specialized “reviewer” models.
Contract review, due diligence, and regulatory checks where contradiction detection between clauses is critical.
Support ticket triage, review analysis, content moderation, and personalization at marketplace scale.
Literature reviews, hypothesis generation, and peer‑review support with transparent multi‑agent reasoning.
Log analysis, threat detection, and incident summarization with focus on recall and low false‑negatives.
The LLM orchestration market is projected to reach $2.5B by 2027 (CAGR ≈67%). MACO occupies a unique position: consensus‑grade quality at radically lower cost.
R&D, pilots, initial 5‑person core team.
Scaling, GTM, enterprise sales and support.
Strategic exit ($50M+ valuation) or path to IPO.
Enterprises have experimented with LLMs. Most now face accuracy, explainability, and cost ceilings — and are actively looking for compound AI solutions.
At least four key 2025–2026 papers show that multi‑agent reasoning and consensus outperform single‑model setups on complex benchmarks.
Mature LLM APIs, reliable cloud infra, and proven orchestration patterns make now the right time to productize consensus‑grade AI.
20+ years in software engineering and production AI systems, including CAIS‑style multi‑agent architectures.
Hands‑on LLM engineering and enterprise AI experience
Founder & Chief Architect
20+ years in software engineering, LLM orchestration, and multi‑agent systems. Author of the CAIS and MACO architectures.
Asyncio, PostgreSQL, Redis, Docker, production APIs.
LLM fine‑tuning, embeddings, vector DBs, evaluation.
Kubernetes, CI/CD, observability, cost optimization.
Enterprise B2B, AI products, roadmap and discovery.
Reach out to talk about investments, partnerships, or pilot deployments in your organization.