🚀 A new generation of LLM orchestration

MACO
Multi‑Model Consensus Orchestrator
for mission‑critical AI systems

A compound AI layer that orchestrates multiple LLMs in parallel, runs recursive self‑improvement cycles, detects contradictions, and converges to a robust consensus. Deliver 95–98% task accuracy with up to 150× lower cost per decision.

[Pipeline diagram: user query Q → GPT‑4 · Claude · Qwen · DeepSeek → consensus → final answer S*]

The state of LLM systems today

What breaks in production — and how MACO fixes it

Problem: Limited accuracy & hallucinations

A single LLM typically delivers 70–85% accuracy on complex reasoning tasks. Hallucinations, missing edge cases, and no self‑check loop are common failure modes (see arXiv:2509.23537).

MACO: 95–98% consensus accuracy

Parallel consensus across 4–5 models with recursive refinement and explicit contradiction handling yields +20–25% accuracy uplift over single‑model baselines (arXiv:2512.20184).

Problem: High per‑task cost at scale

Direct GPT‑4 calls cost around $0.03 per 1K tokens. At 1M requests per month (roughly 1K tokens each), that easily becomes a $30K+/month line item on your infra bill.

MACO: Up to 150× cost reduction

Smart routing: cheap models handle filtering and bulk work; expensive models are activated only for the final consensus. Effective cost drops to ≈$0.0002 per request in many workloads.
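A back‑of‑envelope check of the headline numbers, using only the per‑request figures quoted above (illustrative arithmetic, not measured values):

```python
# Illustrative cost arithmetic from the figures quoted above.
requests_per_month = 1_000_000
gpt4_cost = requests_per_month * 0.03      # ≈ $0.03 per ~1K-token request
maco_cost = requests_per_month * 0.0002    # effective per-request cost via routing

print(f"${gpt4_cost:,.0f} vs ${maco_cost:,.0f} per month")  # $30,000 vs $200
print(f"{gpt4_cost / maco_cost:.0f}x reduction")            # 150x
```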

Key capabilities of MACO

An evolutionary step beyond “just an LLM”

Recursive self‑refinement

Iterative cross‑pollination: each model sees the other models’ answers and improves its own. Convergence typically within 2–3 iterations (k_max), with cosine similarity above 0.95.
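A minimal sketch of what one refinement cycle could look like; `models`, `solve`, `refine`, and `embed` are hypothetical stand‑ins, not MACO’s actual API:

```python
import numpy as np

K_MAX = 3             # maximum refinement iterations (k_max)
SIM_THRESHOLD = 0.95  # cosine similarity above this counts as converged

def cosine(a: np.ndarray, b: np.ndarray) -> float:
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def refine_until_converged(models, task, embed):
    answers = [m.solve(task) for m in models]  # S(0): initial answers
    for _ in range(K_MAX):
        prev = [embed(a) for a in answers]
        # Cross-pollination: each model revises its answer after seeing peers'.
        answers = [
            m.refine(task, own=answers[i], peers=answers[:i] + answers[i + 1:])
            for i, m in enumerate(models)
        ]
        # Stop once every answer is nearly identical to its previous version.
        if all(cosine(p, embed(a)) >= SIM_THRESHOLD
               for p, a in zip(prev, answers)):
            break
    return answers
```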

Contradiction detection

Automatic detection of logical conflicts between candidate solutions via a contradiction metric δ(sᵢ, sⱼ) > θ, plus resolution using specialized judge models or consensus voting.
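An illustrative pairwise check, modeling δ as an NLI‑style contradiction probability; `contradiction_prob` is an assumed scoring function, not a MACO internal:

```python
from itertools import combinations

THETA = 0.5  # contradiction threshold θ (placeholder value)

def find_conflicts(candidates: list[str], contradiction_prob) -> list[tuple]:
    """Return (i, j, delta) for every candidate pair with δ(sᵢ, sⱼ) > θ."""
    conflicts = []
    for i, j in combinations(range(len(candidates)), 2):
        delta = contradiction_prob(candidates[i], candidates[j])
        if delta > THETA:
            conflicts.append((i, j, delta))  # hand off to judge / voting step
    return conflicts
```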

Adaptive per‑model weighting

Context‑aware model weights wᵢ = f(domain, history, complexity). For example, Qwen gets +20% weight on math tasks; Claude gets +15% on risk and legal analysis.
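One way such a weighting function could look; the bonus table mirrors the examples above but the numbers are illustrative, and the complexity term is omitted for brevity:

```python
# Assumed domain bonuses reflecting the examples above (not tuned values).
DOMAIN_BONUS = {
    ("qwen", "math"): 0.20,     # +20% weight on math tasks
    ("claude", "legal"): 0.15,  # +15% on risk and legal analysis
}

def model_weight(model: str, domain: str, historical_accuracy: float) -> float:
    """wᵢ = f(domain, history): rolling accuracy scaled by a domain bonus."""
    return historical_accuracy * (1.0 + DOMAIN_BONUS.get((model, domain), 0.0))
```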

Full reasoning trace

For every final answer MACO stores the full reasoning trace: decomposition → iterations → criteria → evaluations → conflicts → final justification. Built‑in auditability.
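One possible shape for such a trace record (field names are illustrative, not MACO’s actual schema):

```python
from dataclasses import dataclass

@dataclass
class ReasoningTrace:
    query: str                                 # original query Q
    subtasks: list[str]                        # decomposition T
    iterations: list[list[str]]                # answer sets S(0..k)
    criteria: list[str]                        # orthogonalized criteria C*
    evaluations: dict[str, dict[str, float]]   # model -> criterion -> score
    conflicts: list[tuple[int, int, float]]    # detected contradictions
    final_answer: str                          # consensus answer s*
    justification: str                         # final reasoning summary
```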

Cost‑aware tiered routing

Tiered stack: Tier‑1 cheap models for screening, Tier‑2 balanced models for refinement, Tier‑3 premium models only for final consensus. Early stopping when confidence is high.
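A sketch of cheap‑first escalation with early stopping; the tier assignments, `run_tier`, and `confidence` are assumptions for illustration:

```python
TIERS = {
    1: ["qwen-turbo", "deepseek-v3"],        # screening / bulk filtering
    2: ["gpt-4o", "claude-3-5-sonnet"],      # balanced refinement
    3: ["gpt-4", "claude-3-opus"],           # premium, final consensus only
}
CONFIDENCE_STOP = 0.95  # assumed early-stopping threshold

def route(task, run_tier, confidence):
    answer = None
    for tier in (1, 2, 3):
        answer = run_tier(TIERS[tier], task, previous=answer)
        if confidence(answer) >= CONFIDENCE_STOP:
            break  # confident enough: skip the pricier tiers
    return answer
```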

Orthogonal quality criteria

Factor analysis and clustering of evaluation criteria to remove duplicates and expose independent dimensions like Accuracy, Reasoning Depth, and Risk Coverage.
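A minimal clustering take on the de‑duplication step (the PCA/factor‑analysis variant is not shown); `embed` is an assumed embedding client and the distance cutoff is a placeholder:

```python
import numpy as np
from sklearn.cluster import AgglomerativeClustering  # scikit-learn >= 1.2

def orthogonalize(criteria: list[str], embed) -> list[str]:
    """Cluster near-duplicate criteria and keep one representative each."""
    X = np.array([embed(c) for c in criteria])
    labels = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=0.3,   # placeholder cosine-distance cutoff
        metric="cosine",
        linkage="average",
    ).fit_predict(X)
    reps: dict[int, str] = {}
    for criterion, label in zip(criteria, labels):
        reps.setdefault(label, criterion)  # first criterion per cluster wins
    return list(reps.values())
```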

Real‑world metrics

Backed by academic research and production‑grade systems

95–98%
Task accuracy

vs 70–85% for a single LLM

150×
Token cost reduction

$0.0002 vs $0.03 per request

2–5s
End‑to‑end latency

Parallel fan‑out + consensus

99.2%
Convergence rate

Within 2–3 refinement cycles

How MACO compares to common baselines

  • Single LLM (GPT‑4): 70%
  • Naive multi‑agent (AutoGPT‑style): 78%
  • Simple voting, no refinement: 85%
  • MACO (recursive consensus): 97%

Scientific grounding

arXiv:2509.23537

Multi‑agent orchestration outperforms single‑LLM setups by 15–27% (2025).

arXiv:2512.20184

“Reaching Agreement Among LLM Agents” reports +22% accuracy with structured consensus.

arXiv:2511.10650

Unsupervised cycle & contradiction detection achieves F1≈0.72 on agentic workflows.

arXiv:2506.04565

“Compound AI Systems” defines the architecture pattern MACO builds upon.

MACO architecture

Nine stages from raw query to audited answer

High‑level flow

1

Task decomposition

Q → T = {t₁, t₂, ..., tₖ} — break down a complex query into smaller, mostly independent subtasks.

2

Parallel solution generation

S⁽⁰⁾ = {Mᵢ(tⱼ)} — all models process all subtasks in parallel to produce the initial solution set.
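A minimal asyncio fan‑out for this stage; `call` stands in for a real provider client:

```python
import asyncio

async def generate_initial_solutions(models, subtasks, call):
    """S(0): run every (model, subtask) pair concurrently."""
    pairs = [(m, t) for m in models for t in subtasks]
    answers = await asyncio.gather(*(call(m, t) for m, t in pairs))
    return dict(zip(pairs, answers))  # (model, subtask) -> first-pass answer
```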

3

Recursive refinement (k iterations)

S⁽ᵏ⁺¹⁾ = refine(S⁽ᵏ⁾, {answers from peers}) until ∥S⁽ᵏ⁺¹⁾ − S⁽ᵏ⁾∥ < ε or k ≥ k_max.

4

Contradiction detection

If δ(sᵢ, sⱼ) > θ, conflicts are detected and resolved via additional judging rounds.

5

Criterion generation & orthogonalization

C* = PCA/cluster(∪Cᵢ) — independent quality axes instead of ad‑hoc criteria lists.

6

Consensus ranking of criteria

Borda‑style aggregation over ranked criteria from all models to get a shared importance ordering.
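A classic Borda count over the models’ ranked criteria lists, shown here as a minimal sketch:

```python
from collections import defaultdict

def borda_order(rankings: list[list[str]]) -> list[str]:
    """Each criterion ranked r-th of n earns n-1-r points; sort by total."""
    scores: dict[str, int] = defaultdict(int)
    for ranking in rankings:
        n = len(ranking)
        for position, criterion in enumerate(ranking):
            scores[criterion] += n - 1 - position
    return sorted(scores, key=scores.__getitem__, reverse=True)

# e.g. borda_order([["accuracy", "depth", "risk"],
#                   ["depth", "accuracy", "risk"]])
# -> ["accuracy", "depth", "risk"] (accuracy and depth tie on points)
```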

7

Evaluation with adaptive weights

Per‑model weights wᵢ = f(domain, history, complexity) applied when scoring each candidate solution.

8

Final aggregation

s* = argmaxₛ Σᵢ,ⱼ wᵢⱼ · Eᵢⱼ(s) — weighted consensus vote across all models i and criteria j.
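Collapsing the criterion index for brevity, the final pick could be computed as below; `weights` and `scores` are assumed inputs, with criteria scores pre‑aggregated per model:

```python
def final_answer(candidates, weights, scores):
    """s* = argmax over candidates of Σᵢ wᵢ · Eᵢ(candidate)."""
    def total(candidate):
        return sum(weights[m] * scores[m][candidate] for m in weights)
    return max(candidates, key=total)
```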

9

Trace construction

Full Trace: T → S⁽⁰..ᵏ⁾ → C* → conflicts → scores → s* — all persisted for audit and debugging.

Tech stack

  • Python 3.11+ with asyncio for parallel orchestration.
  • PostgreSQL + pgvector for embeddings and similarity queries.
  • Redis Cluster for caching and real‑time coordination.
  • ChromaDB / Qdrant for semantic search and retrieval.
  • Docker + Kubernetes for deployment and autoscaling.
  • Prometheus + Grafana for metrics, SLOs, and alerting.

Supported LLM providers

  • OpenAI: GPT‑4, GPT‑4 Turbo, GPT‑4o.
  • Anthropic: Claude 3.5 Sonnet, Opus.
  • Google: Gemini 2.0, Gemini Pro.
  • Alibaba: Qwen‑Max, Qwen‑Turbo.
  • DeepSeek: DeepSeek‑V3, DeepSeek‑Coder.
  • Open‑source: Llama 3, Mixtral, and more.

Roadmap

From research prototype to enterprise‑grade platform

Q1 2026 (in progress)

PoC & deep R&D

  • ✓ Initial orchestration engine and parallel executor.
  • ✓ Prototype consensus & contradiction modules.
  • → First experiments across 3 LLMs (GPT‑4, Claude, Qwen).
  • → Baseline accuracy & latency benchmarks.

Q2 2026 (planned)

Design partner pilots

  • 3–5 design partners in finance, software, and e‑commerce.
  • Production‑like contradiction detection rollout.
  • v1 of adaptive model weighting in real user flows.
  • First ROI case studies from pilot customers.

Q3 2026 (beta)

Public beta

  • Full MACO pipeline (all 9 stages) in production shape.
  • Orthogonalized criteria via PCA/cluster pipelines.
  • Web UI + stable REST/GraphQL API.
  • Open‑source core on GitHub, 5–10 paying customers.

Q4 2026 (production)

Enterprise launch

  • Enterprise SLAs (99.9% uptime, support windows, SLOs).
  • Observability stack with Prometheus/Grafana dashboards.
  • Kubernetes‑based autoscaling and blue‑green deployments.
  • Vertical solution packs (finance, DevOps, risk & compliance).
  • Public ROI case with detailed metrics.

2027+ (next horizons)

Scaling the platform

  • Reinforcement learning for online weight adaptation.
  • Full multimodal support (text + images, later audio/video).
  • Federated learning options for highly regulated data.
  • Domain‑specific configurations (healthcare, law, security).
  • Partner ecosystem and global integrator program.

Where MACO shines

High‑impact domains for multi‑model consensus

Financial analysis

Earnings analysis, valuation, portfolio recommendations, and risk modeling where every mistake is expensive.

Example: Analyze 40 companies in ~10 minutes instead of 3 analyst‑days, with a full reasoning trace.

Software engineering

Code review, test planning, migration strategies, and architecture decisions driven by multiple specialized “reviewer” models.

Example: Review 20 PRs in parallel with 5 perspectives (security, performance, style, logic, tests).

Legal & compliance

Contract review, due diligence, and regulatory checks where contradiction detection between clauses is critical.

Example: Validate 100 contracts for policy compliance in ~1 hour with flagged contradictions.

E‑commerce & CX

Support ticket triage, review analysis, content moderation, and personalization at marketplace scale.

Example: Process 1,000 customer reviews and surface root causes in ~5 minutes.

Scientific workflows

Literature reviews, hypothesis generation, and peer‑review support with transparent multi‑agent reasoning.

Example: Systematic review of 200 papers in ~2 hours with extracted key findings.

Cybersecurity

Log analysis, threat detection, and incident summarization with a focus on high recall and a low false‑negative rate.

Example: Analyze 10K security events in ~30 seconds and prioritize critical incidents.

Investment opportunity

The LLM orchestration market is projected to reach $2.5B by 2027 (CAGR ≈67%). MACO occupies a unique position: consensus‑grade quality at radically lower cost.

$500K
Seed round (Q2 2026)

R&D, pilots, initial 5‑person core team.

$3M
Series A (Q4 2026)

Scaling, GTM, enterprise sales and support.

3–5 yrs
Exit horizon

Strategic exit ($50M+ valuation) or path to IPO.

Why now?

✓ Clear pain in production

Enterprises have experimented with LLMs. Most now face accuracy, explainability, and cost ceilings — and are actively looking for compound AI solutions.

✓ Strong scientific foundation

At least four key 2025 papers show that multi‑agent reasoning and consensus outperform single‑model setups on complex benchmarks.

✓ Technological readiness

Mature LLM APIs, reliable cloud infra, and proven orchestration patterns make now the right time to productize consensus‑grade AI.

✓ Experienced founder

20+ years in software engineering and production AI systems, including CAIS‑style multi‑agent architectures.

Discuss investment

Team

Hands‑on LLM engineering and enterprise AI experience

Mikhail Deynekin

Founder & Chief Architect

20+ years in software engineering, LLM orchestration, and multi‑agent systems. Author of the CAIS and MACO architectures.

We’re hiring

Senior Backend Engineer (Python)

Asyncio, PostgreSQL, Redis, Docker, production APIs.

ML / AI Engineer

LLM fine‑tuning, embeddings, vector DBs, evaluation.

DevOps / Platform Engineer

Kubernetes, CI/CD, observability, cost optimization.

Product Manager

Enterprise B2B, AI products, roadmap and discovery.

Ready to explore MACO?

Reach out to talk about investments, partnerships, or pilot deployments in your organization.