Introduction: The Reliability-Cost Tradeoff
Single-agent systems are fast and cheap. Run a query, get an answer, move on. The problem: they're unreliable. A 10% error rate might be tolerable for email classification. It's catastrophic for trade execution.
The obvious fix: verification. Run multiple agents, cross-check results, add validation layers. This works—error rates drop—but costs explode. Three agents verifying every decision means 3x API bills and 3x latency. At scale, verification becomes the bottleneck.
Most production systems operate between these extremes. You can't afford to verify everything, and you can't afford not to verify anything. The question isn't "should we verify?" but "which decisions are worth verifying, and how much should we spend doing it?"
This is where constraints become useful. Without cost pressure, you'd triple-check every output. With it, you're forced to think strategically: structured validation for routine decisions, Actor-Critic for moderate-stakes work, triangulation for high-value calls. The constraint drives the design.
This piece covers:
- Independent work patterns that make verification affordable (parallel execution, task decomposition)
- Verification architectures from light (structured validation) to heavy (triangulation)
- Cost frameworks for calculating ROI on verification spend
- Practical patterns and a maturity ladder for getting started
The goal: reliable output within cost constraints. Not perfection—intelligent risk management.
Independent Work Patterns: Why They Matter for Cost Efficiency
Verification becomes expensive when agents wait on each other. Sequential validation—agent A finishes, agent B checks, agent C double-checks—compounds latency and burns budget on idle time. Independent work patterns break this bottleneck.
Parallel Execution
Multiple agents work simultaneously. Instead of scanning 10 market sectors sequentially (50 minutes at 5 min/sector), you fan out to 10 agents (5 minutes total). The API spend is the same 10 calls, just made concurrently, so you get 10x speed at the same total cost.
When it pays off: High-frequency decisions where time has value. Market data that's stale after 10 minutes. Customer queries where response time affects conversion. Content moderation where backlogs create risk.
Cost structure:
- Sequential: 10 tasks × 5 min × $1 = $10, 50 minutes
- Parallel: 10 tasks × $1 (simultaneously) = $10, 5 minutes
Same cost, 10x faster. The tradeoff: orchestration complexity (managing 10 concurrent tasks) and potential resource contention (rate limits, memory).
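A minimal fan-out sketch using Python's asyncio; `scan_sector` is a hypothetical stand-in for a real per-sector model call (the sleep simulates API latency):

```python
import asyncio

async def scan_sector(sector: str) -> dict:
    # Hypothetical per-sector scan; a real version would call your model API.
    await asyncio.sleep(0.01)  # stands in for a ~5-minute API call
    return {"sector": sector, "signal": "hold"}

async def scan_all(sectors: list[str]) -> list[dict]:
    # Fan out: all scans run concurrently, so wall-clock time is roughly
    # one scan's latency rather than the sum of all of them.
    return await asyncio.gather(*(scan_sector(s) for s in sectors))

results = asyncio.run(scan_all([f"sector-{i}" for i in range(10)]))
```

`asyncio.gather` preserves input order, which keeps downstream aggregation simple; rate limits and memory contention still apply, so a real deployment would cap concurrency with a semaphore.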
Task Decomposition
Break complex workflows into independent sub-tasks, each with its own verification budget. Not everything needs the same level of scrutiny.
Trading pipeline example:
- Scanner (light verification): Check 50 stocks for signals → structured validation only (schema, bounds)
- Analysis (medium verification): Deep-dive on 5 flagged stocks → Actor-Critic pattern
- Execution (heavy verification): Risk check before $50k trade → triangulation (3 agents must agree)
Total verification cost: $0.50 (scanner) + $5 (analysis) + $15 (execution) = $20.50
Compare to uniform verification (triangulation on everything): 50 stocks × $15 = $750. Task decomposition saves $729.50 by matching verification spend to decision value.
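The budget arithmetic above can be expressed directly. Stage names and unit costs are the illustrative figures from this pipeline, not universal constants:

```python
# Hypothetical (item_count, per-item verification cost) for each stage.
STAGE_COSTS = {
    "scanner": (50, 0.01),    # 50 stocks, structured validation at ~$0.01 each
    "analysis": (5, 1.00),    # 5 flagged stocks, Actor-Critic at ~$1 each
    "execution": (1, 15.00),  # 1 trade, triangulation at ~$15
}

def total_verification_cost(stages: dict) -> float:
    # Sum per-stage spend: matching cost to decision value beats a flat rate.
    return sum(count * unit for count, unit in stages.values())

total = total_verification_cost(STAGE_COSTS)  # 0.50 + 5.00 + 15.00 = 20.50
```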
Avoiding Sequential Bottlenecks
Traditional validation creates waterfalls: agent proposes → critic validates → executor acts. If the critic is slow or the executor is blocked, everything stalls.
Async validation pattern:
- Agent proposes decision
- Execution starts (low-risk actions)
- Validator checks in parallel
- If validation fails, rollback or escalate
This works for decisions with low rollback cost. Example: Document classification can be corrected cheaply if validation catches an error after initial classification. Trade execution cannot—async validation doesn't work there.
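A sketch of the async pattern, assuming a hypothetical classifier and validator; the key point is that the low-risk action proceeds while validation runs, with a cheap rollback if the check fails:

```python
import asyncio

applied: dict[str, str] = {}  # stands in for wherever labels get written

async def classify(doc: str) -> str:
    # Hypothetical classifier call.
    await asyncio.sleep(0.01)
    return "invoice"

async def validate(doc: str, label: str) -> bool:
    # Hypothetical validator; runs concurrently with the action below.
    await asyncio.sleep(0.02)
    return label in {"invoice", "receipt"}

async def process(doc: str) -> str:
    label = await classify(doc)
    check = asyncio.create_task(validate(doc, label))
    applied[doc] = label              # act immediately: rollback is cheap
    if not await check:
        del applied[doc]              # rollback on failed validation
        label = "needs-review"
    return label

result = asyncio.run(process("doc-1"))
```

This only makes sense when, as the text says, rollback cost is low; for irreversible actions the validator must gate execution instead.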
Verification Architectures: The Spectrum
Verification isn't binary. Between "no checking" and "full triangulation" lies a spectrum of architectures, each with different cost-reliability tradeoffs.
Structured Validation (Cheapest)
Rule-based checks: schema validation, range constraints, type checking, sanity tests. No LLM calls—just code.
What it catches:
- Format errors (malformed JSON, missing required fields)
- Constraint violations (negative prices, dates in the future, portfolio weights > 1.0)
- Obvious mistakes (temperature in Kelvin when Celsius expected)
What it misses:
- Logic errors (correct format, wrong calculation)
- Context errors (valid data, inappropriate for situation)
- Subtle mistakes (off-by-one, edge cases)
Cost: Negligible. Milliseconds of compute, no API calls.
ROI: Extremely high for well-defined outputs. Catches 60-80% of errors for <1% of verification cost.
Example: Financial calculation validation
def validate_portfolio(weights, prices, positions):
    # Compare the weight sum with a tolerance: exact float equality
    # fails on rounding (e.g. 0.1 + 0.2 != 0.3).
    assert abs(sum(weights) - 1.0) < 1e-9, "Weights must sum to 1.0"
    assert all(w >= 0 for w in weights), "Weights must be non-negative"
    assert all(p > 0 for p in prices), "Prices must be positive"
    assert len(weights) == len(positions), "Dimension mismatch"
    return True
Cost: ~0.1ms. Catches dimensional errors, impossible values, basic math mistakes. Doesn't catch: incorrect risk model, wrong asset correlations, flawed assumptions.
Pattern: Always start here. Structured validation is your first line of defense. It's fast, cheap, and catches the low-hanging fruit.
Actor-Critic Pattern (Medium Cost)
One agent proposes, another critiques. The proposer generates a solution; the critic evaluates it for errors, edge cases, and logical consistency.
Architecture:
Proposer (GPT-4o): "Execute buy order for AAPL, 100 shares, market order"
Critic (GPT-4o-mini): "Risk check: Position would exceed 10% portfolio allocation.
Recommend reduce to 50 shares."
Cost structure:
- Proposer: $5/call (complex reasoning, full context)
- Critic: $0.50/call (focused validation, smaller model)
- Total: $5.50 (vs $8+ for single careful agent)
What it catches:
- Logic errors (proposer's reasoning is flawed)
- Missed edge cases (overlooked failure modes)
- Inconsistencies (contradicts prior decisions or constraints)
ROI sweet spot: Decisions worth $100-$10k where full triangulation is overkill but single-agent is too risky.
Pattern: Use cheaper model for critic. It doesn't need to generate solutions, just spot problems. GPT-4o-mini or Claude Haiku work well for criticism.
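The control flow can be sketched as follows. `call_model` is a hypothetical stand-in for an LLM client, with canned responses in place of real completions; in production the proposer would be the larger model and the critic the cheaper one:

```python
def call_model(role: str, prompt: str) -> str:
    # Hypothetical LLM call; canned responses illustrate the shape.
    canned = {
        "proposer": "BUY AAPL 100 shares, market order",
        "critic": "REJECT: position would exceed 10% portfolio allocation",
    }
    return canned[role]

def actor_critic(task: str) -> tuple[str, bool]:
    # Proposer generates; critic only has to spot problems, so a
    # smaller, cheaper model suffices for the second call.
    proposal = call_model("proposer", task)
    review = call_model("critic", f"Check this proposal: {proposal}")
    approved = not review.startswith("REJECT")
    return proposal, approved

proposal, approved = actor_critic("size an AAPL order")
```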
Triangulation (Higher Cost, Higher Confidence)
Multiple agents solve the same problem independently. Compare results: agreement builds confidence, divergence signals uncertainty.
Architecture:
Agent A: Calculates portfolio VaR = $47,250
Agent B: Calculates portfolio VaR = $48,100
Agent C: Calculates portfolio VaR = $46,950
Convergence check: Max deviation = 2.4% → within tolerance, proceed
Divergence scenario: A=$47k, B=$53k, C=$48k → flag for human review
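The convergence check itself is cheap code, not another LLM call. A sketch, measuring deviation as the spread between the highest and lowest estimate relative to the lowest (which reproduces the 2.4% figure above); the 5% tolerance is an illustrative threshold, not a standard:

```python
def max_deviation(estimates: list[float]) -> float:
    # Spread between highest and lowest estimate, relative to the lowest.
    lo, hi = min(estimates), max(estimates)
    return (hi - lo) / lo

def converged(estimates: list[float], tolerance: float = 0.05) -> bool:
    # Agreement within tolerance -> proceed; otherwise escalate to human.
    return max_deviation(estimates) <= tolerance

# The VaR figures from the example above:
assert converged([47250.0, 48100.0, 46950.0])       # ~2.4% spread -> proceed
assert not converged([47000.0, 53000.0, 48000.0])   # ~12.8% spread -> escalate
```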
Cost structure:
- 3 agents × $5/call = $15
- Aggregation logic: negligible
- Total: $15 (vs $5 single agent, 3x cost)
What it catches:
- Model-specific biases (GPT-4o vs Claude vs Gemini handle edge cases differently)
- Reasoning errors (one agent makes a mistake, others don't)
- Hallucinations (one agent invents data, others use real numbers)
When worth it:
- High-stakes decisions ($10k+ impact)
- Scarce expertise domains (complex risk calculations, specialized analysis)
- Irreversible actions (regulatory filings, large trades, public communications)
Pattern: Don't triangulate everything. It's expensive. Use it for critical path decisions where error cost justifies 3x verification spend.
Cost Frameworks: When Verification Pays Off
Verification isn't free, and errors aren't free. The question is: which is more expensive?
The ROI Calculation
The core calculation is net benefit per decision:
Net_Benefit = (Error_Cost × Error_Rate_Without) - (Error_Cost × Error_Rate_With) - Verification_Cost
Positive net benefit: verify. You save more than you spend.
Negative net benefit: skip verification. Accepting errors is cheaper than preventing them.
(ROI proper is this net benefit divided by verification spend.)
Example: Trading signal verification
Without verification:
- Single agent: $5/decision
- Error rate: 10%
- Average trade value: $10,000
- Error cost: $500 (5% slippage on wrong trades)
- Expected error cost: 10% × $500 = $50/decision
With Actor-Critic verification:
- Proposer + critic: $6/decision ($5 + $1)
- Error rate: 2% (measured after 1,000 decisions)
- Expected error cost: 2% × $500 = $10/decision
- Net benefit: $50 - $10 - $1 = $39 saved per decision
At 100 decisions/day: $3,900/day net savings on $100/day verification spend. ROI: 3,900%
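The worked example maps directly onto a small helper, using the figures above ($500 error cost, 10% → 2% error rate, $1 marginal verification cost):

```python
def expected_savings(error_cost: float,
                     rate_without: float,
                     rate_with: float,
                     verification_cost: float) -> float:
    # Net benefit per decision: expected error cost avoided
    # minus the marginal verification spend.
    return error_cost * (rate_without - rate_with) - verification_cost

# Trading-signal numbers from the example above: ~$39 per decision.
per_decision = expected_savings(500, 0.10, 0.02, 1.0)
```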
Decision Value Tiers
Not all decisions deserve the same verification budget. Allocate spend based on impact.
Tier 1: Low-value decisions (< $100 impact)
- Examples: Email classification, document tagging, calendar scheduling, routine queries
- Error cost: Low (reclassification takes 30 seconds)
- Verification budget: Minimal ($0.01-0.10)
- Architecture: Structured validation only (schema, bounds, sanity checks)
Tier 2: Medium-value decisions ($100-$10k impact)
- Examples: Content moderation, risk flagging, preliminary analysis, customer support escalation
- Error cost: Moderate (wasted effort, customer frustration, missed issues)
- Verification budget: $0.50-5.00
- Architecture: Actor-Critic (lightweight validator catches most errors)
Tier 3: High-value decisions (> $10k impact)
- Examples: Trade execution, contract signing, regulatory reporting, strategic recommendations
- Error cost: High (financial loss, legal exposure, reputation damage)
- Verification budget: $10-100+
- Architecture: Triangulation (3-5 agents, escalate on divergence) or human-in-loop
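The tiers reduce to a routing function; the dollar thresholds are this article's illustrative cut-offs, not universal constants:

```python
def verification_plan(impact_usd: float) -> str:
    # Route by decision value, per the tiers above.
    if impact_usd < 100:
        return "structured"      # schema, bounds, sanity checks only
    if impact_usd <= 10_000:
        return "actor-critic"    # lightweight critic model
    return "triangulation"       # 3+ agents, escalate on divergence

assert verification_plan(30) == "structured"
assert verification_plan(2_500) == "actor-critic"
assert verification_plan(50_000) == "triangulation"
```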
Frequency Economics
Decision frequency affects verification affordability. High-frequency decisions need cheaper verification or costs spiral.
- 1000x/day decisions: Budget constraint: < $0.10/verification or daily cost exceeds $100. Pattern: Structured validation only, maybe Actor-Critic with mini models.
- 10x/day decisions: Budget: $5-10/verification = $50-100/day. Pattern: Actor-Critic comfortable, selective triangulation.
- 1x/week decisions: Budget: $50-100/verification = $200-400/month. Pattern: Full triangulation, extensive human review.
As frequency increases, shift verification architecture toward cheaper methods. At 1000x/day, you can't afford triangulation on everything—structured validation becomes your primary defense.
Practical Patterns & Anti-Patterns
Patterns That Work
✓ Verify critical path, trust supporting tasks
Trading pipeline: Heavy verification on execution (triangulation), light verification on data ingestion (structured). Spend budget where errors hurt most.
✓ Use cheaper models for criticism
Proposer: GPT-4o ($5). Critic: GPT-4o-mini ($0.50). The critic doesn't need full reasoning power—it's checking for obvious mistakes. Save 80% on verification cost without sacrificing much accuracy.
✓ Structured validation first
Before running expensive LLM verification, catch format errors and constraint violations with code. Structured validation catches 60-80% of errors for <1% of cost. It's your filter—only pass validated outputs to LLM critics.
✓ Escalate divergence, not every result
Triangulation: If 3 agents agree, proceed automatically. If they diverge, escalate to human. Human time is expensive—use it for uncertain cases, not routine agreement.
Anti-Patterns to Avoid
✗ Uniform verification
Treating $10 and $10,000 decisions the same. Wastes budget on over-verification (low-stakes) or under-verification (high-stakes). Tier your decisions and match verification spend to impact.
✗ Sequential verification gates
Waterfall bottlenecks: agent A proposes → agent B validates → agent C double-checks → agent D executes. Latency compounds, costs stack, throughput collapses. Parallelize where possible.
✗ Over-verification
Triple-checking email classification or document tagging. If error cost is $0.50 and verification costs $5, you're burning money. Structured validation is sufficient for low-stakes work.
✗ No measurement
Running verification without tracking error rates or costs. You can't optimize what you don't measure. Log errors, calculate costs, compute ROI. Adjust based on data, not assumptions.
Getting Started: Verification Maturity Ladder
You don't need to implement everything at once. Start simple, measure, expand based on ROI.
Stage 1: Single agent + structured validation
- Start here for low-stakes work
- Add schema checks, bounds validation, sanity tests
- Measure baseline error rates (what percentage of outputs have mistakes?)
- Cost: Negligible (no LLM verification yet)
- Time: 1-2 days to implement
Stage 2: Actor-Critic on high-value decisions
- Identify decisions worth $100+ in impact
- Add lightweight critic agent (GPT-4o-mini or Claude Haiku)
- Measure error rate improvement (before/after verification)
- Calculate ROI: error cost reduction vs verification spend
- Cost: $0.50-2.00/decision
- Time: 3-5 days to implement and calibrate
Stage 3: Selective triangulation
- For decisions worth $10k+: add triangulation (3-5 agents)
- Implement convergence checks and escalation paths
- Track divergence rates (how often do agents disagree?)
- Measure impact (error reduction, escalation quality)
- Cost: $10-50/decision
- Time: 1-2 weeks to build orchestration and aggregation logic
Stage 4: Adaptive verification
- Dynamic verification level based on agent confidence
- High confidence → structured validation only
- Medium confidence → Actor-Critic
- Low confidence → triangulation or human escalation
- Requires calibration: confidence scores must correlate with actual accuracy
- Cost: Variable ($0.10-50/decision based on confidence)
- Time: 2-4 weeks (requires data collection and model calibration)
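Stage 4's routing can be sketched as a confidence lookup. The thresholds below are hypothetical and only work if calibrated, meaning reported confidence must actually track measured accuracy on your own data:

```python
def choose_verification(confidence: float) -> str:
    # Hypothetical thresholds: calibrate against measured accuracy
    # before trusting them in production.
    if confidence >= 0.95:
        return "structured"
    if confidence >= 0.80:
        return "actor-critic"
    return "triangulation-or-human"

assert choose_verification(0.98) == "structured"
assert choose_verification(0.85) == "actor-critic"
assert choose_verification(0.60) == "triangulation-or-human"
```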
Most production systems operate at Stage 2-3. Stage 4 (adaptive) is advanced—useful for high-volume systems where verification cost compounds quickly.
Conclusion: Constraints Drive Design
Verification isn't about achieving perfection. It's about intelligently reducing error rates within cost constraints.
Single agents are fast and cheap but unreliable. Full verification is reliable but slow and expensive. The art is operating in the middle: structured validation for routine work, Actor-Critic for moderate stakes, triangulation for critical decisions.
Key takeaways:
- Independent work patterns (parallel execution, task decomposition) make verification affordable by eliminating sequential bottlenecks
- Verification architectures span a spectrum—match architecture to decision value
- ROI calculation is straightforward: error cost reduction minus verification cost
- Decision value tiers guide budget allocation (< $100 → structured, $100-10k → Actor-Critic, > $10k → triangulation)
- Frequency economics constrain per-decision cost (1000x/day can't afford $5/verification)
The constraint isn't the enemy. It's the forcing function that drives good design. Without cost pressure, you'd verify everything and burn budget. With it, you're forced to think: Which errors actually matter? Where should I spend verification budget? What's the cheapest way to catch 80% of mistakes?
Next steps:
- Measure baseline error rates. Run your single-agent system, log outputs, manually review a sample. What percentage has mistakes?
- Calculate error costs. What's the impact of a wrong decision? Wasted time? Financial loss? Customer churn?
- Design verification architecture. Map decisions to value tiers. Choose verification patterns (structured, Actor-Critic, triangulation) based on ROI.
- Track improvements. Measure error rates after verification. Compute ROI: error cost reduction vs verification spend.
- Iterate. Adjust verification levels based on data. Over-verifying low-stakes work? Scale back. Under-verifying high-stakes decisions? Add triangulation.
Production reliability comes from intelligent tradeoffs, not unlimited budget. Verification that scales is verification that pays for itself.