Jugalbandi Protocol: What Happens When You Force AI Agents to Argue
In a previous post, I argued that multi-agent systems are panels, not bands. They run agents independently and merge results — missing the genuine interaction where agents shape each other's responses. I proposed jugalbandi — the Hindustani classical music duet — as an architectural model: shared constraints, call-and-response, periodic convergence, engineered closure.
That was the argument. This year I'm holding myself to a higher standard — if I'm going to make structural claims about how agents should work, I should build the thing and measure it. This is the first experiment.
Every multi-agent framework has the same pitch: split the work, let agents collaborate, get better results. CrewAI, AutoGen, Squad — they all let you define roles and wire them together. But I keep running into the same failure mode: the agents agree with each other. Not because they're right. Because the system never forces them to disagree.
When you ask an LLM to plan something, it makes decisions. "Build a login page" becomes "React + JWT + PostgreSQL" in the first paragraph. Nobody asked for those choices. The model filled the envelope — it took an ambiguous task and silently narrowed it into a specific implementation. Ask it to self-critique and it'll find surface-level issues, then resolve them by making more silent choices. The critique is real but the resolution is still one mind talking to itself.
I wanted to test whether structure — not just more compute — could fix this. Specifically: does forcing assumption declaration and adversarial challenge produce measurably different outputs than single-pass execution? Or is multi-agent dialectics just expensive self-reflection?
The Experiment
I built a protocol called Jugalbandi, named for the duet form described above. Three roles, one model (Claude Sonnet 4), isolated contexts:
The Proposer receives the task and produces an implementation plan. Crucially, it must declare every assumption — every technology choice, scope decision, or default behavior that wasn't explicitly required.
The Challenger receives only the Proposer's output (never the original task or the Proposer's system prompt). Its job is adversarial review. It must produce at least three challenges, each tagged as [STRUCTURAL], [ASSUMPTION], or [MISSING]. It cannot approve anything.
The Resolver receives both the proposal and the challenges (but neither system prompt). For every challenge, it must state a disposition: accepted with a specific revision, rejected with justification, or escalated as needing human input.
The key design choice is context isolation. The Challenger never sees the Proposer's instructions. The Resolver sees both outputs but neither set of instructions. This isn't persona roleplay — it's structural separation. Each role operates with different information, which constrains the kind of reasoning it can do.
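The loop above can be sketched in a few lines. This is a minimal illustration, not the repo's actual code: `complete` is a stand-in for any chat-completion client, and the role prompts are abridged.

```typescript
// Minimal sketch of the three-role loop. In the real protocol each call is a
// separate API request, and a role's system prompt is the ONLY instruction
// that role ever sees.
type Role = { system: string };

const proposer: Role = { system: "Plan the task and declare every assumption you make" };
const challenger: Role = { system: "Raise at least three challenges tagged [STRUCTURAL], [ASSUMPTION], or [MISSING]; never approve" };
const resolver: Role = { system: "Disposition each challenge: accept with a revision, reject with justification, or escalate to a human" };

function complete(role: Role, input: string): string {
  // Stub: a real implementation sends role.system as the system prompt and
  // input as the user message, and returns the model's reply.
  return `[${role.system.slice(0, 12)}...] ${input.slice(0, 40)}`;
}

function runProtocol(task: string) {
  const proposal = complete(proposer, task);          // sees the task only
  const challenges = complete(challenger, proposal);  // sees the proposal only, not the task
  const resolution = complete(resolver,               // sees both outputs, neither system prompt
    `PROPOSAL:\n${proposal}\n\nCHALLENGES:\n${challenges}`);
  return { proposal, challenges, resolution };
}
```

The isolation lives in the data flow, not the prompts: the Challenger's input is the proposal string and nothing else, so there is no instruction for it to align with.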
I ran five tasks through three paths to compare:
- Single-pass: One model call — plan, list assumptions, self-critique, revise.
- Council: Three independent single-pass calls run in parallel — same compute budget as Jugalbandi, no structural constraints.
- Jugalbandi: The three-role loop described above.
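To make the compute-budget comparison concrete, here is roughly what the council path looks like relative to single-pass. This is a stubbed illustration with invented outputs, not the actual baseline code; the stub is deterministic, which exaggerates the council's real weakness: independent runs of the same model on the same task tend to surface overlapping findings.

```typescript
// Stubbed single-pass and council paths. The council is three independent
// single-pass calls merged by deduplication: the same call budget as the
// three-role loop, with no structural constraints between calls.
type PlanResult = { assumptions: string[]; critiques: string[] };

function singlePass(task: string): PlanResult {
  // One call: plan, list assumptions, self-critique, revise (all stubbed here).
  return {
    assumptions: [`framework choice for: ${task}`, `auth strategy for: ${task}`],
    critiques: ["no error-handling story", "scope of 'done' is undefined"],
  };
}

function council(task: string): PlanResult {
  const runs = [singlePass(task), singlePass(task), singlePass(task)];
  const dedupe = (xs: string[]) => [...new Set(xs)];
  return {
    assumptions: dedupe(runs.flatMap(r => r.assumptions)),
    critiques: dedupe(runs.flatMap(r => r.critiques)),
  };
}
```

With a real model the three runs are not identical, but they overlap heavily, which is why the council's raw critique count overstates its coverage.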
The tasks were chosen for genuine ambiguity — where the envelope will get filled: building a login page, CI/CD for a Python monorepo, multi-tenant SaaS data modeling, real-time notifications on a REST API, and zero-downtime PostgreSQL migration.
Results
Here's what came back across all five tasks:
| Metric | Single-Pass | Council (3x) | Jugalbandi |
|---|---|---|---|
| Assumptions surfaced (avg) | 10.2 | 14.6 | 25.2 |
| Critiques/challenges (avg) | 4.8 | 12.8 | 7.2 |
| Escalations (total) | 0 | 0 | 3 |
Three observations.
The Proposer tries harder when it knows it'll be challenged. Jugalbandi surfaced 25.2 assumptions on average versus the council's 14.6 — despite both using the same model and the same total number of API calls. The council's three independent passes found overlapping assumptions. The Proposer, structurally aware that a Challenger exists, front-loaded its assumption declarations. The structure changed the model's behavior.
This is the same dynamic that makes a real jugalbandi transcendent. In Hindustani classical music, a jugalbandi isn't just two musicians taking turns — it's the anticipation of the response that elevates the performance. A sitarist plays differently when they know a sarangi player will answer. They declare more, risk more, leave less implicit — because they know the silence will be filled by someone else's interpretation, not their own. The Proposer, like the first musician, performed at a higher level not because of a better prompt, but because the structure demanded it.
The council finds more critiques but they're shallow. The council produced 12.8 critiques per task versus Jugalbandi's 7.2. But the council's critiques are three copies of self-critique — they overlap heavily and self-resolve immediately. Jugalbandi's challenges are tagged by type (structural, assumption, missing) and go through a separate Resolver that must disposition each one. When I looked at the raw outputs, the council critiques tended to identify an issue and fix it in the same breath. The Challenger identifies issues; it doesn't get to fix them. That separation matters.
Only Jugalbandi produced escalations. Across all five tasks, neither single-pass nor council ever said "I can't decide this — ask a human." They always filled the envelope. Jugalbandi produced three escalations across two tasks:
- CI/CD: The Resolver escalated monitoring tooling choice (which observability stack?) and business continuity requirements (RTO/RPO for the pipeline itself). The single-pass baseline mentioned monitoring as a flaw but silently picked "add build time metrics" without asking which stack the organization uses.
- Multi-tenant SaaS: The Resolver escalated data residency and compliance requirements. The single-pass baseline assumed "single region deployment initially" and "no HIPAA/SOX requirements" — decisions it had no authority to make.
This is the clearest signal. The single-pass model and the council both noticed these gaps. But they noticed them and immediately resolved them by assuming. The Jugalbandi Resolver, receiving challenges from a structurally separated Challenger, correctly identified that these were decisions requiring organizational context and refused to assume.
The Structural Claim
The value isn't in the personas. Telling a model "you are an adversarial reviewer" doesn't reliably produce adversarial review — coherence bias pulls it back toward agreement. The value is in the context isolation. The Challenger literally cannot see the Proposer's instructions, so it can't align with them. The Resolver sees both outputs but neither set of instructions, so it evaluates the arguments on their merits.
This maps to something I've been thinking about with agent architectures: reads and writes should be governed differently. Reading data, summarizing documents, answering questions — these are low-risk operations where a single pass is fine. But when an agent is making decisions that narrow the solution space — choosing a technology, scoping a requirement, setting a default — that's a write operation. Writes deserve the dialectical gate. Not every task needs the three-role loop, but every task that fills an envelope should go through one.
What I'd Build Next
The obvious next step is calibration: which tasks actually need the Jugalbandi loop and which are fine with single-pass? The login page task surfaced 38 assumptions — clearly high envelope. But a task like "add a health check endpoint" probably doesn't need adversarial review. The protocol needs a lightweight classifier that routes tasks to the appropriate level of scrutiny. The latency and cost tradeoff is real — three sequential model calls aren't free — but for write-path decisions, the cost of a wrong assumption is higher than the cost of the loop.
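A first cut at that classifier could be as crude as a keyword heuristic. Everything here is invented for illustration — the word list, the routing rule, and the path names are my assumptions, not anything measured in the experiment; a real router might instead use a cheap model call to score how many unscoped decisions a task implies.

```typescript
// Hypothetical envelope classifier: route tasks that imply unscoped design
// decisions through the full loop, and keep narrow tasks on the cheap path.
type Path = "single-pass" | "jugalbandi";

// Words that tend to signal open-ended design work (illustrative list).
const OPEN_ENDED = ["build", "design", "migrate", "model", "architect", "real-time"];

function routeTask(task: string): Path {
  const t = task.toLowerCase();
  const signals = OPEN_ENDED.filter(w => t.includes(w)).length;
  // Err toward the loop: a wrong assumption costs more than the extra calls.
  return signals > 0 ? "jugalbandi" : "single-pass";
}
```

The asymmetry is deliberate: given the cost argument above, false positives (running the loop unnecessarily) are cheaper than false negatives (letting a high-envelope task fill itself in silently).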
The code is ~200 lines of TypeScript. The protocol is deliberately simple because the thesis is about structure, not sophistication. If you're interested in the code, here is the GitHub link.
Update: I Ran the Ablation That Could Have Killed My Claim
After publishing, I got a critique I couldn't dismiss: the escalation result — my strongest evidence — had a confound. The Resolver's system prompt includes an explicit escalation option. The single-pass baseline's prompt doesn't. Of course the baseline never escalated. It was never given the behavioral instruction to do so.
The question wasn't whether the critique was right — it obviously was. The question was whether fixing the confound would kill the structural claim.
Methodology
To isolate the variable, I changed one thing: the single-pass baseline prompt. The original asked the model to self-critique and revise. The updated version adds the same dispositional language as the Resolver:
> For each critique, state whether you accepted it (with revision), rejected it (with justification), or escalated it (needs human input — you don't have enough context to decide). Do not default to resolving every critique yourself.
Everything else held constant — same model (Claude Sonnet 4), same five tasks, same council and Jugalbandi prompts. The exact prompt diff is in the repo.
What Happened
| Task | Baseline Escalations (updated) | Jugalbandi Escalations |
|---|---|---|
| 1. Login | 1 | 0 |
| 2. CI/CD | 1 | 0 |
| 3. Multi-tenant | 2 | 0 |
| 4. Notifications | 2 | 0 |
| 5. Migration | 3 | 0 |
| Total | 9 | 0 |
The original escalation claim is dead. When given the same behavioral instruction, the single-pass baseline escalated more than Jugalbandi ever did — 9 times across 5 tasks versus the original run's 3. The Jugalbandi Resolver, in the same re-run, escalated zero times.
What This Actually Means
The single-pass model, critiquing itself in isolation, is uncertain. It sees gaps — monitoring tooling, data residency, deployment strategy — but can't resolve them with confidence because it's arguing with itself. So when you give it permission to escalate, it takes the exit.
The Jugalbandi Resolver operates differently. By the time it receives the Challenger's output, the uncertainty has been made concrete. "You assumed single-region deployment, but data residency requirements weren't specified" is a specific enough challenge to accept or reject on its merits. The adversarial framing makes the decision tractable. The Resolver doesn't escalate because it doesn't need to — the Challenger already did the work of turning vague doubt into an actionable question.
What Survived
The assumption surfacing gap held. Updated numbers across all five tasks:
| Metric | Single-Pass | Council (3x) | Jugalbandi |
|---|---|---|---|
| Assumptions surfaced (avg) | 9.4 | 14.2 | 25.4 |
That's 2.7x single-pass and 1.8x council. This result doesn't depend on prompting asymmetry — it's the structural effect of anticipating challenge.
Honest Assessment
This is preliminary evidence on five tasks, all selected for high ambiguity, using a single model family. The assumption surfacing effect is consistent and survives the ablation. The escalation story turned out to be about prompting, not structure. The qualitative depth claim — that Jugalbandi's challenges are more actionable than the council's self-critiques — remains my subjective judgment and would need independent evaluation to harden.
What I got wrong: escalation isn't the signal. What I got right: context isolation produces qualitatively different reasoning. The Challenger doesn't just find more issues — it frames them in a way that makes the Resolver's job tractable. That's the jugalbandi. Not the number of escalations, but the quality of the dialogue.