When Structured Output Beats Orchestration — And When It Doesn't
We ran an honest experiment comparing ObjectWeaver against the Gemini Structured Output API for knowledge graph extraction from interview transcripts. The structured API won. This post explains why — and where orchestration actually earns its keep.
The Experiment
The task: extract a typed entity-relationship graph from legal market intelligence interviews — lawyer movements, client relationships, case work, practice area expertise. 7 entity types, 10 relationship types. Same prompts, same ontology, same ~2,000-word synthetic interview split into 7 chunks, same model (Gemini 2.5 Flash Lite).
ObjectWeaver: entities extracted as an array field, then 10 relationship arrays with selectFields injecting entity data and processingOrder enforcing sequencing. Each field and array item is a separate LLM call — roughly 47 calls per chunk.
Structured API: one JSON Schema covering the full entity + relationship structure. One call per chunk — 7 total.
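For concreteness, here is a minimal sketch of what the single-call schema could look like. The field names and enum values below are illustrative placeholders, not the actual 7-entity-type, 10-relationship-type ontology used in the experiment:

```python
# Hypothetical sketch of a single JSON Schema for graph extraction.
# Entity and relationship type names are placeholders, not the real ontology.
GRAPH_SCHEMA = {
    "type": "object",
    "properties": {
        "entities": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "name": {"type": "string"},
                    # Illustrative subset; the real ontology has 7 entity types.
                    "type": {"type": "string", "enum": ["lawyer", "firm", "client"]},
                },
                "required": ["name", "type"],
            },
        },
        "relationships": {
            "type": "array",
            "items": {
                "type": "object",
                "properties": {
                    "source": {"type": "string"},
                    "target": {"type": "string"},
                    # Illustrative subset; the real ontology has 10 relationship types.
                    "type": {"type": "string", "enum": ["moved_to", "represents"]},
                },
                "required": ["source", "target", "type"],
            },
        },
    },
    "required": ["entities", "relationships"],
}
```

The key property is that entities and relationships live in one schema, so the model emits both in a single constrained response.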
Results
| Metric | ObjectWeaver | Gemini Structured API |
|---|---|---|
| Entities | 95 | 116 |
| Relationships | 111 | 187 |
| Referential integrity | ~75% | 100% |
| Duration | 28s | 40s |
| API calls | ~329 (47 × 7 chunks) | 7 |
| Estimated cost | ~$0.01–0.02 | ~$0.01 |
The structured API produced 68% more relationships with perfect referential integrity. ObjectWeaver dropped roughly 25% of its relationships during postprocessing — the LLM generated entity names that didn't match the extracted list.
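Referential integrity here means: a relationship counts as valid only if both its source and target exactly name an extracted entity. A minimal sketch of the check (the function name is ours, not part of either API):

```python
def referential_integrity(entities, relationships):
    """Fraction of relationships whose source AND target both name
    an extracted entity."""
    names = {e["name"] for e in entities}
    if not relationships:
        return 1.0
    ok = sum(1 for r in relationships
             if r["source"] in names and r["target"] in names)
    return ok / len(relationships)

entities = [{"name": "Freshfields"}, {"name": "Jane Doe"}]
relationships = [
    {"source": "Jane Doe", "target": "Freshfields"},                     # valid
    {"source": "Jane Doe", "target": "Freshfields Bruckhaus Deringer"},  # name variant: dropped
]
print(referential_integrity(entities, relationships))  # 0.5
```

This is the failure mode behind ObjectWeaver's ~75% score: the generated name is a plausible variant of a real entity, but not an exact match.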
Why the Structured API Won
Cross-field coherence. The structured API generates entities and relationships in one response. The model holds the full entity list in working memory while writing relationships. ObjectWeaver processes each field as a separate call — relationship fields receive entity names via selectFields, but each source/target is extracted independently, losing holistic context.
Schema-level enforcement. The Gemini Structured API constrains output at the token level — the model can't produce malformed JSON or invalid enum values. ObjectWeaver relies on prompt instructions ("ONLY use exact entity names"), which the model frequently ignores, with fuzzy matching in postprocessing to recover what it can.
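The recovery step can be sketched with standard-library fuzzy matching. This is an illustration of the approach, not ObjectWeaver's actual postprocessing code:

```python
import difflib

def recover_name(candidate, entity_names, cutoff=0.8):
    """Postprocessing fallback: map a model-generated entity name onto
    the closest extracted name, or return None (relationship dropped).
    Sketch only; the real postprocessing may differ."""
    matches = difflib.get_close_matches(candidate, entity_names, n=1, cutoff=cutoff)
    return matches[0] if matches else None

names = ["Freshfields Bruckhaus Deringer", "Clifford Chance"]
print(recover_name("Freshfields Bruckhaus Dering", names))  # close variant recovered
print(recover_name("Baker McKenzie", names))                # no match: None, relationship dropped
```

Fuzzy matching recovers near-misses, but a model that free-generates names will also produce strings below any reasonable cutoff, and those relationships are lost. Token-level schema enforcement never generates them in the first place.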
Natural deduplication. In a single structured call, the model sees all entities it has already written. ObjectWeaver's entity array generates each item separately — the model can't see prior extractions, producing variants like Freshfields and Freshfields Bruckhaus Deringer as distinct entries.
Token efficiency. 7 API calls vs. 329. Each ObjectWeaver call resends the system prompt, chunk text, and injected context. More tokens, worse output.
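Back-of-envelope input-token arithmetic makes the gap concrete. The per-call token counts below are illustrative assumptions, not measured values from the experiment:

```python
# Illustrative numbers only: assumed prompt and chunk sizes.
prompt_tokens = 300     # system prompt + instructions, resent on every call
chunk_tokens = 400      # ~2,000 words split into 7 chunks, resent on every call
calls_per_chunk = 47
chunks = 7

orchestrated = chunks * calls_per_chunk * (prompt_tokens + chunk_tokens)
single_pass = chunks * (prompt_tokens + chunk_tokens)
print(orchestrated, single_pass, orchestrated // single_pass)  # 230300 4900 47
```

Whatever the exact per-call sizes, the input-token bill scales with the call count, so the orchestrated run pays roughly 47x the shared-context overhead.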
What Orchestration Is Actually For
This result doesn't mean orchestration is pointless — it means graph extraction is the wrong benchmark for it. Tight cross-field coherence is exactly what single-pass structured output is built for.
ObjectWeaver's field-level decomposition is designed for different problems:
Exceeding output token limits. A single structured API call is capped by the model's maximum output tokens. When extracting hundreds of entities from a long document, the response truncates. ObjectWeaver's decomposition means each item is a separate call — total output size is unbounded.
Mixed model requirements. A simple classification field and a nuanced legal analysis field shouldn't use the same model at the same temperature. ObjectWeaver routes each field independently.
Progressive multi-step reasoning. When later fields genuinely depend on earlier analysis — "given the extracted entities, assess their litigation risk" — processingOrder + selectFields creates structured chain-of-thought with inspectable intermediate results.
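An ObjectWeaver-style configuration for that pattern might look like the sketch below. selectFields and processingOrder are the features named above, but the exact config shape is our guess, not documented API:

```python
# Illustrative config shape only; selectFields / processingOrder are real
# ObjectWeaver concepts, the surrounding structure is assumed.
schema_config = {
    "entities": {"processingOrder": 1},
    "litigation_risk": {
        "processingOrder": 2,
        # Inject the already-extracted entities into this field's prompt:
        "selectFields": ["entities"],
        "prompt": "Given the extracted entities, assess their litigation risk.",
    },
}

# Fields are processed in ascending processingOrder, so the risk field
# always sees the finished entity list.
order = sorted(schema_config, key=lambda f: schema_config[f]["processingOrder"])
print(order)  # ['entities', 'litigation_risk']
```

The payoff is inspectability: the intermediate entity list is a first-class result you can log, validate, or correct before the dependent field runs.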
Heterogeneous schemas. Independent subtrees (summary, sentiment, key quotes, action items) can process concurrently with different configurations. A single structured call forces everything through one pass.
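The concurrency win can be sketched with stubbed model calls. The subtree names come from the example above; the model names and the run_subtree helper are hypothetical:

```python
import asyncio

async def run_subtree(name, model):
    """Stand-in for an LLM call; each subtree gets its own model config."""
    await asyncio.sleep(0)  # placeholder for network latency
    return name, f"result from {model}"

async def analyse():
    # Independent subtrees fan out concurrently rather than serialising
    # through one monolithic structured call.
    tasks = [
        run_subtree("summary", "expensive-model"),
        run_subtree("sentiment", "cheap-model"),
        run_subtree("key_quotes", "cheap-model"),
        run_subtree("action_items", "cheap-model"),
    ]
    return dict(await asyncio.gather(*tasks))

results = asyncio.run(analyse())
print(sorted(results))  # ['action_items', 'key_quotes', 'sentiment', 'summary']
```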
The Honest Position
For extracting a coherent graph from a document that fits in a single context window: use structured output. Simpler, cheaper, better results.
For schemas where different parts need different models, output exceeds token limits, fields need progressive refinement, or subtrees are genuinely independent — that's where orchestration earns its keep.
We're running a follow-up experiment designed to test those conditions directly.
Next Experiment: Heterogeneous Document Analysis
To test where orchestration actually outperforms single-pass structured output, we're designing an experiment with a schema that plays to orchestration's strengths:
- Large, heterogeneous output — thousands of tokens across diverse field types, past single-call limits
- Mixed complexity — some fields need simple classification (cheap model), others need nuanced reasoning (expensive model)
- Independent subtrees — fields that don't reference each other, enabling parallel processing
- Progressive refinement — later fields that depend on earlier output, testing selectFields + processingOrder chains
The task: comprehensive analysis of a long-form legal brief (~5,000–10,000 words). The schema:
- Metadata extraction (simple, cheap): document type, jurisdiction, date references, parties
- Section summaries (medium): key arguments, cited precedents, statutory references
- Risk assessment (complex, expensive): argument strength, likelihood of success, strategic vulnerabilities — using metadata and summaries as context via selectFields
- Comparable case analysis (complex, depends on risk assessment): similar cases, predicted outcomes
- Executive summary (medium, depends on all above): synthesised structured briefing
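The dependency structure of that schema determines the processing order. A small sketch, using snake_case paraphrases of the field names above, shows which fields can run in parallel and which must wait:

```python
# Dependencies as stated in the schema above (field names paraphrased).
deps = {
    "metadata": [],
    "section_summaries": [],
    "risk_assessment": ["metadata", "section_summaries"],
    "comparable_cases": ["risk_assessment"],
    "executive_summary": ["metadata", "section_summaries",
                          "risk_assessment", "comparable_cases"],
}

def topo(deps):
    """Naive topological order: a field runs once all its deps are done."""
    order, done = [], set()
    while len(done) < len(deps):
        for field, needs in deps.items():
            if field not in done and all(n in done for n in needs):
                order.append(field)
                done.add(field)
    return order

print(topo(deps))
```

metadata and section_summaries share no dependencies, so they can run concurrently on a cheap model while the expensive risk and comparison fields wait for their inputs.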
We'll measure the same metrics — accuracy, referential integrity between dependent fields, cost, latency, and output completeness — and report the results honestly.
Stay tuned.
