Where Orchestration Wins: Document Analysis Benchmarks and the Right Tool for the Job
In our previous post, structured output beat ObjectWeaver on knowledge graph extraction. We promised a follow-up testing where orchestration actually shines. Here are the results.
The Hypothesis
ObjectWeaver should outperform single-pass structured output when the task has independent subtrees, progressive refinement, rich per-field output, and heterogeneous complexity. We designed a document analysis benchmark to test this directly.
Experiment Design
The document: a fictional 15,000-character legal brief — a motion for summary judgment in a $3.1 billion failed-merger case. The document contains parties, dates, financial transactions, legal citations, four distinct legal arguments, and supporting evidence.
The schema: Eight top-level fields across four complexity tiers:
| Tier | Fields | Complexity | Dependencies |
|---|---|---|---|
| 1 — Simple extraction | metadata, parties, key_dates | Low — pattern matching | None (independent) |
| 2 — Medium analysis | sections, legal_citations, financial_transactions | Medium — summarisation + classification | Uses Tier 1 via selectFields |
| 3 — Complex reasoning | risk_assessment | High — legal analysis + risk scoring | Uses Tiers 1 + 2 |
| 4 — Synthesis | strategic_brief | High — executive briefing | Uses Tier 3 |
ObjectWeaver: processingOrder enforced the tier dependencies, selectFields injected prior results into later fields, and independent fields within each tier were processed concurrently. Model: Gemini 2.5 Flash Lite.
Structured API: Single call with a JSON Schema covering all eight fields, one pass. Model: Gemini 2.5 Flash (the structured API requires the full model, not lite).
Both used temperature 0 and identical field descriptions.
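To make the setup concrete, here is a sketch of what the ObjectWeaver configuration might look like. The field names come from the schema table above; the exact shape of the `processingOrder` and `selectFields` options is assumed from the description in this post, not copied from ObjectWeaver's documentation.

```typescript
// Hypothetical sketch of the benchmark configuration. The option names
// (processingOrder, selectFields) follow the post's description; the exact
// ObjectWeaver API may differ.
const config = {
  model: "gemini-2.5-flash-lite",
  temperature: 0,
  // Tiers run in order; fields within a tier run concurrently.
  processingOrder: [
    ["metadata", "parties", "key_dates"],                      // Tier 1
    ["sections", "legal_citations", "financial_transactions"], // Tier 2
    ["risk_assessment"],                                       // Tier 3
    ["strategic_brief"],                                       // Tier 4
  ],
  fields: {
    risk_assessment: {
      // Inject earlier results as grounding context.
      selectFields: ["metadata", "parties", "sections",
                     "legal_citations", "financial_transactions"],
      description: "Legal risk analysis with per-argument strength scores",
    },
    strategic_brief: {
      selectFields: ["risk_assessment"],
      description: "Executive briefing synthesising the risk assessment",
    },
    // Tier 1 and 2 field definitions omitted for brevity.
  },
};
```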
Results
Performance
| Metric | Structured API | ObjectWeaver | Difference |
|---|---|---|---|
| Duration | 54s | 15s | OW 3.6× faster |
| Prompt tokens | 3,755 | ~1.01M (estimated) | Structured 270× fewer |
| Output tokens | 6,170 | Not tracked (OW stub) | — |
| API calls | 1 | ~228 | — |
| Errors | 0 | 0 | — |
ObjectWeaver was 3.6× faster despite 228 API calls versus one. Independent fields — metadata, parties, key_dates — processed simultaneously, and within each array all items ran in parallel.
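The concurrency pattern behind that speedup is the standard fan-out over independent work: a tier fires all of its field calls at once and resolves when the slowest one finishes. A minimal sketch, with `generateField` as a stand-in for a single model call:

```typescript
// Stand-in for one per-field model call; in the benchmark this would be
// a Gemini API request.
async function generateField(name: string): Promise<string> {
  return `${name}: extracted`;
}

// Run every field in a tier concurrently. All calls are in flight at once,
// so the tier takes as long as its slowest field, not the sum of all fields.
async function runTier(fields: string[]): Promise<Record<string, string>> {
  const results: Record<string, string> = {};
  await Promise.all(
    fields.map(async (f) => {
      results[f] = await generateField(f);
    })
  );
  return results;
}
```

For Tier 1, `runTier(["metadata", "parties", "key_dates"])` issues all three calls simultaneously — which is why 228 small calls can still finish 3.6× faster than one large one.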
Output Quality
Field by field:
| Field | Structured (items / words) | OW (items / words) | OW advantage |
|---|---|---|---|
| metadata | 6 / 30 | 6 / 34 | Tie (OW formatted dates correctly) |
| parties | 7 / 202 | 9 / 326 | +2 parties, +61% detail |
| key_dates | 17 / 345 | 17 / 2,196 | Same count, 6.4× more detail |
| sections | 5 / 501 | 4 / 711 | −1 section, +42% depth |
| legal_citations | 10 / 384 | 11 / 971 | +1 citation, 2.5× more analysis |
| financial_transactions | 7 / 305 | 7 / 405 | Same count, +33% detail |
| risk_assessment | 5 fields / 397 | 5 fields / 1,052 | 2.6× more reasoning, 9 vs 4 vulnerabilities |
| strategic_brief | 5 / 426 | 5 / 484 | +14% more detail |
| Total | 2,590 words | 6,179 words | OW 2.4× more content |
What the Numbers Mean
The structured API compressed its analysis: 6,170 output tokens across eight fields averages ~770 tokens per field. The risk assessment got just 4 vulnerabilities and 397 words. The model had more to say but ran out of budget.
ObjectWeaver gave each field the full output window. Each field is a separate call — the risk assessment could run for thousands of tokens independently. It found 9 vulnerabilities and produced 2.6× more reasoning, not competing for budget with metadata extraction.
key_dates is the starkest example: both found 17 dates, but OW produced 6.4× more words per date. The structured API gave bare-bones descriptions to stay within budget.
Progressive Refinement Worked
risk_assessment received extracted metadata, party names, section assessments, citation principles, and transaction amounts as context — grounded in specific facts already extracted, not re-reading the full document and hoping for consistency. strategic_brief then received the risk assessment's conclusions directly, letting its settlement and trial risk sections reference specific strength scores and vulnerabilities from the tier before.
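The injection step can be sketched as a prompt builder: the Tier 3 prompt is assembled from Tier 1/2 outputs rather than asking the model to re-derive them from the raw document. The prompt template here is illustrative — the post does not show ObjectWeaver's actual template.

```typescript
type PriorResults = Record<string, unknown>;

// Illustrative version of the selectFields injection: serialise the chosen
// earlier results into a context section ahead of the document text.
function buildRiskPrompt(document: string, prior: PriorResults): string {
  const context = [
    "metadata", "parties", "sections",
    "legal_citations", "financial_transactions",
  ]
    .map((f) => `## ${f}\n${JSON.stringify(prior[f], null, 2)}`)
    .join("\n\n");
  return `Assess litigation risk.\n\nExtracted context:\n${context}\n\nDocument:\n${document}`;
}
```

Because the model sees the already-extracted party names and transaction amounts verbatim, its risk analysis stays consistent with the earlier tiers instead of re-reading and possibly re-interpreting the source.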
Comparing Both Experiments
| Dimension | Graph Extraction | Document Analysis |
|---|---|---|
| Winner | Structured API | ObjectWeaver |
| Key factor | Cross-field referential integrity | Per-field output depth |
| Structured API advantage | Entity IDs consistent across all relationships | Single coherent response |
| OW advantage | N/A (OW lost) | 3.6× faster, 2.4× more content |
| Cross-field refs needed? | Yes — relationships must cite entities | Minimal — tiers are progressive |
| Output per field | Small (an ID, a name, a type) | Large (paragraphs of analysis) |
| Independent fields? | No — relationships depend on entities | Yes — Tier 1 is fully independent |
Structured output wins when fields reference each other tightly. Orchestration wins when fields need independent depth.
When to Use ObjectWeaver
Rich per-field analysis. When each field deserves thorough, multi-paragraph output — document analysis, compliance reviews, due diligence, medical record summarisation. If your fields contain words like "analysis", "assessment", or "recommendation", they benefit from dedicated attention.
Independent subtrees. When most fields don't depend on each other. If removing one field wouldn't break any other, those fields are independent and will benefit from parallel processing.
Progressive reasoning chains. When your workflow is extract → analyse → synthesise. If your schema has a natural tier structure where later fields need earlier results, processingOrder + selectFields creates an inspectable chain-of-thought.
Large aggregate output. If your expected total output exceeds ~6,000 tokens, a single structured call will start compressing detail. ObjectWeaver's per-field calls have no aggregate limit.
Mixed model requirements. If you'd use different models or temperatures for different parts of your schema, OW is the only option that supports this.
When NOT to Use ObjectWeaver
Cross-field referential integrity. If field B must reference exact values from field A (entity IDs, foreign keys across arrays), single-pass coherence is unbeatable.
Simple, flat schemas. A handful of uniform fields doesn't justify orchestration overhead.
Cost-sensitive batch processing. OW uses ~100-270× more input tokens than a single structured call. If you're processing thousands of documents and depth isn't critical, structured output is dramatically cheaper.
The Token Economics
ObjectWeaver's input cost scales roughly as: Total Input Tokens ≈ N(fields) × T(document). For this benchmark: ~228 calls × ~4,400 tokens ≈ 1M input tokens vs. the structured API's 3,755 — a 270× multiplier. You pay in input tokens; you get back output depth, parallelism, and reasoning quality. Whether that trade-off makes sense depends on your use case.
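The back-of-envelope model from the paragraph above, worked through with this benchmark's numbers:

```typescript
// Total input tokens ≈ number of calls × document tokens per call.
function estimateInputTokens(calls: number, tokensPerCall: number): number {
  return calls * tokensPerCall;
}

const owTokens = estimateInputTokens(228, 4400); // 1,003,200 — ~1M, as measured
const structuredTokens = 3755;                   // single-call baseline
const multiplier = owTokens / structuredTokens;  // ~267× — the "270×" in the text
```

Note the model assumes every call re-sends the full document; trimming the per-call context (e.g. sending only relevant sections to each field) would reduce the multiplier proportionally.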
What's Next
These two experiments establish the boundaries. We're now focused on surfacing OW's actual token tracking, hybrid approaches (structured output for coherent extraction, OW for progressive reasoning on top), and model routing benchmarks to test quality-per-dollar gains from routing simple fields to flash-lite.
Neither approach is universally better. The skill is knowing which one your schema needs.
