
Where Orchestration Wins: Document Analysis Benchmarks and the Right Tool for the Job

· 6 min read
ObjectWeaver Team
AI Infrastructure Platform

In our previous post, structured output beat ObjectWeaver on knowledge graph extraction. We promised a follow-up testing where orchestration actually shines. Here are the results.

The Hypothesis

ObjectWeaver should outperform single-pass structured output when the task has independent subtrees, progressive refinement, rich per-field output, and heterogeneous complexity. We designed a document analysis benchmark to test this directly.

Experiment Design

The document: A 15,000-character fictional legal brief — a motion for summary judgment in a $3.1 billion failed merger case. The document contains parties, dates, financial transactions, legal citations, four distinct legal arguments, and supporting evidence.

The schema: Eight top-level fields across four complexity tiers:

| Tier | Fields | Complexity | Dependencies |
|---|---|---|---|
| 1 — Simple extraction | metadata, parties, key_dates | Low — pattern matching | None (independent) |
| 2 — Medium analysis | sections, legal_citations, financial_transactions | Medium — summarisation + classification | Uses Tier 1 via selectFields |
| 3 — Complex reasoning | risk_assessment | High — legal analysis + risk scoring | Uses Tiers 1 + 2 |
| 4 — Synthesis | strategic_brief | High — executive briefing | Uses Tier 3 |

ObjectWeaver: processingOrder for tier dependencies, selectFields injecting prior results into later fields, independent fields within each tier processing concurrently. Model: Gemini 2.5 Flash Lite.
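The post names only `processingOrder` and `selectFields`; a sketch of how the benchmark's tiering might look in config form (the surrounding shape is our assumption, not ObjectWeaver's documented API):

```typescript
// Hypothetical shape of the benchmark's ObjectWeaver config. Only
// `processingOrder` and `selectFields` are named in the post; the rest
// of this structure is illustrative.
const weaverConfig = {
  model: "gemini-2.5-flash-lite",
  temperature: 0,
  // Tiers run in order; fields within a tier run concurrently.
  processingOrder: [
    ["metadata", "parties", "key_dates"],                      // Tier 1
    ["sections", "legal_citations", "financial_transactions"], // Tier 2
    ["risk_assessment"],                                       // Tier 3
    ["strategic_brief"],                                       // Tier 4
  ],
  // Later fields see selected earlier results, not just the raw document.
  selectFields: {
    risk_assessment: ["parties", "sections", "legal_citations", "financial_transactions"],
    strategic_brief: ["risk_assessment"],
  },
};
```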

Structured API: Single call with a JSON Schema covering all eight fields, one pass. Model: Gemini 2.5 Flash (the structured API requires the full model, not lite).

Both used temperature 0 and identical field descriptions.

Results

Performance

| Metric | Structured API | ObjectWeaver | Difference |
|---|---|---|---|
| Duration | 54s | 15s | OW 3.6× faster |
| Prompt tokens | 3,755 | ~1.01M (estimated) | Structured 270× fewer |
| Output tokens | 6,170 | Not tracked (OW stub) | — |
| API calls | 1 | ~228 | — |
| Errors | 0 | 0 | — |

ObjectWeaver was 3.6× faster despite 228 API calls versus one. Independent fields — metadata, parties, key_dates — processed simultaneously, and within each array all items ran in parallel.
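A minimal sketch of that fan-out pattern, with a stubbed `extractField` standing in for the real per-field model call (an assumption for illustration):

```typescript
// Stub standing in for a per-field model call; the real extractor would
// call the Gemini API. Here it just resolves after simulated latency.
async function extractField(field: string, doc: string): Promise<string> {
  await new Promise<void>((resolve) => setTimeout(resolve, 50));
  return `${field}: extracted from ${doc.length}-char document`;
}

// Tier 1 fields have no dependencies, so they run concurrently:
// total latency is roughly max(call latencies), not their sum.
async function runTier(fields: string[], doc: string) {
  const values = await Promise.all(fields.map((f) => extractField(f, doc)));
  return Object.fromEntries(
    fields.map((f, i) => [f, values[i]] as [string, string]),
  );
}

runTier(["metadata", "parties", "key_dates"], "...legal brief text...")
  .then((results) => console.log(Object.keys(results)));
```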

Output Quality

Field by field:

| Field | Structured (items / words) | OW (items / words) | OW advantage |
|---|---|---|---|
| metadata | 6 / 30 | 6 / 34 | Tie (OW formatted dates correctly) |
| parties | 7 / 202 | 9 / 326 | +2 parties, +61% detail |
| key_dates | 17 / 345 | 17 / 2,196 | Same count, 6.4× more detail |
| sections | 5 / 501 | 4 / 711 | −1 section, +42% depth |
| legal_citations | 10 / 384 | 11 / 971 | +1 citation, 2.5× more analysis |
| financial_transactions | 7 / 305 | 7 / 405 | Same count, +33% detail |
| risk_assessment | 5 fields / 397 | 5 fields / 1,052 | 2.6× more reasoning, 9 vs 4 vulnerabilities |
| strategic_brief | 5 / 426 | 5 / 484 | +14% more detail |
| Total | 2,590 words | 6,179 words | OW 2.4× more content |

What the Numbers Mean

The structured API compressed its analysis. 6,170 output tokens across 8 fields is ~770 tokens per field. The risk assessment got 4 vulnerabilities and 397 words. The model had more to say but ran out of budget.
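The squeeze is plain arithmetic, using the benchmark's own numbers:

```typescript
// Single-pass output budget spread across all fields.
const outputTokens = 6_170;
const fields = 8;
const perField = Math.round(outputTokens / fields); // ~771 tokens per field

// By contrast, each ObjectWeaver field gets its own full output window,
// so a deep field like risk_assessment is not capped by its siblings.
console.log(perField);
```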

ObjectWeaver gave each field the full output window. Each field is a separate call — the risk assessment could run for thousands of tokens independently. It found 9 vulnerabilities and produced 2.6× more reasoning, not competing for budget with metadata extraction.

key_dates is the starkest example: both found 17 dates, but OW produced 6.4× more words per date. The structured API gave bare-bones descriptions to stay within budget.

Progressive Refinement Worked

risk_assessment received extracted metadata, party names, section assessments, citation principles, and transaction amounts as context — grounded in specific facts already extracted rather than re-reading the full document and hoping for consistency. strategic_brief then received the risk assessment's conclusions directly, so its settlement and trial risk sections could reference specific strength scores and vulnerabilities from the tier before.
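A sketch of that projection step (the `selectFields` mechanics here are our assumption; only the option name comes from the post, and the data is abbreviated for illustration):

```typescript
// Completed results from earlier tiers (abbreviated, hypothetical values).
const completed: Record<string, unknown> = {
  metadata: { court: "...", caseNumber: "..." },
  parties: ["Plaintiff Corp", "Defendant Corp"],
  sections: [{ title: "Background", assessment: "..." }],
  risk_assessment: { vulnerabilities: 9, strengthScore: 0.7 },
};

// selectFields-style projection: pull only the named prior results
// into the prompt context for the next field.
function selectFields(names: string[], results: Record<string, unknown>) {
  return Object.fromEntries(
    names.map((n) => [n, results[n]] as [string, unknown]),
  );
}

// strategic_brief sees the risk assessment's conclusions directly,
// grounded in facts already extracted rather than re-read from the doc.
const briefContext = selectFields(["risk_assessment"], completed);
```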

Comparing Both Experiments

| Dimension | Graph Extraction | Document Analysis |
|---|---|---|
| Winner | Structured API | ObjectWeaver |
| Key factor | Cross-field referential integrity | Per-field output depth |
| Structured API advantage | Entity IDs consistent across all relationships | Single coherent response |
| OW advantage | N/A (OW lost) | 3.6× faster, 2.4× more content |
| Cross-field refs needed? | Yes — relationships must cite entities | Minimal — tiers are progressive |
| Output per field | Small (an ID, a name, a type) | Large (paragraphs of analysis) |
| Independent fields? | No — relationships depend on entities | Yes — Tier 1 is fully independent |

Structured output wins when fields reference each other tightly. Orchestration wins when fields need independent depth.

When to Use ObjectWeaver

Rich per-field analysis. When each field deserves thorough, multi-paragraph output — document analysis, compliance reviews, due diligence, medical record summarisation. If your fields contain words like "analysis", "assessment", or "recommendation", they benefit from dedicated attention.

Independent subtrees. When most fields don't depend on each other. If removing one field wouldn't break any other, those fields are independent and will benefit from parallel processing.

Progressive reasoning chains. When your workflow is extract → analyse → synthesise. If your schema has a natural tier structure where later fields need earlier results, processingOrder + selectFields creates an inspectable chain-of-thought.

Large aggregate output. If your expected total output exceeds ~6,000 tokens, a single structured call will start compressing detail. ObjectWeaver's per-field calls have no aggregate limit.

Mixed model requirements. If you'd use different models or temperatures for different parts of your schema, OW is the only one of the two approaches that supports this.

When NOT to Use ObjectWeaver

Cross-field referential integrity. If field B must reference exact values from field A (entity IDs, foreign keys across arrays), single-pass coherence is unbeatable.

Simple, flat schemas. A handful of uniform fields doesn't justify orchestration overhead.

Cost-sensitive batch processing. OW uses ~100-270× more input tokens than a single structured call. If you're processing thousands of documents and depth isn't critical, structured output is dramatically cheaper.

The Token Economics

ObjectWeaver's input cost scales roughly as: Total Input Tokens ≈ N(fields) × T(document). For this benchmark: ~228 calls × ~4,400 tokens ≈ 1M input tokens vs. the structured API's 3,755 — a 270× multiplier. You pay in input tokens; you get back output depth, parallelism, and reasoning quality. Whether that trade-off makes sense depends on your use case.
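Plugging the benchmark's numbers into that formula:

```typescript
// Input-cost model from the post: each of the ~228 calls re-sends the
// ~4,400-token document plus its field prompt, so input tokens scale
// roughly as N(fields) x T(document).
const calls = 228;
const tokensPerCall = 4_400; // document + field prompt, approximate
const weaverInput = calls * tokensPerCall;                    // 1,003,200
const structuredInput = 3_755;                                // single pass
const multiplier = Math.round(weaverInput / structuredInput); // ~270x
```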

What's Next

These two experiments establish the boundaries. We're now focused on surfacing OW's actual token tracking, hybrid approaches (structured output for coherent extraction, OW for progressive reasoning on top), and model routing benchmarks to test quality-per-dollar gains from routing simple fields to flash-lite.

Neither approach is universally better. The skill is knowing which one your schema needs.