1. Use descriptive field names, not generic ones
Field names are the first signal the AI uses to decide what to extract. Names that read like real document terms beat generic labels. When this matters most: forms with several similar fields (multiple dates, multiple amounts, multiple parties) where a generic name leaves the AI guessing which one to pick.
2. Write descriptions that locate the value
A good description tells the AI where to look, not just what the field means. Anchor descriptions to nearby labels, sections, or visual cues from the document.
3. Constrain values with enums when the set is fixed
If the document’s value comes from a known list (status, currency, document type), use a JSON Schema enum instead of a free string. Enums prevent the AI from inventing values and make downstream validation trivial.
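Tips 1 to 3 can be combined in a single schema. The following is a minimal sketch, written as a Python dict holding plain JSON Schema. The field names, descriptions, and currency list are illustrative assumptions about a typical invoice, not part of any built-in schema:

```python
import json

# Hypothetical invoice schema: every field name, description, and enum value
# below is an illustrative assumption, not taken from a real document.
invoice_schema = {
    "type": "object",
    "properties": {
        # Descriptive names instead of generic ones such as "date1" / "date2".
        "invoice_issue_date": {
            "type": "string",
            # The description locates the value by its nearby label.
            "description": "Date printed next to the 'Invoice date' label in the header block.",
        },
        "payment_due_date": {
            "type": "string",
            "description": "Date printed next to the 'Due date' label in the payment terms section.",
        },
        # The set of values is fixed, so use an enum instead of a free string.
        "currency": {
            "type": "string",
            "enum": ["USD", "EUR", "GBP"],
            "description": "Currency code shown beside the total in the summary box.",
        },
    },
}

print(json.dumps(invoice_schema, indent=2))
```

The two date fields differ in both name and description, so neither relies on the AI guessing which date is meant.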
4. Keep nesting shallow
Deep nesting is harder for the AI to populate consistently. Two levels (object containing an array of objects) is fine. Beyond that, results get fragile.
5. Extract only what’s on the page
The extract node is a reader, not a calculator. Asking it to compute, infer, or normalize a value that isn’t visible in the document leads to fabricated answers. Avoid fields like:
- tax_rate when the document only shows the tax amount and the subtotal.
- days_overdue when only the due date is on the page.
- total_in_usd when the document is in euros.
Extract the inputs (subtotal, tax_amount, due_date, currency) and compute the rest in a transform node where the math is deterministic.
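As a rough illustration of the deterministic math that belongs in a transform step, the sketch below derives tax_rate and days_overdue from the extracted inputs. The function name, the string-typed amounts, and the ISO 8601 date format are assumptions made for the example, not the transform node's actual interface:

```python
from datetime import date


def derive_fields(extracted: dict, today: date) -> dict:
    """Compute derived values from fields that were read directly off the page.

    Assumes amounts arrive as numeric strings and due_date as YYYY-MM-DD.
    """
    subtotal = float(extracted["subtotal"])
    tax_amount = float(extracted["tax_amount"])
    due_date = date.fromisoformat(extracted["due_date"])
    return {
        # Guard against division by zero when the subtotal is 0.
        "tax_rate": round(tax_amount / subtotal, 4) if subtotal else None,
        # An invoice that is not yet due counts as 0 days overdue.
        "days_overdue": max((today - due_date).days, 0),
    }


result = derive_fields(
    {"subtotal": "200.00", "tax_amount": "38.00", "due_date": "2024-03-01", "currency": "EUR"},
    today=date(2024, 3, 11),
)
print(result)  # {'tax_rate': 0.19, 'days_overdue': 10}
```

A currency conversion such as total_in_usd would additionally need an exchange rate from an external source, which is another reason to keep it out of the extraction schema.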
6. Pair confidence scores with a review step
Engine 1 can return a confidence score per field via the Field confidence toggle. On its own this just decorates the output. The pattern that makes it useful:
- Enable Field confidence on the extract node.
- Add a review node downstream with Pause mode = Unverified Fields.
- Runs with any unverified field land in the review queue automatically; clean runs flow straight through.
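The routing rule in the last bullet is handled by the review node itself, so no code is required. Purely to make the logic concrete, here is a sketch that assumes a hypothetical output shape in which each field carries a boolean `verified` flag; the real output format may differ:

```python
def route_run(fields: dict) -> str:
    """Return 'review' if any field is unverified, otherwise 'continue'.

    `fields` maps field names to dicts with a hypothetical 'verified' flag.
    A missing flag is treated as unverified, the conservative choice.
    """
    unverified = [name for name, info in fields.items() if not info.get("verified", False)]
    return "review" if unverified else "continue"


clean_run = {"invoice_number": {"value": "INV-001", "verified": True}}
flagged_run = {
    "invoice_number": {"value": "INV-001", "verified": True},
    "total": {"value": "118.00", "verified": False},
}
print(route_run(clean_run), route_run(flagged_run))
```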
7. Match the engine to the document
Pick the engine that matches the document type. Engine 2 has no precision setting, no field confidence, and no OCR grounding; it’s a single-tier choice for the cases Engine 1 can’t handle cleanly.
- Engine 1, Small / Medium: clean digital invoices, structured forms with a known layout. Cheapest, fastest. Default to Medium.
- Engine 1, High: complex layouts, multilingual content, or cases where Engine 1 Medium produced wrong values.
- Engine 1 with Force OCR + OCR grounding: when you need precise field highlights on a scanned PDF whose text layer is inaccurate. Adds 1 credit per page on top of the precision cost.
- Engine 2: handwriting, low-quality scans, photos of paper, ambiguous layouts. Flat 2 credits per page.
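The list above can be read as a small decision procedure. The sketch below encodes one possible reading of it; the trait names and the order in which the checks run are assumptions made for illustration, not a rule the product enforces:

```python
from dataclasses import dataclass


@dataclass
class DocTraits:
    """Hypothetical description of a document; all flags default to False."""
    handwritten: bool = False
    low_quality_scan: bool = False  # includes photos of paper
    ambiguous_layout: bool = False
    complex_or_multilingual: bool = False
    needs_highlights_on_bad_text_layer: bool = False


def pick_engine(traits: DocTraits) -> str:
    """Map document traits to an engine setting, following the list above."""
    # Cases Engine 1 cannot handle cleanly go to Engine 2.
    if traits.handwritten or traits.low_quality_scan or traits.ambiguous_layout:
        return "Engine 2"
    # Scanned PDF with an inaccurate text layer and a need for field highlights.
    if traits.needs_highlights_on_bad_text_layer:
        return "Engine 1 + Force OCR + OCR grounding"
    if traits.complex_or_multilingual:
        return "Engine 1, High"
    # Clean digital documents with a known layout: the suggested default.
    return "Engine 1, Medium"


print(pick_engine(DocTraits()))
print(pick_engine(DocTraits(handwritten=True)))
```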
8. Validate the shape, not just the content
Even with good descriptions, the AI can return null for fields the document doesn’t contain. Add a validation node after extract to enforce required fields, formats, and ranges. Validation errors make a great trigger for review or for routing the run to a fallback path.
Common pitfalls
Schema asks for a derived value the document doesn't show
Fields like subtotal_after_discount or total_in_usd get fabricated when the document doesn’t contain them. Extract the inputs and compute derived values in transform.
Multiple plausible candidates and no locating description
Documents with several dates, several amounts, or several parties confuse the AI when descriptions are generic. Anchor each description to a section header, label, or visual region.
Engine 2 enabled by default for cost reasons
Engine 2 costs 2 credits per page flat, double the cheapest Engine 1 tier. It also gives up Field confidence, OCR grounding, and the precision dial. If your documents are mostly clean, run Engine 1 (Small or Medium) and only fall back to Engine 2 on validation failure.
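The fallback pattern can be sketched as a validate-then-retry step. In a workflow this would be wired with a validation node and routing rather than code; the function names and the callable-based interface below are hypothetical and serve only to show the control flow:

```python
from typing import Callable


def missing_required(result: dict, required: list[str]) -> list[str]:
    """Return the names of required fields that are absent or null."""
    return [name for name in required if result.get(name) is None]


def extract_with_fallback(
    document: str,
    primary: Callable[[str], dict],   # stand-in for an Engine 1 extract step
    fallback: Callable[[str], dict],  # stand-in for an Engine 2 extract step
    required: list[str],
) -> tuple[dict, str]:
    """Run the cheaper extractor first; use the fallback only if validation fails."""
    result = primary(document)
    if not missing_required(result, required):
        return result, "primary"
    return fallback(document), "fallback"


# Stub extractors standing in for real extract nodes.
def engine_1_stub(doc: str) -> dict:
    return {"total": None if "handwritten" in doc else "118.00"}


def engine_2_stub(doc: str) -> dict:
    return {"total": "118.00"}


print(extract_with_fallback("clean invoice", engine_1_stub, engine_2_stub, ["total"]))
print(extract_with_fallback("handwritten note", engine_1_stub, engine_2_stub, ["total"]))
```

With this ordering, the 2-credit Engine 2 price is paid only for the documents that actually fail validation.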
Field confidence on, review off
Enabling Field confidence without a downstream review node configured for Unverified Fields doesn’t change behavior; it just adds metadata. The pattern is enable + review.
One mega-schema for several document types
A schema designed to cover invoices, receipts, and statements at once produces nulls everywhere because no single document has every field. Use classify upstream to send each type to its own extract node with a focused schema.
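Conceptually, the classify step selects one focused schema per document type. A minimal sketch of that lookup follows, with hypothetical type labels and field names; in a workflow the routing is configured on the nodes rather than written as code:

```python
# Hypothetical focused schemas, one per document type.
SCHEMAS_BY_TYPE = {
    "invoice": {
        "type": "object",
        "properties": {
            "invoice_number": {"type": "string"},
            "payment_due_date": {"type": "string"},
        },
    },
    "receipt": {
        "type": "object",
        "properties": {
            "merchant_name": {"type": "string"},
            "purchase_total": {"type": "string"},
        },
    },
    "statement": {
        "type": "object",
        "properties": {
            "statement_period_end": {"type": "string"},
            "closing_balance": {"type": "string"},
        },
    },
}


def schema_for(doc_type: str) -> dict:
    """Return the focused schema for a classified document type."""
    if doc_type not in SCHEMAS_BY_TYPE:
        # Fail loudly instead of silently falling back to a catch-all schema.
        raise ValueError(f"no schema for document type: {doc_type!r}")
    return SCHEMAS_BY_TYPE[doc_type]


print(sorted(schema_for("receipt")["properties"]))
```

Each schema lists only fields that its document type is expected to contain, so a null in the output points to a real gap instead of a field that never applied.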
JSON path used as a substitute for schema design
Setting a JSON path on extract narrows the output but doesn’t fix a vague schema. If results are wrong, fix the descriptions and field names first; reach for JSON path only to slice an already-good output.
Related
Extract action
Configuration reference for the extract node
Schema design
Deeper guide to JSON Schema for DocPipe
Review action
Pause runs with unverified fields for human review
Validation action
Enforce shape and required fields after extraction