The extract node is where most of the value (and most of the credit cost) lives. Reliable results come from a tight schema, the right engine choice, and a workflow that recovers from low-confidence cases. The principles below pair with the deeper schema design guide.

1. Use descriptive field names, not generic ones

Field names are the first signal the AI uses to decide what to extract. Names that read like real document terms beat generic labels. When this matters most: forms with several similar fields (multiple dates, multiple amounts, multiple parties) where a generic name leaves the AI guessing which one to pick.
// Effective
{
  "invoice_date": { "type": "string", "description": "Date the invoice was issued (ISO 8601)" },
  "due_date": { "type": "string", "description": "Payment due date (ISO 8601)" },
  "ship_date": { "type": "string", "description": "Date goods shipped (ISO 8601)" }
}

// Problematic
{
  "date1": { "type": "string" },
  "date2": { "type": "string" },
  "date3": { "type": "string" }
}

2. Write descriptions that locate the value

A good description tells the AI where to look, not just what the field means. Anchor descriptions to nearby labels, sections, or visual cues from the document.
// Effective
"total_amount": {
  "type": "number",
  "description": "Grand total in the bottom-right summary box, including tax and shipping"
}

// Problematic
"total_amount": {
  "type": "number",
  "description": "The total"
}
When a document has multiple plausible candidates (subtotal, tax, total, balance due), a locating description is the simplest fix.

3. Constrain values with enums when the set is fixed

If a field’s value comes from a known list (status, currency, document type), constrain it with a JSON Schema enum instead of a free-form string. Enums prevent the AI from inventing values and make downstream validation trivial.
// Effective
"currency": {
  "type": "string",
  "enum": ["USD", "EUR", "GBP", "CAD"],
  "description": "Currency code"
}

// Problematic
"currency": {
  "type": "string",
  "description": "Currency code"
}
Common candidates for enums: currency, country, status, document type, payment method.
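With an enum in place, the downstream check collapses to a membership test. A minimal sketch; the function and field wiring here are illustrative, not a DocPipe API:

```python
# Hypothetical downstream check: the allowed set mirrors the schema's enum,
# so validating the extracted value is a one-line membership test.
ALLOWED_CURRENCIES = {"USD", "EUR", "GBP", "CAD"}

def validate_currency(extracted: dict) -> bool:
    """Return True when the extracted currency is one of the allowed codes."""
    return extracted.get("currency") in ALLOWED_CURRENCIES

validate_currency({"currency": "EUR"})    # True
validate_currency({"currency": "euros"})  # False: free-text value the enum would have prevented
```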

4. Keep nesting shallow

Deep nesting is harder for the AI to populate consistently. Two levels (object containing an array of objects) is fine. Beyond that, results get fragile.
// Effective: flat top level, array of line items at depth 2
{
  "vendor_name": { "type": "string" },
  "total_amount": { "type": "number" },
  "line_items": {
    "type": "array",
    "items": {
      "type": "object",
      "properties": {
        "description": { "type": "string" },
        "amount":      { "type": "number" }
      }
    }
  }
}

// Problematic: nested objects for purely organizational reasons
{
  "header": {
    "type": "object",
    "properties": {
      "vendor": {
        "type": "object",
        "properties": {
          "identity": {
            "type": "object",
            "properties": {
              "name": { "type": "string" }
            }
          }
        }
      }
    }
  }
}
Reach for nesting only when the document itself has a nested structure (line items, parties, addresses). Otherwise flatten.

5. Extract only what’s on the page

The extract node is a reader, not a calculator. Asking it to compute, infer, or normalize a value that isn’t visible in the document leads to fabricated answers. Avoid fields like:
  • tax_rate when the document only shows the tax amount and the subtotal.
  • days_overdue when only the due date is on the page.
  • total_in_usd when the document is in euros.
Extract the raw values (subtotal, tax_amount, due_date, currency) and compute the rest in a transform node where the math is deterministic.
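In a transform node the derivation is plain arithmetic. A hedged sketch, assuming the extract node returned subtotal, tax_amount, and due_date as raw fields (names and payload shape are illustrative):

```python
from datetime import date

# Illustrative transform step: derive values from the raw fields the
# extract node returned, instead of asking the AI to compute them.
def derive(extracted: dict, today: date) -> dict:
    subtotal = extracted["subtotal"]
    tax_amount = extracted["tax_amount"]
    due = date.fromisoformat(extracted["due_date"])
    return {
        "tax_rate": tax_amount / subtotal,           # deterministic, never fabricated
        "days_overdue": max((today - due).days, 0),  # 0 when not yet due
    }

derive({"subtotal": 200.0, "tax_amount": 16.0, "due_date": "2024-05-01"},
       today=date(2024, 5, 11))
# {'tax_rate': 0.08, 'days_overdue': 10}
```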

6. Pair confidence scores with a review step

Engine 1 can return a confidence score per field via the Field confidence toggle. On its own this just decorates the output. The pattern that makes it useful:
  1. Enable Field confidence on the extract node.
  2. Add a review node downstream with Pause mode = Unverified Fields.
  3. Runs with any unverified field land in the review queue automatically; clean runs flow straight through.
This keeps your throughput high on easy documents and surfaces the hard cases for a human, with no manual sampling.
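The routing the review node performs can be sketched as follows; the per-field `{"value", "verified"}` shape is an assumption for illustration, not the exact run payload:

```python
# Sketch of the review-node routing: any unverified field sends the whole
# run to the review queue; fully verified runs pass straight through.
def route(run_fields: dict) -> str:
    if any(not f["verified"] for f in run_fields.values()):
        return "review_queue"
    return "continue"

route({"total":  {"value": 99.5,  "verified": True},
       "vendor": {"value": "Acme", "verified": False}})  # 'review_queue'
```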

7. Match the engine to the document

Pick the engine that matches the document type. Engine 2 has no precision setting, no field confidence, and no OCR grounding; it’s a single-tier choice for the cases Engine 1 can’t handle cleanly.
  • Engine 1, Small / Medium: clean digital invoices, structured forms with a known layout. Cheapest, fastest. Default to Medium.
  • Engine 1, High: complex layouts, multilingual content, or cases where Engine 1 Medium produced wrong values.
  • Engine 1 with Force OCR + OCR grounding: when you need precise field highlights on a scanned PDF whose text layer is inaccurate. Adds 1 credit per page on top of the precision cost.
  • Engine 2: handwriting, low-quality scans, photos of paper, ambiguous layouts. Flat 2 credits per page.
Run a small benchmark batch when changing tiers; cost grows quickly and the accuracy gain isn’t always worth it.
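The cost side of that benchmark is simple arithmetic. The only figures assumed below are the ones stated on this page: Engine 2 at a flat 2 credits per page (double the cheapest Engine 1 tier, so Small works out to 1 credit per page) and OCR grounding adding 1 credit per page:

```python
# Back-of-envelope credit cost for a batch before switching tiers.
# per_page is the tier's base credits/page; OCR grounding adds 1/page.
def batch_cost(pages: int, per_page: float, ocr_grounding: bool = False) -> float:
    return pages * (per_page + (1 if ocr_grounding else 0))

batch_cost(500, per_page=1)                      # Engine 1 Small:      500 credits
batch_cost(500, per_page=2)                      # Engine 2:           1000 credits
batch_cost(500, per_page=1, ocr_grounding=True)  # Small + grounding:  1000 credits
```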

8. Validate the shape, not just the content

Even with good descriptions, the AI can return null for fields the document doesn’t contain. Add a validation node after extract to enforce required fields, formats, and ranges. Validation errors make a great trigger for review or for routing the run to a fallback path.
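As a mental model, the check a validation node performs looks roughly like this sketch (field names, formats, and ranges are illustrative):

```python
# Minimal shape validation after extract: required fields present and
# non-null, numeric values in range. An empty error list means proceed.
REQUIRED = ("vendor_name", "total_amount", "invoice_date")

def validate_shape(extracted: dict) -> list[str]:
    errors = [f"missing or null: {k}" for k in REQUIRED if extracted.get(k) is None]
    total = extracted.get("total_amount")
    if isinstance(total, (int, float)) and total < 0:
        errors.append("total_amount out of range (negative)")
    return errors

validate_shape({"vendor_name": "Acme", "total_amount": 120.0,
                "invoice_date": "2024-05-01"})   # [] -> clean, flows through
validate_shape({"vendor_name": None, "total_amount": -5})
# ['missing or null: vendor_name', 'missing or null: invoice_date',
#  'total_amount out of range (negative)'] -> route to review or fallback
```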

Common pitfalls

  • Derived fields like subtotal_after_discount or total_in_usd get fabricated when the document doesn’t contain them. Extract the inputs and compute derived values in transform.
  • Documents with several dates, amounts, or parties confuse the AI when descriptions are generic. Anchor each description to a section header, label, or visual region.
  • Engine 2 costs a flat 2 credits per page, double the cheapest Engine 1 tier, and gives up Field confidence, OCR grounding, and the precision dial. If your documents are mostly clean, run Engine 1 (Small or Medium) and fall back to Engine 2 only on validation failure.
  • Enabling Field confidence without a downstream review node configured for Unverified Fields doesn’t change behavior; it just adds metadata. The pattern is enable + review.
  • A schema designed to cover invoices, receipts, and statements at once produces nulls everywhere because no single document has every field. Use classify upstream to send each type to its own extract node with a focused schema.
  • Setting a JSON path on extract narrows the output but doesn’t fix a vague schema. If results are wrong, fix the descriptions and field names first; reach for JSON path only to slice an already-good output.
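On that last point: a JSON path is just a slice over output that already has to be correct, which is why it can’t rescue a vague schema. A simplified sketch (dot paths only, no array indexing):

```python
# Toy dict-path slice, analogous to a JSON path on the extract node:
# it selects a subtree of the output, nothing more.
def slice_path(data: dict, path: str):
    for key in path.strip("$.").split("."):
        data = data[key]  # a wrong value here stays wrong after slicing
    return data

out = {"vendor": {"name": "Acme"}, "line_items": [{"amount": 10}]}
slice_path(out, "$.vendor.name")  # 'Acme'
```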

Related pages

  • Extract action: configuration reference for the extract node
  • Schema design: deeper guide to JSON Schema for DocPipe
  • Review action: pause runs with unverified fields for human review
  • Validation action: enforce shape and required fields after extraction