The extraction schema defines what structured data the AI extracts from your documents. A well-designed schema produces more accurate and consistent results.
Schema structure
Schemas use the JSON Schema Draft-07 format. The root must be of type object, with properties defining each field:
{
  "type": "object",
  "properties": {
    "vendor_name": {
      "type": "string",
      "description": "Name of the vendor or supplier"
    },
    "total_amount": {
      "type": "number",
      "description": "Total amount due including tax"
    }
  },
  "required": ["vendor_name", "total_amount"]
}
The required array lists the fields that should always be extracted. Fields not listed in required may be returned as null when they are not found in the document.
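As a rough illustration of how required fields might be checked downstream, the sketch below flags required fields that are absent or null in a result. It assumes the extraction result arrives as a plain dictionary keyed by field name; that shape, the `missing_required` helper, and the sample result are hypothetical and not part of the product's API.

```python
import json

# The two-field schema from the example above.
schema = json.loads("""
{
  "type": "object",
  "properties": {
    "vendor_name": {"type": "string", "description": "Name of the vendor or supplier"},
    "total_amount": {"type": "number", "description": "Total amount due including tax"}
  },
  "required": ["vendor_name", "total_amount"]
}
""")


def missing_required(schema, result):
    """Return the required field names that are absent or null in the result."""
    return [name for name in schema.get("required", []) if result.get(name) is None]


# Hypothetical extraction result, used only for illustration.
result = {"vendor_name": "Acme Corp", "total_amount": None}
print(missing_required(schema, result))  # ['total_amount']
```

A non-empty list is a reasonable signal to route the document to manual review.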
Field types
| Type | Description | Example value |
|---|---|---|
| string | Text value | "Acme Corp" |
| number | Numeric value (integer or decimal) | 1250.00 |
| boolean | True or false | true |
| object | Nested object with its own properties | See below |
| array | List of nested objects | See below |
There is no date type. For date fields, use "type": "string" with a description indicating the expected format (e.g., “Date in ISO 8601 format YYYY-MM-DD”).
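Because dates come back as strings, downstream code usually needs to parse them. The following is a minimal sketch, assuming the description asked for YYYY-MM-DD and that a null or malformed value should be treated as missing; `parse_iso_date` is a hypothetical helper name.

```python
from datetime import date


def parse_iso_date(value):
    """Parse a YYYY-MM-DD string; return None for null or malformed values."""
    if not isinstance(value, str) or not value:
        return None
    try:
        return date.fromisoformat(value)
    except ValueError:
        # The value did not follow the requested format.
        return None


print(parse_iso_date("2024-03-15"))  # 2024-03-15
print(parse_iso_date("15/03/2024"))  # None
```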
Nested fields and arrays
For repeating data like line items, use the array type with an items object that defines the structure of each element:
{
  "line_items": {
    "type": "array",
    "description": "Individual line items on the invoice",
    "items": {
      "type": "object",
      "properties": {
        "description": { "type": "string", "description": "Item description" },
        "quantity": { "type": "number", "description": "Quantity ordered" },
        "unit_price": { "type": "number", "description": "Price per unit" },
        "amount": { "type": "number", "description": "Line total" }
      }
    }
  }
}
Note that items is a single schema object (not an array) describing the shape of each array element.
For documents with tabular data (invoices, purchase orders, statements), use arrays to capture table rows. The AI identifies table structures and maps columns to your defined fields.
Tips for table extraction:
- Name fields to match common column headers
- Include a description that mentions the column header name if it differs from the field name
- Test with documents that have varying table formats
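When testing table extraction, one possible sanity check is to compare each row's line total against quantity times unit price. The sketch below assumes rows shaped like the line_items schema above and assumes that amount equals quantity × unit_price, which will not hold for invoices with per-line discounts or taxes. The helper name and sample rows are hypothetical.

```python
import math


def inconsistent_rows(line_items, tolerance=0.01):
    """Return indexes of rows where quantity * unit_price differs from amount."""
    bad = []
    for index, row in enumerate(line_items):
        quantity = row.get("quantity")
        unit_price = row.get("unit_price")
        amount = row.get("amount")
        if quantity is None or unit_price is None or amount is None:
            continue  # Rows with missing values cannot be checked.
        if not math.isclose(quantity * unit_price, amount, abs_tol=tolerance):
            bad.append(index)
    return bad


# Hypothetical extracted rows, used only for illustration.
rows = [
    {"description": "Widget", "quantity": 2, "unit_price": 10.0, "amount": 20.0},
    {"description": "Gadget", "quantity": 3, "unit_price": 4.5, "amount": 15.0},
    {"description": "Service", "quantity": 1, "unit_price": None, "amount": 5.0},
]
print(inconsistent_rows(rows))  # [1]
```

A flagged row often points to a column that was mapped to the wrong field.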
Best practices
Write descriptive field descriptions
The AI uses each field's description to decide what to extract. Be specific:
// Good
"total": { "type": "number", "description": "Grand total amount including tax and shipping" }
// Less effective
"total": { "type": "number", "description": "Total" }
Use specific field names
Choose field names that clearly indicate the data:
// Good
"invoice_date": { "type": "string", "description": "Invoice issue date (ISO 8601)" }
"due_date": { "type": "string", "description": "Payment due date (ISO 8601)" }
// Ambiguous
"date1": { "type": "string" }
"date2": { "type": "string" }
Use the extract action’s Instructions field to give the AI context that the schema alone does not convey, for example:
- “Dates should be in ISO 8601 format (YYYY-MM-DD)”
- “If a field is not found in the document, return null”
- “The total should include tax. If tax is listed separately, add it to the subtotal”
Start simple, iterate
Begin with a small number of high-value fields. Test with real documents, review the results, and gradually add more fields as you confirm accuracy.
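One way to confirm accuracy between iterations is to compare extracted values against hand-checked expected values, field by field. This is a minimal sketch that assumes results and expectations are plain dictionaries and that exact equality is an acceptable match rule; `field_accuracy` and the sample data are hypothetical.

```python
from collections import Counter


def field_accuracy(pairs):
    """Given (extracted, expected) dict pairs, return {field: fraction correct}."""
    correct = Counter()
    seen = Counter()
    for extracted, expected in pairs:
        for field, want in expected.items():
            seen[field] += 1
            if extracted.get(field) == want:
                correct[field] += 1
    return {field: correct[field] / seen[field] for field in seen}


# Hypothetical test set of two documents, used only for illustration.
pairs = [
    ({"vendor_name": "Acme Corp", "total_amount": 1250.0},
     {"vendor_name": "Acme Corp", "total_amount": 1250.0}),
    ({"vendor_name": "Acme", "total_amount": 80.0},
     {"vendor_name": "Acme Corp", "total_amount": 80.0}),
]
print(field_accuracy(pairs))  # {'vendor_name': 0.5, 'total_amount': 1.0}
```

Fields that score low are candidates for a more specific description before more fields are added.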
Handle missing data
Not every document contains every field. The AI returns null for fields it cannot find. Design your downstream processing to handle missing values gracefully.
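A minimal sketch of null-tolerant downstream code follows. It assumes the result is a dictionary in which missing fields are either absent or set to null (None in Python); the helper names and the fallback text are hypothetical choices.

```python
def get_number(result, field):
    """Return the field as a float, or None if it is missing, null, or not numeric."""
    value = result.get(field)
    # bool is a subclass of int in Python, so exclude it explicitly.
    if isinstance(value, bool) or not isinstance(value, (int, float)):
        return None
    return float(value)


def summarize(result):
    """Build a one-line summary that tolerates missing vendor or total."""
    vendor = result.get("vendor_name") or "(unknown vendor)"
    total = get_number(result, "total_amount")
    if total is None:
        return f"{vendor}: total missing, needs review"
    return f"{vendor}: {total:.2f}"


print(summarize({"vendor_name": "Acme Corp", "total_amount": 1250}))
# Acme Corp: 1250.00
print(summarize({"vendor_name": None}))
# (unknown vendor): total missing, needs review
```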
Example schemas
Invoice
{
  "type": "object",
  "properties": {
    "vendor_name": { "type": "string", "description": "Name of the vendor or company issuing the invoice" },
    "invoice_number": { "type": "string", "description": "Invoice or reference number" },
    "invoice_date": { "type": "string", "description": "Date the invoice was issued (ISO 8601)" },
    "due_date": { "type": "string", "description": "Payment due date (ISO 8601)" },
    "subtotal": { "type": "number", "description": "Subtotal before tax" },
    "tax_amount": { "type": "number", "description": "Total tax amount" },
    "total_amount": { "type": "number", "description": "Grand total including tax" },
    "currency": { "type": "string", "description": "Currency code (e.g., USD, EUR, GBP)" },
    "line_items": {
      "type": "array",
      "description": "Invoice line items",
      "items": {
        "type": "object",
        "properties": {
          "description": { "type": "string", "description": "Item or service description" },
          "quantity": { "type": "number", "description": "Quantity" },
          "unit_price": { "type": "number", "description": "Price per unit" },
          "amount": { "type": "number", "description": "Line total" }
        }
      }
    }
  },
  "required": ["vendor_name", "invoice_number", "total_amount"]
}
Receipt
{
  "type": "object",
  "properties": {
    "merchant_name": { "type": "string", "description": "Name of the store or merchant" },
    "transaction_date": { "type": "string", "description": "Date of purchase (ISO 8601)" },
    "total": { "type": "number", "description": "Total amount paid" },
    "payment_method": { "type": "string", "description": "Payment method used (cash, credit card, etc.)" },
    "items": {
      "type": "array",
      "description": "Purchased items",
      "items": {
        "type": "object",
        "properties": {
          "name": { "type": "string", "description": "Item name" },
          "price": { "type": "number", "description": "Item price" }
        }
      }
    }
  },
  "required": ["merchant_name", "total"]
}