Azure AI Document Intelligence: automatic data extraction from documents

João Barros · August 1, 2025 · 2 min read

Azure AI Document Intelligence (formerly Form Recognizer) uses AI models to extract text, tables and key fields from documents such as invoices, receipts, contracts and forms, with high accuracy and without a fixed template for each layout.

Available prebuilt models

prebuilt-invoice        → invoices (fields: VendorName, InvoiceDate, TotalTax, ...)
prebuilt-receipt        → receipts (MerchantName, TransactionDate, Total, ...)
prebuilt-idDocument     → ID/passport (FirstName, LastName, DocumentNumber, ...)
prebuilt-businessCard   → business cards (ContactNames, Emails, ...)
prebuilt-read           → generic OCR text extraction (printed and handwritten; no tables or fields)
prebuilt-layout         → text + tables + selection marks
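
One way to use this list in code is a small lookup from document category to model ID. The sketch below is illustrative only: the category names and the pick_model helper are hypothetical, and falling back to prebuilt-layout for unknown categories is an assumption, not a service rule.

```python
# Hypothetical mapping from an application-level category to a prebuilt model ID.
PREBUILT_MODELS = {
    "invoice": "prebuilt-invoice",
    "receipt": "prebuilt-receipt",
    "id": "prebuilt-idDocument",
    "business_card": "prebuilt-businessCard",
    "text": "prebuilt-read",
    "layout": "prebuilt-layout",
}


def pick_model(category: str, default: str = "prebuilt-layout") -> str:
    """Return the model ID for a category, or the default for unknown ones."""
    return PREBUILT_MODELS.get(category.strip().lower(), default)
```

The returned string can be passed as the model ID when starting an analysis.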

Analyze an invoice

import os

from azure.ai.formrecognizer import DocumentAnalysisClient
from azure.core.credentials import AzureKeyCredential

client = DocumentAnalysisClient(
    endpoint=os.environ["DOC_INTEL_ENDPOINT"],
    credential=AzureKeyCredential(os.environ["DOC_INTEL_KEY"])
)

with open("invoice.pdf", "rb") as f:
    poller = client.begin_analyze_document("prebuilt-invoice", f)

result = poller.result()

for invoice in result.documents:
    fields = invoice.fields
    # A field is missing from the dict when the model did not detect it
    vendor = fields.get("VendorName")
    date = fields.get("InvoiceDate")
    total = fields.get("InvoiceTotal")
    if vendor:
        print(f"Vendor:     {vendor.value}")
    if date:
        print(f"Date:       {date.value}")
    if total and total.value:
        # The SDK's CurrencyValue exposes .amount and .symbol
        print(f"Total:      {total.value.amount} {total.value.symbol or ''}")
        print(f"Confidence: {total.confidence:.0%}")
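
Because every field carries a confidence score, a natural follow-up is to separate values that can be accepted automatically from those that need a human check. This is a minimal sketch, assuming field objects expose .value and .confidence as in the loop above; the split_by_confidence helper and the 0.80 threshold are illustrative choices, not part of the SDK.

```python
from types import SimpleNamespace


def split_by_confidence(fields, names, threshold=0.80):
    """Split requested fields into accepted values and names needing review.

    `fields` is any mapping of field name -> object with .value and
    .confidence (the shape used by result.documents[i].fields).
    """
    accepted, needs_review = {}, []
    for name in names:
        field = fields.get(name)
        # Missing fields, and fields without a confidence score, go to review
        if field is None or field.confidence is None or field.confidence < threshold:
            needs_review.append(name)
        else:
            accepted[name] = field.value
    return accepted, needs_review


# Stand-in objects that mimic the SDK's field shape (sample values, not real output)
sample = {
    "VendorName": SimpleNamespace(value="Contoso", confidence=0.97),
    "InvoiceTotal": SimpleNamespace(value=110.0, confidence=0.62),
}
accepted, needs_review = split_by_confidence(
    sample, ["VendorName", "InvoiceTotal", "InvoiceDate"]
)
print(accepted)
print(needs_review)
```

The right threshold depends on the cost of a wrong value in your process, so it is worth tuning against a sample of already-verified documents.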

Custom model — train on your documents

# 1. Upload 5+ sample documents + labels in Document Intelligence Studio
# 2. Train the custom model (3-5 minutes)
# 3. Use the generated model_id:

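# "contract.pdf" is a placeholder file name; reuses the client created earlier
with open("contract.pdf", "rb") as f:
    poller = client.begin_analyze_document(
        model_id="bconcepts-contracts-model",
        document=f,
    )
result = poller.result()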
# Fields defined in the labels: ContractNumber, StartDate, AnnualValue, etc.

Integrate into a processing pipeline

# Power Automate: HTTP POST to the Document Intelligence API
# → Parse the JSON response → Save fields to SharePoint / Dataverse / SQL
# Or: an Azure Function triggered by Blob Storage → processes each PDF on arrival
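
Whichever orchestrator is used, the "parse the JSON response" step amounts to flattening the analyzeResult.documents[*].fields object into rows. The sketch below assumes the v3-style REST response shape, in which each field has a type, a confidence and a matching value<Type> key (for example valueString or valueCurrency). The flatten_fields helper and the sample payload are hypothetical, and the payload is heavily trimmed.

```python
import json


def flatten_fields(response: dict) -> list:
    """Turn an analyze response into one flat dict per extracted document."""
    rows = []
    documents = response.get("analyzeResult", {}).get("documents", [])
    for doc in documents:
        row = {"docType": doc.get("docType")}
        for name, field in doc.get("fields", {}).items():
            # The typed value lives under a key such as valueString or
            # valueCurrency; fall back to the raw text in "content".
            ftype = field.get("type", "")
            value_key = "value" + ftype[:1].upper() + ftype[1:]
            row[name] = field.get(value_key, field.get("content"))
            row[name + "_confidence"] = field.get("confidence")
        rows.append(row)
    return rows


# Trimmed, made-up payload in the assumed response shape
sample_response = json.loads("""
{
  "status": "succeeded",
  "analyzeResult": {
    "documents": [
      {
        "docType": "invoice",
        "fields": {
          "VendorName": {"type": "string", "valueString": "Contoso",
                         "content": "Contoso", "confidence": 0.95},
          "InvoiceTotal": {"type": "currency",
                           "valueCurrency": {"amount": 110.0, "currencySymbol": "$"},
                           "content": "$110.00", "confidence": 0.9}
        }
      }
    ]
  }
}
""")
rows = flatten_fields(sample_response)
print(rows[0]["VendorName"], rows[0]["InvoiceTotal"]["amount"])
```

Each row is a plain dictionary, so it can be written to SharePoint, Dataverse or SQL with whatever connector the pipeline already uses.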

Conclusion

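Document Intelligence removes most of the manual work of extracting data from documents. With prebuilt models, there is no OCR code to write: just call the API and process the structured JSON. For organization-specific documents, a custom model can be trained from as few as 5 labeled examples. Accuracy of 95%+ is attainable on consistent layouts, but it depends on document quality and variety, so validate the results against your own samples.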