| 1 | +--- |
| 2 | +layout: integration |
| 3 | +name: Azure Document Intelligence |
| 4 | +description: Use Azure Document Intelligence with Haystack |
| 5 | +authors: |
| 6 | + - name: deepset |
| 7 | + socials: |
| 8 | + github: deepset-ai |
| 9 | + twitter: deepset_ai |
| 10 | + linkedin: https://www.linkedin.com/company/deepset-ai/ |
| 11 | +pypi: https://pypi.org/project/azure-doc-intelligence-haystack |
| 12 | +repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/azure_doc_intelligence |
| 13 | +type: Converter |
| 14 | +report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues |
| 15 | +logo: /logos/azure-ai.png |
| 16 | +version: Haystack 2.0 |
| 17 | +toc: true |
| 18 | +--- |
| 19 | + |
| 20 | +### **Table of Contents** |
| 21 | +- [Overview](#overview) |
| 22 | +- [Installation](#installation) |
| 23 | +- [Usage](#usage) |
| 24 | + |
| 25 | +## Overview |
| 26 | + |
| 27 | +[`AzureDocumentIntelligenceConverter`](https://docs.haystack.deepset.ai/docs/azureocrdocumentconverter) integrates [Azure Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/) (formerly Form Recognizer) with [Haystack](https://haystack.deepset.ai/) by [deepset](https://www.deepset.ai). |
| 28 | + |
| 29 | +This component uses Azure's Document Intelligence service to convert files into Haystack Documents whose content is markdown. The analysis covers layout detection, table extraction, and structured content recognition. |
| 30 | + |
| 31 | +**Supported file formats**: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, HTML. |
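
Since only the formats listed above are supported, it can help to separate supported files from the rest before conversion. The following is a minimal sketch, assuming files can be identified by extension. The extension set and the helper name `split_supported` are illustrative assumptions and are not part of the integration.

```python
from pathlib import Path

# Assumed mapping of the formats listed above to common file extensions.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpeg", ".jpg", ".png", ".bmp", ".tiff", ".tif",
    ".docx", ".xlsx", ".pptx", ".html",
}


def split_supported(paths):
    """Split paths into (supported, unsupported) by case-insensitive extension."""
    supported, unsupported = [], []
    for path in paths:
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


# Hypothetical file names, used only for illustration.
supported, skipped = split_supported(["invoice.PDF", "notes.txt", "scan.tiff"])
```

Only the `supported` list would then be passed as `sources` to the converter.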
| 32 | + |
| 33 | +**Key features**: |
| 34 | +- Markdown output with preserved structure (headings, tables, lists) |
| 35 | +- Inline table integration (tables rendered as markdown tables) |
| 36 | +- Improved layout analysis and reading order |
| 37 | +- Support for section headings |
| 38 | +- Multiple model options for different use cases |
| 39 | + |
| 40 | +## Installation |
| 41 | + |
| 42 | +Install the Azure Document Intelligence integration: |
| 43 | + |
| 44 | +```bash |
| 45 | +pip install "azure-doc-intelligence-haystack" |
| 46 | +``` |
| 47 | + |
| 48 | +## Usage |
| 49 | + |
| 50 | +To use the `AzureDocumentIntelligenceConverter`, you need an active [Azure subscription](https://azure.microsoft.com/en-us/products/ai-services/document-intelligence) with a deployed Document Intelligence or Cognitive Services resource. For authentication, set the service endpoint in the `AZURE_DI_ENDPOINT` environment variable and the API key in `AZURE_DI_API_KEY`. |
| 51 | + |
| 52 | +```python |
| 53 | +import os |
| 54 | +from haystack_integrations.components.converters.azure_doc_intelligence import ( |
| 55 | + AzureDocumentIntelligenceConverter, |
| 56 | +) |
| 57 | +from haystack.utils import Secret |
| 58 | + |
| 59 | +converter = AzureDocumentIntelligenceConverter( |
| 60 | + endpoint=os.environ["AZURE_DI_ENDPOINT"], |
| 61 | + api_key=Secret.from_env_var("AZURE_DI_API_KEY"), |
| 62 | +) |
| 63 | + |
| 64 | +results = converter.run(sources=["invoice.pdf", "contract.docx"]) |
| 65 | +documents = results["documents"] |
| 66 | + |
| 67 | +# Documents contain markdown with inline tables |
| 68 | +print(documents[0].content) |
| 69 | +``` |
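
The example above reads the endpoint with `os.environ[...]`, which raises a `KeyError` if the variable is unset. A small pre-flight check can give a clearer message. This is a sketch using only the standard library; the helper name `missing_env_vars` is a hypothetical name chosen for illustration.

```python
import os

# The two variables named in this section.
REQUIRED_VARS = ("AZURE_DI_ENDPOINT", "AZURE_DI_API_KEY")


def missing_env_vars(environ=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if environ is None else environ
    return [name for name in REQUIRED_VARS if not env.get(name)]


# A plain dict stands in for os.environ here; the endpoint value is a placeholder.
missing = missing_env_vars({"AZURE_DI_ENDPOINT": "https://example.invalid/"})
if missing:
    print("Set these environment variables first:", ", ".join(missing))
```

Calling `missing_env_vars()` without arguments checks the real process environment.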
| 70 | + |
| 71 | +### Model Options |
| 72 | + |
| 73 | +The converter supports different Azure Document Intelligence models depending on your needs: |
| 74 | + |
| 75 | +- **`prebuilt-document`** (default): General document analysis with markdown output |
| 76 | +- **`prebuilt-read`**: Fast OCR for text extraction |
| 77 | +- **`prebuilt-layout`**: Enhanced layout analysis with better table and structure detection |
| 78 | +- **Custom models**: Use your own trained models by providing the model ID |
| 79 | + |
| 80 | +```python |
| 81 | +# Use a specific model |
| 82 | +converter = AzureDocumentIntelligenceConverter( |
| 83 | + endpoint=os.environ["AZURE_DI_ENDPOINT"], |
| 84 | + api_key=Secret.from_env_var("AZURE_DI_API_KEY"), |
| 85 | + model_id="prebuilt-layout", # Enhanced layout analysis |
| 86 | +) |
| 87 | +``` |
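
If the model is chosen at runtime, a small lookup keeps the choice in one place. The sketch below is only an illustration: the use-case labels and the helper `pick_model_id` are hypothetical, and only the model IDs come from the list above.

```python
# Hypothetical use-case labels mapped to the model IDs listed above.
MODEL_BY_USE_CASE = {
    "general": "prebuilt-document",
    "ocr": "prebuilt-read",
    "layout": "prebuilt-layout",
}


def pick_model_id(use_case, custom_model_id=None):
    """Prefer an explicit custom model ID, else look up the use case."""
    if custom_model_id:
        return custom_model_id
    try:
        return MODEL_BY_USE_CASE[use_case]
    except KeyError:
        raise ValueError(f"Unknown use case: {use_case!r}") from None


model_id = pick_model_id("layout")
```

The resulting string can be passed as the `model_id` argument shown above.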
| 88 | + |
| 89 | +### Metadata |
| 90 | + |
| 91 | +The converter automatically adds metadata to each Document: |
| 92 | +- `model_id`: The Azure model used for analysis |
| 93 | +- `page_count`: Number of pages in the document |
| 94 | +- `file_path`: The source file path (filename only by default, or full path if `store_full_path=True`) |
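
The `file_path` behavior described above can be illustrated with the standard library. This sketch mimics the documented behavior and is not the converter's own code. The helper name `file_path_meta` is hypothetical, and POSIX-style paths are assumed.

```python
from pathlib import PurePosixPath


def file_path_meta(source, store_full_path=False):
    """Return the filename only, or the full path when store_full_path is True."""
    path = PurePosixPath(source)
    return str(path) if store_full_path else path.name


# Hypothetical source path, used only for illustration.
name_only = file_path_meta("reports/2024/invoice.pdf")
full_path = file_path_meta("reports/2024/invoice.pdf", store_full_path=True)
```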
| 95 | + |
| 96 | +You can also provide custom metadata: |
| 97 | + |
| 98 | +```python |
| 99 | +results = converter.run( |
| 100 | + sources=["document.pdf"], |
| 101 | + meta={"category": "legal", "priority": "high"} |
| 102 | +) |
| 103 | +``` |
| 104 | + |
| 105 | +For more details on Azure Document Intelligence capabilities and setup, refer to the [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api). |