Commit 4e197ec

Add Azure Document Intelligence integration (#392)
* add integration page for azure di
* Add docs link

---------

Co-authored-by: Bilge Yücel <bilge.yucel@deepset.ai>
1 parent c3b1af9 commit 4e197ec

1 file changed: +105 -0 lines changed

@@ -0,0 +1,105 @@
---
layout: integration
name: Azure Document Intelligence
description: Use Azure Document Intelligence with Haystack
authors:
  - name: deepset
    socials:
      github: deepset-ai
      twitter: deepset_ai
      linkedin: https://www.linkedin.com/company/deepset-ai/
pypi: https://pypi.org/project/azure-doc-intelligence-haystack
repo: https://github.com/deepset-ai/haystack-core-integrations/tree/main/integrations/azure_doc_intelligence
type: Converter
report_issue: https://github.com/deepset-ai/haystack-core-integrations/issues
logo: /logos/azure-ai.png
version: Haystack 2.0
toc: true
---

### **Table of Contents**
- [Overview](#overview)
- [Installation](#installation)
- [Usage](#usage)

## Overview

[`AzureDocumentIntelligenceConverter`](https://docs.haystack.deepset.ai/docs/azureocrdocumentconverter) provides an integration of [Azure Document Intelligence](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/) (formerly Form Recognizer) with [Haystack](https://haystack.deepset.ai/) by [deepset](https://www.deepset.ai).

This component uses Azure's Document Intelligence service to convert various file formats into Haystack Documents with markdown content. It supports advanced document analysis including layout detection, table extraction, and structured content recognition.

**Supported file formats**: PDF, JPEG, PNG, BMP, TIFF, DOCX, XLSX, PPTX, HTML.
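
Since the converter accepts only the formats listed above, it can be useful to separate supported files from unsupported ones before sending anything to Azure. The following is a minimal sketch using only the standard library. The extension set and the `split_by_support` helper are illustrative assumptions, not part of the integration, and the check looks at the file extension only, not the file contents.

```python
from pathlib import Path

# Assumed mapping from the listed formats to common file extensions.
# This set is illustrative; adjust it to match what your resource accepts.
SUPPORTED_EXTENSIONS = {
    ".pdf", ".jpg", ".jpeg", ".png", ".bmp", ".tif", ".tiff",
    ".docx", ".xlsx", ".pptx", ".html",
}


def split_by_support(paths):
    """Return (supported, unsupported) lists based on file extension only."""
    supported, unsupported = [], []
    for path in paths:
        # Compare case-insensitively so "scan.TIFF" is treated like "scan.tiff"
        if Path(path).suffix.lower() in SUPPORTED_EXTENSIONS:
            supported.append(path)
        else:
            unsupported.append(path)
    return supported, unsupported


supported, unsupported = split_by_support(["invoice.pdf", "scan.TIFF", "notes.txt"])
print(supported)    # ['invoice.pdf', 'scan.TIFF']
print(unsupported)  # ['notes.txt']
```

The `supported` list could then be passed as `sources` to the converter, while `unsupported` files are logged or routed to a different converter.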

**Key features**:
- Markdown output with preserved structure (headings, tables, lists)
- Inline table integration (tables rendered as markdown tables)
- Improved layout analysis and reading order
- Support for section headings
- Multiple model options for different use cases

## Installation

Install the Azure Document Intelligence integration:

```bash
pip install "azure-doc-intelligence-haystack"
```

## Usage

To use the `AzureDocumentIntelligenceConverter`, you need an active [Azure subscription](https://azure.microsoft.com/en-us/products/ai-services/document-intelligence) with a deployed Document Intelligence or Cognitive Services resource. For authentication, provide the service endpoint and the API key through the `AZURE_DI_ENDPOINT` and `AZURE_DI_API_KEY` environment variables.

```python
import os

from haystack.utils import Secret
from haystack_integrations.components.converters.azure_doc_intelligence import (
    AzureDocumentIntelligenceConverter,
)

# Read the endpoint and API key from the environment
converter = AzureDocumentIntelligenceConverter(
    endpoint=os.environ["AZURE_DI_ENDPOINT"],
    api_key=Secret.from_env_var("AZURE_DI_API_KEY"),
)

# Convert one or more files into Haystack Documents
results = converter.run(sources=["invoice.pdf", "contract.docx"])
documents = results["documents"]

# Documents contain markdown with inline tables
print(documents[0].content)
```
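
Because the example reads both settings from the environment, a missing variable only surfaces when the converter is constructed. A small pre-flight check can make that failure easier to read. The sketch below uses only the standard library. The `missing_azure_settings` helper is a hypothetical name, and the endpoint URL is a placeholder.

```python
import os

# The two environment variables used in the usage example
REQUIRED_VARS = ("AZURE_DI_ENDPOINT", "AZURE_DI_API_KEY")


def missing_azure_settings(env=None):
    """Return the names of required variables that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in REQUIRED_VARS if not env.get(name)]


# Demonstration with an explicit mapping instead of the real environment.
# The endpoint value is a placeholder, not a real resource.
example_env = {"AZURE_DI_ENDPOINT": "https://example.cognitiveservices.azure.com/"}
print(missing_azure_settings(example_env))  # ['AZURE_DI_API_KEY']
```

Calling `missing_azure_settings()` with no argument checks the real process environment, so a script can stop early with a clear message if the returned list is not empty.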

### Model Options

The converter supports different Azure Document Intelligence models depending on your needs:

- **`prebuilt-document`** (default): General document analysis with markdown output
- **`prebuilt-read`**: Fast OCR for text extraction
- **`prebuilt-layout`**: Enhanced layout analysis with better table and structure detection
- **Custom models**: Use your own trained models by providing the model ID

```python
# Use a specific model (reuses the imports from the Usage example)
converter = AzureDocumentIntelligenceConverter(
    endpoint=os.environ["AZURE_DI_ENDPOINT"],
    api_key=Secret.from_env_var("AZURE_DI_API_KEY"),
    model_id="prebuilt-layout",  # Enhanced layout analysis
)
```
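
If the model is chosen at runtime, the options above can be captured in a small selection function. This is only a sketch of one possible policy: the `choose_model_id` helper and its parameters are hypothetical, and the only values taken from this page are the three prebuilt model IDs.

```python
def choose_model_id(need_layout=False, text_only=False, custom_model_id=None):
    """Pick a model ID following the options listed on this page.

    Hypothetical policy: a custom model wins, then layout, then plain OCR,
    and the general-purpose model is the fallback.
    """
    if custom_model_id:
        return custom_model_id
    if need_layout:
        return "prebuilt-layout"
    if text_only:
        return "prebuilt-read"
    return "prebuilt-document"


print(choose_model_id())                  # prebuilt-document
print(choose_model_id(text_only=True))    # prebuilt-read
print(choose_model_id(need_layout=True))  # prebuilt-layout
```

The returned string can be passed as `model_id` when constructing the converter.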

### Metadata

The converter automatically adds metadata to each Document:
- `model_id`: The Azure model used for analysis
- `page_count`: Number of pages in the document
- `file_path`: The source file path (filename only by default, or full path if `store_full_path=True`)

You can also provide custom metadata:

```python
results = converter.run(
    sources=["document.pdf"],
    meta={"category": "legal", "priority": "high"},
)
```
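
The metadata fields can be used for simple reporting after a conversion run. The sketch below works on plain dictionaries that stand in for each Document's metadata, so it runs without Azure access. The `summarize` helper and the sample values are illustrative assumptions. Only the `model_id`, `page_count`, and `file_path` keys come from the list above.

```python
def summarize(metas):
    """Count documents and total pages per model_id."""
    summary = {}
    for meta in metas:
        entry = summary.setdefault(meta["model_id"], {"documents": 0, "pages": 0})
        entry["documents"] += 1
        # Treat a missing page_count as zero pages
        entry["pages"] += meta.get("page_count", 0)
    return summary


# Sample metadata dictionaries (made-up values standing in for real results)
sample_metas = [
    {"model_id": "prebuilt-layout", "page_count": 3, "file_path": "invoice.pdf", "category": "legal"},
    {"model_id": "prebuilt-layout", "page_count": 12, "file_path": "contract.docx", "category": "legal"},
    {"model_id": "prebuilt-read", "page_count": 1, "file_path": "scan.png"},
]

print(summarize(sample_metas))
# {'prebuilt-layout': {'documents': 2, 'pages': 15}, 'prebuilt-read': {'documents': 1, 'pages': 1}}
```

With real results, the same function could presumably be fed the metadata of each returned Document, for example `summarize(doc.meta for doc in documents)`.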

For more details on Azure Document Intelligence capabilities and setup, refer to the [Azure documentation](https://learn.microsoft.com/en-us/azure/ai-services/document-intelligence/quickstarts/get-started-sdks-rest-api).
