Commit 0d206c7

docs(mistral): ocr examples (#370)
* docs(mistral): ocr examples
* docs(mistral): better comment example
1 parent a6a82fb commit 0d206c7

File tree

1 file changed: +93 −1 lines changed


integrations/mistral.md

Lines changed: 93 additions & 1 deletion
@@ -40,7 +40,8 @@ pip install mistral-haystack

## Usage

### Components

This integration introduces 4 components:

- The `MistralOCRDocumentConverter`: Extracts text from documents using Mistral's OCR API, with optional structured annotations for image regions and full documents.
- The [`MistralDocumentEmbedder`](https://docs.haystack.deepset.ai/docs/mistraldocumentembedder): Creates embeddings for Haystack Documents using Mistral embedding models (currently only `mistral-embed`).
- The [`MistralTextEmbedder`](https://docs.haystack.deepset.ai/docs/mistraltextembedder): Creates embeddings for texts (such as queries) using Mistral embedding models (currently only `mistral-embed`).
- The [`MistralChatGenerator`](https://docs.haystack.deepset.ai/docs/mistralchatgenerator): Uses Mistral chat completion models such as `mistral-tiny` (default).
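The two embedder components exist so that queries and stored Documents land in the same vector space and can be compared by similarity. As a toy, stdlib-only sketch of that idea (hashed bag-of-words as a stand-in for `mistral-embed`; this is not how the real model works):

```python
import math
import zlib
from collections import Counter

def toy_embed(text, dim=16):
    """Toy deterministic embedding: hashed bag-of-words, L2-normalized."""
    vec = [0.0] * dim
    for word, count in Counter(text.lower().split()).items():
        vec[zlib.crc32(word.encode()) % dim] += count
    norm = math.sqrt(sum(v * v for v in vec)) or 1.0
    return [v / norm for v in vec]

def cosine(a, b):
    """Dot product of unit-norm vectors equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))

# Documents are embedded once at indexing time; the query at search time.
doc_vec = toy_embed("Mistral provides OCR, embedding, and chat models")
query_vec = toy_embed("chat models")
print(round(cosine(query_vec, doc_vec), 3))  # positive: the query overlaps the document
```

The same split applies here: `MistralDocumentEmbedder` plays the indexing-time role, `MistralTextEmbedder` the query-time role.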
@@ -88,6 +89,97 @@ response = client.run(
print(response)
```

### Use Mistral OCR for Document Conversion

The `MistralOCRDocumentConverter` extracts text from documents (PDFs, images) using Mistral's OCR API. It supports multiple source types and can optionally enrich the output with structured annotations.

#### OCR with Embeddings Pipeline

Extract text from documents using OCR, split them by pages, create embeddings, and store them in a document store:

```python
import os

from haystack import Pipeline
from haystack.components.preprocessors import DocumentSplitter
from haystack.components.writers import DocumentWriter
from haystack.document_stores.in_memory import InMemoryDocumentStore
from haystack_integrations.components.converters.mistral import MistralOCRDocumentConverter
from haystack_integrations.components.embedders.mistral import MistralDocumentEmbedder
from mistralai.models import DocumentURLChunk

os.environ["MISTRAL_API_KEY"] = "YOUR_MISTRAL_API_KEY"

# Initialize document store
document_store = InMemoryDocumentStore()

# Create indexing pipeline
indexing_pipeline = Pipeline()
indexing_pipeline.add_component("converter", MistralOCRDocumentConverter())
indexing_pipeline.add_component(
    "splitter",
    DocumentSplitter(split_by="page", split_length=2, split_overlap=1)
)
indexing_pipeline.add_component("embedder", MistralDocumentEmbedder())
indexing_pipeline.add_component("writer", DocumentWriter(document_store=document_store))

# Connect components
indexing_pipeline.connect("converter.documents", "splitter.documents")
indexing_pipeline.connect("splitter.documents", "embedder.documents")
indexing_pipeline.connect("embedder.documents", "writer.documents")

# Process documents
sources = [
    DocumentURLChunk(document_url="https://arxiv.org/pdf/1706.03762"),
    "./invoice.pdf",  # Local PDF file
]

result = indexing_pipeline.run({"converter": {"sources": sources}})

print(f"Indexed {len(document_store.filter_documents())} documents")
```
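As a rough illustration of the splitter settings used above: `split_length=2` with `split_overlap=1` yields two-page windows that advance by one page at a time. A stdlib-only sketch of that windowing (an approximation for intuition, not Haystack's actual `DocumentSplitter` implementation):

```python
def split_pages(pages, split_length=2, split_overlap=1):
    """Group pages into overlapping windows; the step is split_length - split_overlap."""
    step = split_length - split_overlap
    chunks = []
    for start in range(0, len(pages), step):
        chunks.append(pages[start:start + split_length])
        if start + split_length >= len(pages):
            break
    return chunks

print(split_pages(["page1", "page2", "page3", "page4"]))
# [['page1', 'page2'], ['page2', 'page3'], ['page3', 'page4']]
```

Each window repeats the last page of the previous one, which helps preserve context that straddles a page boundary at retrieval time.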
140+
141+
#### OCR with Structured Annotations
142+
143+
Define Pydantic schemas to extract structured information from images and documents:
144+
145+
```python
146+
import os
147+
from typing import List
148+
from haystack_integrations.components.converters.mistral import MistralOCRDocumentConverter
149+
from mistralai.models import DocumentURLChunk
150+
from pydantic import BaseModel, Field
151+
152+
os.environ["MISTRAL_API_KEY"] = "YOUR_MISTRAL_API_KEY"
153+
154+
# Define schema for image annotations (applied to each image/bbox)
155+
class ImageAnnotation(BaseModel):
156+
image_type: str = Field(..., description="Type of image (diagram, chart, photo, etc.)")
157+
description: str = Field(..., description="Brief description of the image content")
158+
159+
# Define schema for document-level annotations
160+
class DocumentAnnotation(BaseModel):
161+
topics: List[str] = Field(..., description="Main topics covered")
162+
urls: List[str] = Field(..., description="URLs found in the document")
163+
164+
converter = MistralOCRDocumentConverter()
165+
166+
sources = ["./financial_report.pdf"]
167+
168+
result = converter.run(
169+
sources=sources,
170+
bbox_annotation_schema=ImageAnnotation,
171+
document_annotation_schema=DocumentAnnotation,
172+
)
173+
174+
# Documents now include enriched content and metadata
175+
doc = result["documents"][0]
176+
print(doc.content) # Markdown with image annotations inline
177+
print(doc.meta["source_topics"]) # e.g., ["finance", "quarterly report", "revenue", "expenses", "performance"]
178+
print(doc.meta["source_urls"]) # e.g., ["https://example.com", ...]
179+
```
180+
181+
For a complete example with structured annotations in a pipeline, see the [OCR indexing pipeline example](https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/mistral/examples/indexing_ocr_pipeline.py).
182+
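The annotation schemas describe the JSON the OCR model is asked to produce. A stdlib-only sketch of checking such a payload before trusting it (the payload below is hypothetical; the real response shape comes from the Mistral OCR API):

```python
import json

def validate_document_annotation(data):
    """Check the payload carries the fields DocumentAnnotation declares, as lists."""
    required = {"topics": list, "urls": list}
    for field, expected in required.items():
        if field not in data:
            raise ValueError(f"missing field: {field}")
        if not isinstance(data[field], expected):
            raise ValueError(f"field {field!r} should be a {expected.__name__}")
    return data

# Hypothetical payload mirroring the DocumentAnnotation schema
payload = json.loads('{"topics": ["finance", "revenue"], "urls": ["https://example.com"]}')
annotation = validate_document_annotation(payload)
print(annotation["topics"])  # ['finance', 'revenue']
```

In practice Pydantic performs this validation for you; the sketch only shows what the schema buys compared to trusting raw model output.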
### Use a Mistral Embedding Model

Use the `MistralDocumentEmbedder` in an indexing pipeline:
