integrations/mistral.md (93 additions & 1 deletion)
@@ -40,7 +40,8 @@ pip install mistral-haystack
 ## Usage
 ### Components
-This instegration introduces 3 components:
+This integration introduces 4 components:
+- The `MistralOCRDocumentConverter`: Extracts text from documents using Mistral's OCR API, with optional structured annotations for image regions and full documents.
 - The [`MistralDocumentEmbedder`](https://docs.haystack.deepset.ai/docs/mistraldocumentembedder): Creates embeddings for Haystack Documents using Mistral embedding models (currently only `mistral-embed`).
 - The [`MistralTextEmbedder`](https://docs.haystack.deepset.ai/docs/mistraltextembedder): Creates embeddings for texts (such as queries) using Mistral embedding models (currently only `mistral-embed`)
 - The [`MistralChatGenerator`](https://docs.haystack.deepset.ai/docs/mistralchatgenerator): Uses Mistral chat completion models such as `mistral-tiny` (default).
@@ -88,6 +89,97 @@ response = client.run(
 print(response)
 ```

+### Use Mistral OCR for Document Conversion
+
+The `MistralOCRDocumentConverter` extracts text from documents (PDFs, images) using Mistral's OCR API. It supports multiple source types and can optionally enrich the output with structured annotations.
+
+#### OCR with Embeddings Pipeline
+
+Extract text from documents using OCR, split by pages, create embeddings, and store them in a document store:
+
+```python
+import os
+from haystack import Pipeline
+from haystack.components.preprocessors import DocumentSplitter
+from haystack.components.writers import DocumentWriter
+from haystack.document_stores.in_memory import InMemoryDocumentStore
+from haystack_integrations.components.converters.mistral import MistralOCRDocumentConverter
+from haystack_integrations.components.embedders.mistral import MistralDocumentEmbedder
+For a complete example with structured annotations in a pipeline, see the [OCR indexing pipeline example](https://github.com/deepset-ai/haystack-core-integrations/blob/main/integrations/mistral/examples/indexing_ocr_pipeline.py).
+

 ### Use a Mistral Embedding Model

 Use the `MistralDocumentEmbedder` in an indexing pipeline:
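The "split by pages" step in the OCR pipeline added by this change depends on page boundaries surviving in the converter's text output. The sketch below illustrates that step in plain Python. It assumes the OCR text joins pages with form feed characters (`\f`), which is the separator Haystack's `DocumentSplitter` uses for `split_by="page"`. The helper `split_by_page`, its `pages_per_chunk` parameter, and the sample text are hypothetical and are not part of Haystack or the Mistral integration.

```python
from typing import List


def split_by_page(text: str, pages_per_chunk: int = 1) -> List[str]:
    """Group form-feed-separated pages into chunks of `pages_per_chunk` pages."""
    if pages_per_chunk < 1:
        raise ValueError("pages_per_chunk must be at least 1")
    pages = text.split("\f")
    chunks = []
    for start in range(0, len(pages), pages_per_chunk):
        chunk = "\f".join(pages[start:start + pages_per_chunk])
        if chunk.strip():  # skip chunks that contain only whitespace
            chunks.append(chunk)
    return chunks


# Hypothetical OCR output: three pages joined by form feeds.
ocr_text = "Invoice 001\fLine items\fTotals"
for number, chunk in enumerate(split_by_page(ocr_text), start=1):
    print(number, repr(chunk))
```

In the actual pipeline this role is played by `DocumentSplitter`; the sketch only shows why a page-level separator in the OCR output matters for the later embedding step.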