Alfresco Transformer: Docling-based PDF/Office/HTML/Image ➜ Markdown & JSON + md2pdf-based Markdown ➜ PDF
This project provides an Alfresco Content Services (ACS) transformer that converts a wide variety of document types into Markdown and JSON using Docling, and Markdown into PDF using md2pdf. Both are lightweight document conversion libraries licensed under the MIT License.
The transformer runs inside a Docker container and can be integrated into Alfresco as a local transformer.
Docling supports the following source formats, each of which can be converted into text/markdown, text/x-markdown, or application/json:
- PDF (application/pdf)
- Word (.docx) (application/vnd.openxmlformats-officedocument.wordprocessingml.document)
- Excel (.xlsx) (application/vnd.openxmlformats-officedocument.spreadsheetml.sheet)
- PowerPoint (.pptx) (application/vnd.openxmlformats-officedocument.presentationml.presentation)
- Markdown (text/markdown)
- AsciiDoc (text/asciidoc)
- HTML (text/html, application/xhtml+xml)
- CSV (text/csv)
- Docling JSON (application/vnd.docling+json)
- PNG (image/png)
- JPEG (image/jpeg)
- TIFF (image/tiff)
- BMP (image/bmp)
- JATS XML (application/vnd.jats+xml)
- USPTO XML (application/vnd.uspto+xml)
md2pdf handles the Markdown to PDF direction:
text/markdown => application/pdf
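The supported transformations listed above can be captured in a small lookup table, for example to check a source/target pair on the client side before submitting a file. This is only a sketch: the names `DOCLING_SOURCES`, `DOCLING_TARGETS` and `is_supported` are hypothetical and are not part of the transformer itself.

```python
# Source MIME types listed in this README as supported by Docling.
DOCLING_SOURCES = {
    "application/pdf",
    "application/vnd.openxmlformats-officedocument.wordprocessingml.document",
    "application/vnd.openxmlformats-officedocument.spreadsheetml.sheet",
    "application/vnd.openxmlformats-officedocument.presentationml.presentation",
    "text/markdown",
    "text/asciidoc",
    "text/html",
    "application/xhtml+xml",
    "text/csv",
    "application/vnd.docling+json",
    "image/png",
    "image/jpeg",
    "image/tiff",
    "image/bmp",
    "application/vnd.jats+xml",
    "application/vnd.uspto+xml",
}

# Target MIME types produced by the Docling-based transformations.
DOCLING_TARGETS = {"text/markdown", "text/x-markdown", "application/json"}


def is_supported(source_mimetype: str, target_mimetype: str) -> bool:
    """Return True if this README lists the source/target pair as supported."""
    if source_mimetype in DOCLING_SOURCES and target_mimetype in DOCLING_TARGETS:
        return True
    # md2pdf covers the single Markdown to PDF direction.
    return source_mimetype == "text/markdown" and target_mimetype == "application/pdf"
```

A caller could run `is_supported("image/png", "application/json")` before uploading and skip the request when it returns `False`.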
To create the transformer Docker image, run:
./run.sh build
This uses alfresco-base-java and installs Python 3 and the docling package via pip.
To run the image:
./run.sh start
- Port 8090 is for transformations
- Port 8099 is for debugging
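After starting the container it can be useful to wait until the service answers before sending transformations. The sketch below polls the `/live` path on port 8090, the same URL used by the healthcheck in the docker-compose example in this README. The function names, the retry count and the delay are assumptions chosen for illustration.

```python
import time
import urllib.error
import urllib.request


def is_live(url="http://localhost:8090/live", timeout=5.0):
    """Return True when the probe URL answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or an HTTP error status.
        return False


def wait_until_live(probe=is_live, attempts=5, delay=2.0, sleep=time.sleep):
    """Call probe up to `attempts` times, pausing `delay` seconds between tries."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

The `probe` and `sleep` parameters are injectable so the retry logic can be exercised without a running container.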
Example request to transform a PDF file to Markdown:
curl --location --request POST 'http://localhost:8090/transform' \
--form 'file=@"/path/to/sample.pdf"' \
--form 'sourceMimetype="application/pdf"' \
--form 'targetMimetype="text/markdown"'
You can declare the Docker service as follows in a docker-compose.yml file:
becpg-transform-markdown:
  image: becpg-transform-markdown:1.0.0
  ports:
    - "8090:8090"
  environment:
    - SERVER_PORT=8090
  healthcheck:
    test: ["CMD", "curl", "-f", "http://localhost:8090/live"]
    interval: 30s
    timeout: 10s
    retries: 5
Add the following JVM property to your Alfresco instance:
-DlocalTransform.becpg-transform-markdown.url=http://localhost:8090/
This allows Alfresco to discover and use the transformer.
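For scripted use outside Alfresco, the curl request shown earlier can be reproduced with Python's standard library. The sketch below sends the same three form fields (`file`, `sourceMimetype`, `targetMimetype`) to the `/transform` endpoint. The helper names `build_multipart` and `transform` are hypothetical, and the `application/octet-stream` part type is an assumption rather than a documented requirement of the transformer.

```python
import os
import urllib.request
import uuid


def build_multipart(fields, file_name, file_bytes, boundary=None):
    """Encode text fields plus one file part as multipart/form-data."""
    boundary = boundary or uuid.uuid4().hex
    parts = []
    for name, value in fields.items():
        parts.append((
            f"--{boundary}\r\n"
            f'Content-Disposition: form-data; name="{name}"\r\n\r\n'
            f"{value}\r\n"
        ).encode())
    # The file part uses the field name "file", as in the curl example.
    parts.append((
        f"--{boundary}\r\n"
        f'Content-Disposition: form-data; name="file"; filename="{file_name}"\r\n'
        "Content-Type: application/octet-stream\r\n\r\n"
    ).encode())
    parts.append(file_bytes)
    parts.append(f"\r\n--{boundary}--\r\n".encode())
    return b"".join(parts), f"multipart/form-data; boundary={boundary}"


def transform(path, source_mimetype, target_mimetype,
              base_url="http://localhost:8090"):
    """POST a local file to the transformer and return the response body."""
    with open(path, "rb") as handle:
        data = handle.read()
    body, content_type = build_multipart(
        {"sourceMimetype": source_mimetype, "targetMimetype": target_mimetype},
        os.path.basename(path),
        data,
    )
    request = urllib.request.Request(
        f"{base_url}/transform",
        data=body,
        headers={"Content-Type": content_type},
        method="POST",
    )
    with urllib.request.urlopen(request) as response:
        return response.read()
```

With the container running, `transform("/path/to/sample.pdf", "application/pdf", "text/markdown")` would return the response body as bytes, which the caller can decode or write to disk.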
- This project uses Docling and md2pdf, licensed under the MIT License.
- Base image from Alfresco Docker Base Java
- Docling and md2pdf — for doing the heavy lifting
- Alfresco — for the open content services platform
- Community contributors and open-source maintainers
Feel free to fork, contribute, or open issues. Happy transforming!