A pipeline for NER using OCR
This project is developed in collaboration with the Centre for Advanced Research Computing, University College London.
Mack Nixon ([email protected])
Centre for Advanced Research Computing, University College London ([email protected])
ner-ocr requires Python 3.11–3.13.
uv is used for Python dependency management and for managing virtual environments. You can install uv either with pipx or via the uv installer script:
```sh
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Once uv is installed, install dependencies and activate the virtual environment:
```sh
uv sync
source .venv/bin/activate
```

If you're looking to contribute, install pre-commit locally (in your activated venv) to aid code consistency:
```sh
pre-commit install
```

This document explains how to build and run the ner-ocr Docker image using pre-downloaded models stored on your local filesystem (or in a TRE).
The image:
- Does not bake models in at build time.
- Expects you to mount a models directory at runtime and tell it where to find:
  - PaddleOCR models
  - PaddleX models
  - the Hugging Face (HF) cache (for TrOCR and Qwen)
You will need:

- Docker installed (Docker Desktop on macOS is fine).
- Python (optional; only needed to generate `requirements.txt` and download models the first time).
Project structure (simplified):
```
ner-ocr/
  src/
  scripts/
    entrypoint.py
  data/
    input/             # your PDFs/images go here
    output/            # pipeline writes results here
  models/
    paddle_models/     # PaddleOCR cache (optional but recommended)
    paddlex_models/    # PaddleX models (optional but recommended)
    hf_cache/          # Hugging Face hub cache (TrOCR + Qwen)
  Dockerfile
  requirements.txt
```
You can choose any host path for `models/`; using `./models` is just a convenient default.
You need to download all required models once on your host machine, then store them under `models/` so they can be mounted into the container.
In your local Python environment (not inside Docker):
```sh
cd /path/to/ner-ocr
python - << 'PYCODE'
from paddleocr import PaddleOCR

# Match your runtime settings (lang, ocr_version, etc.)
ocr = PaddleOCR(
    use_angle_cls=True,
    lang="en",
    ocr_version="PP-OCRv5",
)
print("PaddleOCR models downloaded.")
PYCODE
```

This will populate:
- `~/.paddleocr/whl`
- `~/.paddlex/official_models`
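If you want to sanity-check the caches before copying them, here is a quick sketch using only the standard library (the exact file layout inside each cache varies by PaddleOCR/PaddleX version):

```python
# Sketch: confirm the Paddle caches exist and are non-empty before copying.
from pathlib import Path

for cache in (
    Path.home() / ".paddleocr" / "whl",
    Path.home() / ".paddlex" / "official_models",
):
    n_entries = sum(1 for _ in cache.rglob("*")) if cache.exists() else 0
    print(cache, "->", n_entries, "entries")
```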
Copy them into your project models/ directory:
```sh
mkdir -p models/paddle_models models/paddlex_models
cp -R ~/.paddleocr/whl/. models/paddle_models/
cp -R ~/.paddlex/official_models/. models/paddlex_models/
```

In the same environment, download the Hugging Face models:
```sh
python - << 'PYCODE'
from transformers import (
    TrOCRProcessor,
    VisionEncoderDecoderModel,
    AutoTokenizer,
    AutoModelForCausalLM,
)

# TrOCR
trocr_name = "microsoft/trocr-large-handwritten"
TrOCRProcessor.from_pretrained(trocr_name)
VisionEncoderDecoderModel.from_pretrained(trocr_name)

# Qwen models used by entity_extraction
qwen_models = [
    "Qwen/Qwen3-4B-Instruct-2507",
    # add/remove here as needed
]
for name in qwen_models:
    AutoTokenizer.from_pretrained(name)
    AutoModelForCausalLM.from_pretrained(name)

print("TrOCR + Qwen models downloaded to HF cache.")
PYCODE
```

This will populate `~/.cache/huggingface/hub`.
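Optionally, you can inspect what landed in the cache with `huggingface_hub` (installed alongside transformers); a small sketch:

```python
# Sketch: list the repos now present in the local HF hub cache.
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()  # scans ~/.cache/huggingface/hub by default
for repo in sorted(cache_info.repos, key=lambda r: r.repo_id):
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.2f} GB")
```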
Copy the HF cache into models/hf_cache:
```sh
mkdir -p models/hf_cache
cp -R ~/.cache/huggingface/hub/. models/hf_cache/
```

Now your `models/` tree should look like:
```
models/
  paddle_models/
    ... (PaddleOCR files) ...
  paddlex_models/
    ... (PaddleX official_models) ...
  hf_cache/
    ... (HF hub repos: trocr, Qwen, etc.) ...
```
From the project root:
```sh
cd /path/to/ner-ocr
docker build -t ner-ocr:latest .
```

This uses the following Dockerfile:
```dockerfile
FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/
COPY scripts/entrypoint.py ./entrypoint.py
COPY data/ ./data/

ENV PYTHONPATH=/app/src

ENTRYPOINT ["python", "entrypoint.py"]
```

The image now contains only code and dependencies; no models.
You must:

- Mount your local `models/` directory.
- Mount input/output data directories.
- Pass the model directory paths to `entrypoint.py`.
Assuming:

- Project root: `/Users/you/Projects/ner-ocr`
- Models under `./models`
- Input PDFs/images under `./data/input`
- Output dir: `./data/output`
Create input/output dirs if needed:
```sh
mkdir -p data/input data/output
# copy some test PDFs/images into data/input
```

Run:
```sh
docker run --rm -it \
  -v "$PWD/models":/models \
  -v "$PWD/data/input":/data/input \
  -v "$PWD/data/output":/data/output \
  ner-ocr:latest \
  --paddle-models-dir /models/paddle_models \
  --paddlex-models-dir /models/paddlex_models \
  --hf-cache-dir /models/hf_cache \
  -i /data/input \
  -o /data/output
```

Explanation:
-v "$PWD/models":/modelsmounts your host./modelsfolder to/modelsin the container.--paddle-models-dir /models/paddle_modelstells the entrypoint where PaddleOCR models are.--paddlex-models-dir /models/paddlex_modelsdoes the same for PaddleX.--hf-cache-dir /models/hf_cachepoints to the HF hub cache (TrOCR + Qwen).-i /data/inputand-o /data/outputare pipeline input/output paths inside the container, mapped to your host./data/inputand./data/output.
The `scripts/entrypoint.py` script will:

- Optionally copy models into the expected runtime locations (e.g. `/root/.paddleocr/whl`, `/root/.paddlex/official_models`, `/root/.cache/huggingface/hub`), or just set env vars if you choose that approach (see the sketch after this list).
- Set `PADDLEOCR_HOME`, `PADDLEX_HOME`, and `HF_HOME`.
- Run `python -m src.pipeline` with your `-i` and `-o`.
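As a rough illustration of the env-var approach, a minimal sketch of such an entrypoint (the actual `scripts/entrypoint.py` may differ in its details):

```python
# Sketch of a container entrypoint that points the libraries at mounted
# model directories and then runs the pipeline. Illustrative only.
import argparse
import os
import subprocess
import sys


def main() -> None:
    parser = argparse.ArgumentParser(description="ner-ocr entrypoint (sketch)")
    parser.add_argument("--paddle-models-dir", required=True)
    parser.add_argument("--paddlex-models-dir", required=True)
    parser.add_argument("--hf-cache-dir", required=True)
    parser.add_argument("-i", "--input-dir", required=True)
    parser.add_argument("-o", "--output-dir", required=True)
    args = parser.parse_args()

    # Point the libraries at the mounted model directories instead of
    # copying anything. Depending on how hf_cache was populated, you may
    # need HF_HUB_CACHE (hub contents) rather than HF_HOME (its parent).
    os.environ["PADDLEOCR_HOME"] = args.paddle_models_dir
    os.environ["PADDLEX_HOME"] = args.paddlex_models_dir
    os.environ["HF_HOME"] = args.hf_cache_dir

    # Run the pipeline module, forwarding the input/output paths.
    result = subprocess.run(
        [sys.executable, "-m", "src.pipeline",
         "-i", args.input_dir, "-o", args.output_dir],
    )
    sys.exit(result.returncode)


if __name__ == "__main__":
    main()
```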
In a TRE you do the same thing conceptually:

- Ensure your models storage is available inside the container (for example, mounted at `/mnt/models`).
- Ensure input and output storage are mounted (`/mnt/input`, `/mnt/output`).
- Configure the container command like:
```sh
python entrypoint.py \
  --paddle-models-dir /mnt/models/paddle_models \
  --paddlex-models-dir /mnt/models/paddlex_models \
  --hf-cache-dir /mnt/models/hf_cache \
  -i /mnt/input \
  -o /mnt/output
```

No rebuild is required when models change; you just update the mounted models directory.
Note: `paddlepaddle` and `torch` need to be commented out in `requirements.txt` so that the GPU-enabled versions can be installed manually in the Dockerfile.
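Once the image is built with the GPU-enabled packages, a quick sanity check (run inside the container) that both frameworks actually see CUDA; a sketch:

```python
# Sketch: verify the GPU-enabled builds of torch and paddlepaddle.
import paddle
import torch

print("torch CUDA available:", torch.cuda.is_available())
print("paddle compiled with CUDA:", paddle.device.is_compiled_with_cuda())
```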
- If you run out of memory (SIGKILL / exit code 137), reduce model sizes (e.g. use `microsoft/trocr-base-handwritten` instead of the large model) or increase Docker memory (Docker Desktop → Settings → Resources).
- You can also skip copying models at startup and mount them directly into the standard locations (`/root/.paddleocr`, `/root/.paddlex`, `/root/.cache/huggingface`) if you prefer, as long as the directory structures match what the libraries expect.
To build an amd64 image on another architecture and move it as a tarball (e.g. into a TRE):

```sh
docker build --platform linux/amd64 -t ner-ocr:amd64 .
docker save ner-ocr:amd64 | gzip > ner-ocr-amd64.tar.gz
gzip -dc ner-ocr-amd64.tar.gz | docker load
```