
NER OCR


A pipeline for named entity recognition (NER) using OCR

This project is developed in collaboration with the Centre for Advanced Research Computing, University College London.

About

Project Team

Mack Nixon ([email protected])

Research Software Engineering Contact

Centre for Advanced Research Computing, University College London ([email protected])

Getting Started

Prerequisites

ner-ocr requires Python 3.11–3.13.

Installation

Installing uv

uv is used for Python dependency management and managing virtual environments. You can install uv either with pipx:

pipx install uv

or with the uv installer script:

curl -LsSf https://astral.sh/uv/install.sh | sh

Installing Dependencies

Once uv is installed, install dependencies:

uv sync

Activate your Python environment

source .venv/bin/activate

Installing pre-commit hooks

If you're looking to contribute, install pre-commit locally (in your activated venv) to aid code consistency:

pre-commit install

Docker Usage

This document explains how to build and run the ner-ocr Docker image using pre-downloaded models stored on your local filesystem (or in a Trusted Research Environment, TRE).

The image:

  • Does not bake models in at build time.
  • Expects you to mount a models directory at runtime and tell it where to find:
    • PaddleOCR models
    • PaddleX models
    • Hugging Face (HF) cache (for TrOCR and Qwen)

1. Prerequisites

  • Docker installed (Docker Desktop on macOS is fine).
  • Python (optional, only needed to generate requirements.txt and download models the first time).

Project structure (simplified):

ner-ocr/
  src/
  scripts/
    entrypoint.py
  data/
    input/    # your PDFs/images go here
    output/   # pipeline writes results here
  models/
    paddle_models/    # PaddleOCR cache (optional but recommended)
    paddlex_models/   # PaddleX models (optional but recommended)
    hf_cache/         # Hugging Face hub cache (TrOCR + Qwen)
  Dockerfile
  requirements.txt

You can choose any host path for models/; using ./models is just a convenient default.


2. Preparing models (one‑time, on your host)

You need to download all required models once on your host machine, then store them under models/ so they can be mounted into the container.

2.1. Download PaddleOCR / PaddleX models

In your local Python environment (not inside Docker):

cd /path/to/ner-ocr
python - << 'PYCODE'
from paddleocr import PaddleOCR

# Match your runtime settings (lang, ocr_version, etc.)
ocr = PaddleOCR(
    use_angle_cls=True,
    lang="en",
    ocr_version="PP-OCRv5",
)
print("PaddleOCR models downloaded.")
PYCODE

This will populate:

  • ~/.paddleocr/whl
  • ~/.paddlex/official_models
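
If you want to confirm the caches landed where expected before copying them, a quick check (paths from the list above; exact cache locations can vary with the PaddleOCR/PaddleX version):

python - << 'PYCODE'
from pathlib import Path

# Cache locations from the list above; these can vary by PaddleOCR/PaddleX version.
for cache in (Path.home() / ".paddleocr" / "whl",
              Path.home() / ".paddlex" / "official_models"):
    status = "ok" if cache.is_dir() and any(cache.iterdir()) else "MISSING or empty"
    print(f"{cache}: {status}")
PYCODE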

Copy them into your project models/ directory:

mkdir -p models/paddle_models models/paddlex_models

cp -R ~/.paddleocr/whl/. models/paddle_models/
cp -R ~/.paddlex/official_models/. models/paddlex_models/

2.2. Download TrOCR and Qwen models (Hugging Face)

In the same environment:

python - << 'PYCODE'
from transformers import (
    TrOCRProcessor,
    VisionEncoderDecoderModel,
    AutoTokenizer,
    AutoModelForCausalLM,
)

# TrOCR
trocr_name = "microsoft/trocr-large-handwritten"
TrOCRProcessor.from_pretrained(trocr_name)
VisionEncoderDecoderModel.from_pretrained(trocr_name)

# Qwen models used by entity_extraction
qwen_models = [
    "Qwen/Qwen3-4B-Instruct-2507",
    # add/remove here as needed
]

for name in qwen_models:
    AutoTokenizer.from_pretrained(name)
    AutoModelForCausalLM.from_pretrained(name)

print("TrOCR + Qwen models downloaded to HF cache.")
PYCODE

This will populate ~/.cache/huggingface/hub.

Copy the HF cache into models/hf_cache:

mkdir -p models/hf_cache
cp -R ~/.cache/huggingface/hub/. models/hf_cache/
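
To sanity-check the copied cache, you can attempt an offline load against it from the project root (a minimal sketch; HF_HUB_CACHE and the offline flags are standard Hugging Face environment variables, and must be set before transformers is imported):

python - << 'PYCODE'
import os

# Point the hub cache directly at the copied directory and forbid network access.
os.environ["HF_HUB_CACHE"] = os.path.abspath("models/hf_cache")
os.environ["HF_HUB_OFFLINE"] = "1"
os.environ["TRANSFORMERS_OFFLINE"] = "1"

from transformers import TrOCRProcessor

# Raises an error if the model files are not fully present in the local cache.
TrOCRProcessor.from_pretrained("microsoft/trocr-large-handwritten")
print("HF cache looks complete.")
PYCODE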

Now your models/ tree should look like:

models/
  paddle_models/
    ... (paddle OCR files) ...
  paddlex_models/
    ... (PaddleX official_models) ...
  hf_cache/
    ... (HF hub repos: trocr, Qwen, etc.) ...

3. Build the Docker image

From the project root:

cd /path/to/ner-ocr
docker build -t ner-ocr:latest .

This uses:

FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/
COPY scripts/entrypoint.py ./entrypoint.py
COPY data/ ./data/

ENV PYTHONPATH=/app/src

ENTRYPOINT ["python", "entrypoint.py"]

The image now contains only code + dependencies; no models.
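
Since the project's dependencies are managed with uv, one way to produce the requirements.txt the Dockerfile expects is uv's export command (a suggestion; the project may generate it differently):

uv export --format requirements-txt -o requirements.txt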


4. Running the container

You must:

  • Mount your local models/ directory.
  • Mount input/output data directories.
  • Pass the model directory paths to entrypoint.py.

4.1. Local run example (macOS)

Assuming:

  • Project root: /Users/you/Projects/ner-ocr
  • Models under ./models
  • Input PDFs/images under ./data/input
  • Output dir: ./data/output

Create input/output dirs if needed:

mkdir -p data/input data/output
# copy some test PDFs/images into data/input

Run:

docker run --rm -it \
  -v "$PWD/models":/models \
  -v "$PWD/data/input":/data/input \
  -v "$PWD/data/output":/data/output \
  ner-ocr:latest \
  --paddle-models-dir /models/paddle_models \
  --paddlex-models-dir /models/paddlex_models \
  --hf-cache-dir /models/hf_cache \
  -i /data/input \
  -o /data/output

Explanation:

  • -v "$PWD/models":/models mounts your host ./models folder to /models in the container.
  • --paddle-models-dir /models/paddle_models tells the entrypoint where PaddleOCR models are.
  • --paddlex-models-dir /models/paddlex_models does the same for PaddleX.
  • --hf-cache-dir /models/hf_cache points to the HF hub cache (TrOCR + Qwen).
  • -i /data/input and -o /data/output are pipeline input/output paths inside the container, mapped to your host ./data/input and ./data/output.

The scripts/entrypoint.py script (sketched after this list) will:

  1. Optionally copy models into the expected runtime locations (e.g. /root/.paddleocr/whl, /root/.paddlex/official_models, /root/.cache/huggingface/hub), or just set env vars if you choose that approach.
  2. Set PADDLEOCR_HOME, PADDLEX_HOME, and HF_HOME.
  3. Run python -m src.pipeline with your -i and -o.
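
For reference, a minimal sketch of such an entrypoint, assuming the flag names used in the run examples above (the real scripts/entrypoint.py may differ in detail):

# entrypoint.py -- minimal sketch; flag names assumed from the run examples above
import argparse
import os
import subprocess

parser = argparse.ArgumentParser(description="ner-ocr container entrypoint")
parser.add_argument("--paddle-models-dir", required=True)
parser.add_argument("--paddlex-models-dir", required=True)
parser.add_argument("--hf-cache-dir", required=True)
parser.add_argument("-i", "--input-dir", required=True)
parser.add_argument("-o", "--output-dir", required=True)
args = parser.parse_args()

# Point the libraries at the mounted model directories (step 2 above).
os.environ["PADDLEOCR_HOME"] = args.paddle_models_dir
os.environ["PADDLEX_HOME"] = args.paddlex_models_dir
os.environ["HF_HOME"] = args.hf_cache_dir

# Hand off to the pipeline with the input/output paths (step 3 above).
subprocess.run(
    ["python", "-m", "src.pipeline", "-i", args.input_dir, "-o", args.output_dir],
    check=True,
)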

5. Running in a TRE

In a TRE you do the same thing conceptually:

  1. Ensure your models storage is available inside the container (for example, mounted at /mnt/models).
  2. Ensure input and output storage are mounted (/mnt/input, /mnt/output).
  3. Configure the container command like:
python entrypoint.py \
  --paddle-models-dir /mnt/models/paddle_models \
  --paddlex-models-dir /mnt/models/paddlex_models \
  --hf-cache-dir /mnt/models/hf_cache \
  -i /mnt/input \
  -o /mnt/output

No rebuild is required when models change; you just update the mounted models directory.

Note: for GPU builds, comment out paddlepaddle and torch in requirements.txt and install the GPU-enabled versions manually in the Dockerfile.

6. Notes

  • If you run out of memory (SIGKILL / exit code 137), reduce model sizes (e.g. use microsoft/trocr-base-handwritten instead of the large model) or increase Docker memory (Docker Desktop → Settings → Resources).
  • You can also skip copying models at startup and mount them directly into the standard locations (/root/.paddleocr, /root/.paddlex, /root/.cache/huggingface), as long as the directory structures match what the libraries expect; see the example after this list.
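
For example, the local run from section 4.1 could instead mount into the standard locations (a sketch; whether the model-dir flags can then be dropped depends on how entrypoint.py treats them):

docker run --rm -it \
  -v "$PWD/models/paddle_models":/root/.paddleocr/whl \
  -v "$PWD/models/paddlex_models":/root/.paddlex/official_models \
  -v "$PWD/models/hf_cache":/root/.cache/huggingface/hub \
  -v "$PWD/data/input":/data/input \
  -v "$PWD/data/output":/data/output \
  ner-ocr:latest \
  -i /data/input \
  -o /data/output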

To build an amd64 image (e.g. for transfer into a TRE) and move it as a compressed archive:

docker build --platform linux/amd64 -t ner-ocr:amd64 .
docker save ner-ocr:amd64 | gzip > ner-ocr-amd64.tar.gz

# on the target machine
gzip -dc ner-ocr-amd64.tar.gz | docker load
