A pipeline for NER using OCR
This project is developed in collaboration with the Centre for Advanced Research Computing, University College London.
Mack Nixon ([email protected])
Centre for Advanced Research Computing, University College London ([email protected])
ner-ocr requires Python 3.11–3.13.
uv is used for Python dependency management and for managing virtual environments. You can install uv either with pipx or via the uv installer script:
```sh
curl -LsSf https://astral.sh/uv/install.sh | sh
```

Once uv is installed, install dependencies and activate the virtual environment:
```sh
uv sync
source .venv/bin/activate
```

If you're looking to contribute, install pre-commit locally (in your activated venv) to aid code consistency:
```sh
pre-commit install
```

This document explains how to build and run the ner-ocr Docker image using pre-downloaded models stored on your local filesystem (or in a TRE).
The image:
- Does not bake models in at build time.
- Expects you to mount a models directory at runtime and tell it where to find:
  - PaddleOCR models
  - PaddleX models
  - the Hugging Face (HF) cache (for TrOCR and Qwen)
You will need:

- Docker installed (Docker Desktop on macOS is fine).
- Python (optional; only needed to generate `requirements.txt` and download models the first time).
Project structure (simplified):
```
ner-ocr/
  src/
  scripts/
    entrypoint.py
  data/
    input/             # your PDFs/images go here
    output/            # pipeline writes results here
  models/
    paddle_models/     # PaddleOCR cache (optional but recommended)
    paddlex_models/    # PaddleX models (optional but recommended)
    hf_cache/          # Hugging Face hub cache (TrOCR + Qwen)
  Dockerfile
  requirements.txt
```
You can choose any host path for `models/`; using `./models` is just a convenient default.
You need to download all required models once on your host machine, then store them under `models/` so they can be mounted into the container.
In your local Python environment (not inside Docker):
```sh
cd /path/to/ner-ocr
python - << 'PYCODE'
from paddleocr import PaddleOCR

# Match your runtime settings (lang, ocr_version, etc.)
ocr = PaddleOCR(
    use_angle_cls=True,
    lang="en",
    ocr_version="PP-OCRv5",
)
print("PaddleOCR models downloaded.")
PYCODE
```

This will populate:
- `~/.paddleocr/whl`
- `~/.paddlex/official_models`
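If you want to sanity-check the caches before copying them, here is a quick sketch using only the standard library (the exact file layout inside each cache varies by PaddleOCR/PaddleX version):

```python
# Sketch: confirm the Paddle caches exist and are non-empty before copying.
from pathlib import Path

for cache in (
    Path.home() / ".paddleocr" / "whl",
    Path.home() / ".paddlex" / "official_models",
):
    n_entries = sum(1 for _ in cache.rglob("*")) if cache.exists() else 0
    print(cache, "->", n_entries, "entries")
```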
Copy them into your project models/ directory:
```sh
mkdir -p models/paddle_models models/paddlex_models
cp -R ~/.paddleocr/whl/. models/paddle_models/
cp -R ~/.paddlex/official_models/. models/paddlex_models/
```

In the same environment, download the Hugging Face models:
```sh
python - << 'PYCODE'
from transformers import (
    TrOCRProcessor,
    VisionEncoderDecoderModel,
    AutoTokenizer,
    AutoModelForCausalLM,
)

# TrOCR
trocr_name = "microsoft/trocr-large-handwritten"
TrOCRProcessor.from_pretrained(trocr_name)
VisionEncoderDecoderModel.from_pretrained(trocr_name)

# Qwen models used by entity_extraction
qwen_models = [
    "Qwen/Qwen3-4B-Instruct-2507",
    # add/remove here as needed
]
for name in qwen_models:
    AutoTokenizer.from_pretrained(name)
    AutoModelForCausalLM.from_pretrained(name)

print("TrOCR + Qwen models downloaded to HF cache.")
PYCODE
```

This will populate `~/.cache/huggingface/hub`.
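Optionally, you can inspect what landed in the cache with `huggingface_hub` (installed alongside transformers); a small sketch:

```python
# Sketch: list the repos now present in the local HF hub cache.
from huggingface_hub import scan_cache_dir

cache_info = scan_cache_dir()  # scans ~/.cache/huggingface/hub by default
for repo in sorted(cache_info.repos, key=lambda r: r.repo_id):
    print(f"{repo.repo_id}: {repo.size_on_disk / 1e9:.2f} GB")
```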
Copy the HF cache into models/hf_cache:
```sh
mkdir -p models/hf_cache
cp -R ~/.cache/huggingface/hub/. models/hf_cache/
```

Now your `models/` tree should look like:
```
models/
  paddle_models/
    ... (PaddleOCR files) ...
  paddlex_models/
    ... (PaddleX official_models) ...
  hf_cache/
    ... (HF hub repos: trocr, Qwen, etc.) ...
```
From the project root:
```sh
cd /path/to/ner-ocr
docker build -t ner-ocr:latest .
```

This uses the following Dockerfile:
```dockerfile
FROM python:3.12-slim

WORKDIR /app

COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt

COPY src/ ./src/
COPY scripts/entrypoint.py ./entrypoint.py
COPY data/ ./data/

ENV PYTHONPATH=/app/src

ENTRYPOINT ["python", "entrypoint.py"]
```

The image now contains only code and dependencies; no models.
You must:

- Mount your local `models/` directory.
- Mount input/output data directories.
- Pass the model directory paths to `entrypoint.py`.
Assuming:

- Project root: `/Users/you/Projects/ner-ocr`
- Models under `./models`
- Input PDFs/images under `./data/input`
- Output dir: `./data/output`
Create input/output dirs if needed:
```sh
mkdir -p data/input data/output
# copy some test PDFs/images into data/input
```

Run:
```sh
docker run --rm -it \
  -v "$PWD/models":/models \
  -v "$PWD/data/input":/data/input \
  -v "$PWD/data/output":/data/output \
  ner-ocr:latest \
  --paddle-models-dir /models/paddle_models \
  --paddlex-models-dir /models/paddlex_models \
  --hf-cache-dir /models/hf_cache \
  -i /data/input \
  -o /data/output
```

Explanation:
-v "$PWD/models":/modelsmounts your host./modelsfolder to/modelsin the container.--paddle-models-dir /models/paddle_modelstells the entrypoint where PaddleOCR models are.--paddlex-models-dir /models/paddlex_modelsdoes the same for PaddleX.--hf-cache-dir /models/hf_cachepoints to the HF hub cache (TrOCR + Qwen).-i /data/inputand-o /data/outputare pipeline input/output paths inside the container, mapped to your host./data/inputand./data/output.
The `scripts/entrypoint.py` script will:

- Optionally copy models into the expected runtime locations (e.g. `/root/.paddleocr/whl`, `/root/.paddlex/official_models`, `/root/.cache/huggingface/hub`), or just set env vars if you choose that approach (see the sketch after this list).
- Set `PADDLEOCR_HOME`, `PADDLEX_HOME`, and `HF_HOME`.
- Run `python -m src.pipeline` with your `-i` and `-o`.
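As a rough illustration of the env-var approach, a minimal sketch of such an entrypoint (the actual `scripts/entrypoint.py` may differ in its details):

```python
# Sketch of a container entrypoint that points the libraries at mounted
# model directories and then runs the pipeline. Illustrative only.
import argparse
import os
import subprocess
import sys


def main() -> None:
    parser = argparse.ArgumentParser(description="ner-ocr entrypoint (sketch)")
    parser.add_argument("--paddle-models-dir", required=True)
    parser.add_argument("--paddlex-models-dir", required=True)
    parser.add_argument("--hf-cache-dir", required=True)
    parser.add_argument("-i", "--input-dir", required=True)
    parser.add_argument("-o", "--output-dir", required=True)
    args = parser.parse_args()

    # Point the libraries at the mounted model directories instead of
    # copying anything. Depending on how hf_cache was populated, you may
    # need HF_HUB_CACHE (hub contents) rather than HF_HOME (its parent).
    os.environ["PADDLEOCR_HOME"] = args.paddle_models_dir
    os.environ["PADDLEX_HOME"] = args.paddlex_models_dir
    os.environ["HF_HOME"] = args.hf_cache_dir

    # Run the pipeline module, forwarding the input/output paths.
    result = subprocess.run(
        [sys.executable, "-m", "src.pipeline",
         "-i", args.input_dir, "-o", args.output_dir],
    )
    sys.exit(result.returncode)


if __name__ == "__main__":
    main()
```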
In a TRE you do the same thing conceptually:

- Ensure your models storage is available inside the container (for example, mounted at `/mnt/models`).
- Ensure input and output storage are mounted (`/mnt/input`, `/mnt/output`).
- Configure the container command like:
```sh
python entrypoint.py \
  --paddle-models-dir /mnt/models/paddle_models \
  --paddlex-models-dir /mnt/models/paddlex_models \
  --hf-cache-dir /mnt/models/hf_cache \
  -i /mnt/input \
  -o /mnt/output
```

No rebuild is required when models change; you just update the mounted models directory.
Note: `paddlepaddle` and `torch` need to be commented out in `requirements.txt` so that the GPU-enabled versions can be installed manually in the Dockerfile.
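Once the image is built with the GPU-enabled packages, a quick sanity check (run inside the container) that both frameworks actually see CUDA; a sketch:

```python
# Sketch: verify the GPU-enabled builds of torch and paddlepaddle.
import paddle
import torch

print("torch CUDA available:", torch.cuda.is_available())
print("paddle compiled with CUDA:", paddle.device.is_compiled_with_cuda())
```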
- If you run out of memory (SIGKILL / exit code 137), reduce model sizes (e.g. use `microsoft/trocr-base-handwritten` instead of the large model) or increase Docker memory (Docker Desktop → Settings → Resources).
- You can also skip copying models at startup and mount them directly into the standard locations (`/root/.paddleocr`, `/root/.paddlex`, `/root/.cache/huggingface`) if you prefer, as long as the directory structures match what the libraries expect.
To build an amd64 image on another architecture and move it as a tarball (e.g. into a TRE):

```sh
docker build --platform linux/amd64 -t ner-ocr:amd64 .
docker save ner-ocr:amd64 | gzip > ner-ocr-amd64.tar.gz
gzip -dc ner-ocr-amd64.tar.gz | docker load
```