Releases: NVIDIA-NeMo/Curator

NVIDIA NeMo Curator 1.0.0

01 Oct 15:15
f0a761c

This major release represents a fundamental architecture shift from Dask to Ray, expanding NeMo Curator to support multimodal data curation with new video and audio capabilities. This refactor enables unified backend processing, better heterogeneous computing support, and enhanced autoscaling for dynamic workloads.

Installation Updates

  • New Docker container: Updated Docker infrastructure with a CUDA 12.8.1 and Ubuntu 24.04 base, available from the NGC Catalog (nvcr.io/nvidia/nemo-curator:25.09)

  • Dockerfile for building your own image: Simplified Dockerfile structure for custom container builds with FFmpeg support

  • UV source installations: Integrated UV package manager (v0.8.22) for faster dependency management

  • PyPI improvements: Enhanced PyPI installation with modular extras for targeted functionality:

    Extra            Installation Command          Description
    All Modalities   nemo-curator[all]             Complete installation with all modalities and GPU support
    Text Curation    nemo-curator[text_cuda12]     GPU-accelerated text processing with RAPIDS
    Image Curation   nemo-curator[image_cuda12]    Image processing with NVIDIA DALI
    Audio Curation   nemo-curator[audio_cuda12]    Speech recognition with NeMo ASR models
    Video Curation   nemo-curator[video_cuda12]    Video processing with GPU acceleration
    Basic GPU        nemo-curator[cuda12]          CUDA utilities without modality-specific dependencies

    All GPU installations require the NVIDIA PyPI index:

    uv pip install --extra-index-url https://pypi.nvidia.com "nemo-curator[EXTRA]"

New Modalities

Video

NeMo Curator now supports comprehensive video data curation with distributed processing capabilities.

Audio

This release adds new audio curation capabilities for speech data processing.

Modality Refactors

Text

  • Ray backend migration: Complete transition from Dask to Ray for distributed text processing
  • Improved model-based classifier throughput: Tokenization and inference compute now overlap better, and sequences are sorted by length for more efficient GPU memory utilization
  • Task-centric architecture: New Task-based processing model for finer-grained control
  • Pipeline redesign: Updated ProcessingStage and Pipeline architecture with resource specification
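The length-based sorting mentioned above can be illustrated with a small, self-contained sketch. It is not NeMo Curator code: the function names `batch_indices` and `padded_slots` are hypothetical, and the sketch only assumes that each batch is padded to its longest sequence, so grouping sequences of similar length wastes fewer token slots. It does not show the overlap of tokenization with inference.

```python
from typing import List, Sequence


def batch_indices(lengths: Sequence[int], batch_size: int, sort_by_length: bool) -> List[List[int]]:
    """Group sequence indices into batches, optionally ordering by length first."""
    order = list(range(len(lengths)))
    if sort_by_length:
        # Stable sort: sequences of equal length keep their original order.
        order.sort(key=lambda i: lengths[i])
    return [order[start:start + batch_size] for start in range(0, len(order), batch_size)]


def padded_slots(lengths: Sequence[int], batches: Sequence[Sequence[int]]) -> int:
    """Token slots consumed when every batch is padded to its longest sequence."""
    return sum(len(batch) * max(lengths[i] for i in batch) for batch in batches)


# Hypothetical token counts for eight documents: short and long ones interleaved.
lengths = [5, 120, 7, 118, 6, 121, 8, 119]

unsorted_batches = batch_indices(lengths, batch_size=4, sort_by_length=False)
sorted_batches = batch_indices(lengths, batch_size=4, sort_by_length=True)

unsorted_cost = padded_slots(lengths, unsorted_batches)
sorted_cost = padded_slots(lengths, sorted_batches)
```

With these example lengths, unsorted batching pads to 964 token slots and sorted batching to 516, against 504 real tokens. The returned indices let a caller restore the original document order after inference.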

Image

  • Pipeline-based architecture: Transitioned from legacy ImageTextPairDataset to modern stage-based processing with ImageReaderStage, ImageEmbeddingStage, and filter stages
  • DALI-based image loading: New ImageReaderStage uses NVIDIA DALI for high-performance WebDataset tar shard processing with GPU/CPU fallback
  • Modular processing stages: Separate stages for embedding generation, aesthetic filtering, and NSFW filtering
  • Task-based data flow: Images processed as ImageBatch tasks containing ImageObject instances with metadata, embeddings, and classification scores

Learn more about image curation.

Deduplication Improvements

Enhanced deduplication capabilities across all modalities with improved performance and flexibility:

  • Exact and Fuzzy deduplication: Updated rapidsmpf-based shuffle backend for more efficient GPU-to-GPU data transfer and better spilling capabilities
  • Semantic deduplication: Support for deduplicating text, image, and video datasets using unified embedding-based workflows
  • New ranking strategies: Added RankingStrategy, which lets you rank elements within cluster centers to decide which point to prioritize during duplicate removal. It supports metadata-based ranking to prioritize specific datasets or inputs
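As a rough illustration of metadata-based ranking, the sketch below orders the items of each cluster by a source priority and then by distance to the cluster center. Everything in it is an assumption made for the example: the `Item` fields, the `SOURCE_PRIORITY` table, and the `rank_within_clusters` helper are hypothetical and do not reflect the actual RankingStrategy API.

```python
from collections import defaultdict
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class Item:
    item_id: str
    cluster: int
    source: str                # metadata used for prioritization (assumed field)
    distance_to_center: float  # smaller means closer to the cluster center


# Hypothetical priority table: lower values are preferred when removing duplicates.
SOURCE_PRIORITY = {"curated": 0, "web": 1}


def rank_key(item: Item) -> Tuple[int, float]:
    """Rank by source priority first, then by distance to the cluster center."""
    unknown_source_rank = len(SOURCE_PRIORITY)
    return (SOURCE_PRIORITY.get(item.source, unknown_source_rank), item.distance_to_center)


def rank_within_clusters(items: List[Item]) -> Dict[int, List[str]]:
    """Return item ids per cluster, ordered from most to least preferred."""
    by_cluster: Dict[int, List[Item]] = defaultdict(list)
    for item in items:
        by_cluster[item.cluster].append(item)
    return {
        cluster: [item.item_id for item in sorted(members, key=rank_key)]
        for cluster, members in by_cluster.items()
    }


items = [
    Item("a", cluster=0, source="web", distance_to_center=0.10),
    Item("b", cluster=0, source="curated", distance_to_center=0.30),
    Item("c", cluster=0, source="curated", distance_to_center=0.20),
    Item("d", cluster=1, source="web", distance_to_center=0.05),
]
ranked = rank_within_clusters(items)
```

In this example the curated items outrank the web item in cluster 0 even though the web item sits closer to the center, which is the kind of dataset prioritization the release note describes.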

Core Refactors

The architecture refactor introduces a layered system with unified interfaces and multiple execution backends:

User Layer: Pipeline → ProcessingStage X→Y → ProcessingStage Y→Z → ProcessingStage Z→W
           ↓
Orchestration Layer: BaseExecutor Interface
           ↓
Backend Layer: XennaExecutor (Production Ready) | RayActorPoolExecutor (Experimental) | RayDataExecutor (Experimental)
           ↓
Adaptation Layer: Xenna Adapter | Ray Actor Adapter | Ray Data Adapter
           ↓
Execution Layer: Cosmos-Xenna (Streaming/Batch) | Ray Actor Pool (Load Balancing) | Ray Data API (Dataset Processing)

Pipelines

  • New Pipeline API: Ray-based pipeline execution with BaseExecutor interface
  • Multiple backends: Support for Xenna, Ray Actor Pool, and Ray Data execution backends
  • Resource specification: Configurable CPU and GPU memory requirements per stage
  • Stage composition: Improved stage validation and execution orchestration
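The separation between a pipeline and its execution backend can be sketched in plain Python. The classes below are hypothetical stand-ins (`SketchPipeline`, `SketchExecutor`, and the two executors are invented for this example and use no Ray or NeMo Curator calls); they only show the idea that the same list of stages can be handed to interchangeable executors behind one interface.

```python
from abc import ABC, abstractmethod
from typing import Any, Callable, List

Stage = Callable[[Any], Any]


class SketchExecutor(ABC):
    """Hypothetical executor interface: runs stages over tasks."""

    @abstractmethod
    def execute(self, stages: List[Stage], tasks: List[Any]) -> List[Any]:
        ...


class StreamingExecutor(SketchExecutor):
    """Pushes each task through every stage before starting the next task."""

    def execute(self, stages: List[Stage], tasks: List[Any]) -> List[Any]:
        results = []
        for task in tasks:
            for stage in stages:
                task = stage(task)
            results.append(task)
        return results


class BatchExecutor(SketchExecutor):
    """Runs one stage over all tasks before moving to the next stage."""

    def execute(self, stages: List[Stage], tasks: List[Any]) -> List[Any]:
        current = list(tasks)
        for stage in stages:
            current = [stage(task) for task in current]
        return current


class SketchPipeline:
    """Holds an ordered list of stages; execution is delegated to an executor."""

    def __init__(self, name: str) -> None:
        self.name = name
        self.stages: List[Stage] = []

    def add_stage(self, stage: Stage) -> "SketchPipeline":
        self.stages.append(stage)
        return self

    def run(self, executor: SketchExecutor, tasks: List[Any]) -> List[Any]:
        if not self.stages:
            raise ValueError(f"pipeline {self.name!r} has no stages")
        return executor.execute(self.stages, tasks)


pipeline = SketchPipeline("normalize").add_stage(str.strip).add_stage(str.lower)
tasks = ["  Hello ", "WORLD  "]
streamed = pipeline.run(StreamingExecutor(), tasks)
batched = pipeline.run(BatchExecutor(), tasks)
```

Both executors produce the same results here; they differ only in scheduling order, which is the property that lets a backend be swapped without changing the pipeline definition.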

Stages

  • ProcessingStage redesign: Generic ProcessingStage[X, Y] base class with type safety
  • Resource requirements: Built-in resource specification for CPU and GPU memory
  • Backend adapters: Stage adaptation layer for different Ray orchestration systems
  • Input/output validation: Enhanced type checking and data validation
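A generic, typed stage with a per-stage resource declaration might look like the sketch below. It is an illustration only: `SketchStage`, `SketchResources`, and `validate_chain` are hypothetical names, the runtime checks are an assumption about what input/output validation could mean, and none of this is taken from the real ProcessingStage implementation.

```python
from dataclasses import dataclass
from typing import Generic, Sequence, TypeVar

X = TypeVar("X")
Y = TypeVar("Y")


@dataclass(frozen=True)
class SketchResources:
    """Hypothetical per-stage resource request."""
    cpus: float = 1.0
    gpu_memory_gb: float = 0.0


class SketchStage(Generic[X, Y]):
    """Hypothetical stage that maps a task of type X to a task of type Y."""

    input_type: type = object
    output_type: type = object
    resources: SketchResources = SketchResources()

    def process(self, task: X) -> Y:
        raise NotImplementedError

    def __call__(self, task: X) -> Y:
        # Runtime validation of both sides of the stage.
        if not isinstance(task, self.input_type):
            raise TypeError(f"expected {self.input_type.__name__}, got {type(task).__name__}")
        result = self.process(task)
        if not isinstance(result, self.output_type):
            raise TypeError(f"expected {self.output_type.__name__}, got {type(result).__name__}")
        return result


class Tokenize(SketchStage[str, list]):
    input_type = str
    output_type = list

    def process(self, task: str) -> list:
        return task.split()


class CountTokens(SketchStage[list, int]):
    input_type = list
    output_type = int
    resources = SketchResources(cpus=0.5)

    def process(self, task: list) -> int:
        return len(task)


def validate_chain(stages: Sequence[SketchStage]) -> None:
    """Check that each stage's output type is accepted by the next stage."""
    for upstream, downstream in zip(stages, stages[1:]):
        if not issubclass(upstream.output_type, downstream.input_type):
            raise TypeError(
                f"{type(upstream).__name__} outputs {upstream.output_type.__name__}, "
                f"but {type(downstream).__name__} expects {downstream.input_type.__name__}"
            )


stages = [Tokenize(), CountTokens()]
validate_chain(stages)
token_count = stages[1](stages[0]("curate this text"))
```

Declaring the types on the stage lets a pipeline reject a mismatched chain before any data is processed, while the generic parameters give static checkers the same information.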

Tutorials

For all tutorial content, refer to the tutorials directory in the NeMo Curator GitHub repository.

Known Limitations

(Pending Refactor in Future Release)

Generation

  • Synthetic data generation: Synthetic text generation features are being refactored for Ray compatibility
  • Hard negative mining: Retrieval-based data generation workflows under development

PII

  • PII processing: Personally Identifiable Information removal tools are being updated for the Ray backend
  • Privacy workflows: Enhanced privacy-preserving data curation capabilities in development

Blending & Shuffling

  • Data blending: Multi-source dataset blending functionality being refactored
  • Dataset shuffling: Large-scale data shuffling operations under development

Docs Refactor

  • Local preview capability: Improved documentation build system with local preview support
  • **...

NVIDIA NeMo Curator 0.9.0

28 Jul 20:18
23da8c2

Major Features and Enhancements

  • New How-to Data Recipes (Tutorials)
    • Multimodal DAPT Curation w/ PDF Extraction
    • Llama Nemotron Data Curation
    • LLM NIM - PII Redaction
  • Performance and Code Optimizations
    • Simplified Clustering Logic: Significantly improved semantic deduplication clustering performance
    • Removed convoluted backend switching logic that caused performance issues
    • Eliminated expensive length assertions that could cause timeouts on large datasets
    • Improved GPU utilization during KMeans clustering operations
    • Tested on 37M embedding dataset (80GB) across 7 GPUs with substantial performance gains

Bug Fixes

  • FastText Download URL Fix
    • Corrected the fasttext model download URL in nemotron-cc tutorial
    • Changed from dl.fbaipublicfiles.com/fastText/ to dl.fbaipublicfiles.com/fasttext/
    • Ensures reliable model downloads for language identification
  • NeMo Retriever Tutorial Bug Fix
    • Fixed lambda function bug in RetrieverEvalSetGenerator
    • Corrected score assignment from df["question"].apply(lambda: 1) to df["score"] = 1
  • API Usage Updates
    • Updated examples and tutorials to use correct DocumentDataset API
    • Replaced deprecated write_to_disk(result, output_dir, output_type="parquet") with result.to_parquet(output_dir)
    • Updated exact deduplication workflows: deduplicator.remove() now returns DocumentDataset directly
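The lambda bug in the retriever tutorial fix can be reproduced without pandas. The sketch below is a plain-Python analogue and not the tutorial code: it assumes only that an element-wise apply calls the function with one argument per element, which a zero-parameter lambda cannot accept. The `rows` data is made up for the example.

```python
rows = [{"question": "What is Ray?"}, {"question": "What is Dask?"}]

# A lambda with no parameters cannot receive the element that an
# element-wise apply passes in, so the call fails before any score is set.
broken = lambda: 1  # noqa: E731 (kept as a lambda to mirror the bug)

try:
    scores = [broken(row["question"]) for row in rows]
except TypeError as exc:
    failure = type(exc).__name__
else:
    failure = None

# The fix assigns the constant directly instead of calling a function per row.
for row in rows:
    row["score"] = 1
```

Direct assignment is also the simpler choice here, since the score is a constant and needs no per-row computation.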

NVIDIA NeMo Curator 0.8.0

09 May 01:11
cf12d34

  • Llama Based PII Redaction
  • Trafilatura Text Extractor
  • Chinese & Japanese Stopwords for Text Extractors
  • Writing gzip-compressed JSONL datasets
  • Training dataset curation for retriever customization using hard-negative mining
  • Implemented a memory-efficient pairwise similarity computation in Semantic Deduplication

NVIDIA NeMo Curator 0.8.0rc3.dev0

15 Apr 19:44
cff3cb6

Pre-release

Prerelease: NVIDIA NeMo Curator 0.8.0rc3.dev0 (2025-04-15)

NVIDIA NeMo Curator 0.8.0rc2.dev0

07 Apr 20:15
8cbd68f

Pre-release

Prerelease: NVIDIA NeMo Curator 0.8.0rc2.dev0 (2025-04-07)

NVIDIA NeMo Curator 0.7.1

31 Mar 22:52
d0cc62d

  • Fix Transformers + CUDA context bug
  • Fix rate limit in SDG Retriever Eval Tutorial

NVIDIA NeMo Curator 0.7.0

12 Mar 21:22
f207c99

  • Python 3.12 Support
  • Curator on Blackwell
  • Nemotron-CC Dataset Recipe
  • Performant S3 for Fuzzy Deduplication

NVIDIA NeMo Curator 0.7.0rc2.dev0

25 Feb 13:12
6a05d29

Pre-release

Prerelease: NVIDIA NeMo Curator 0.7.0rc2.dev0 (2025-02-25)

NVIDIA NeMo Curator 0.7.0rc1.dev1

19 Feb 18:21
c3ebcb5

Pre-release

Prerelease: NVIDIA NeMo Curator 0.7.0rc1.dev1 (2025-02-19)

NVIDIA NeMo Curator 0.7.0rc0.dev1

04 Feb 21:41
7ab04ce

Pre-release

Prerelease: NVIDIA NeMo Curator 0.7.0rc0.dev1 (2025-02-04)