Copilot AI commented Jul 8, 2025

Adds a new --validate-images CLI option that enables PIL-based image validation to filter out corrupt or invalid image files during processing.

Problem

When processing image datasets, users may encounter corrupt or invalid image files that cause processing to fail. Currently, zamba only checks that files exist and have non-zero size; it does not verify that they are valid images that can be opened and decoded.

Solution

This PR adds a new CLI option --validate-images that:

  • Attempts to open each image file with PIL (Python Imaging Library)
  • Filters out images that cannot be opened or decoded
  • Logs appropriate warning messages about filtered files
  • Continues processing with only valid images
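The steps above can be sketched roughly as follows. This is a minimal illustration and not the code from this PR: the function names `can_open_with_pil` and `filter_valid_images` are hypothetical, and the choice to call `img.load()` (forcing a full decode) and to catch a broad `Exception` are assumptions about what "cannot be opened or decoded" should mean.

```python
import logging
from pathlib import Path
from typing import Iterable, List, Tuple

from PIL import Image

logger = logging.getLogger(__name__)


def can_open_with_pil(path: Path) -> bool:
    """Return True if PIL can open and fully decode the file at `path`.

    Hypothetical helper for illustration; the PR's own function is
    `_validate_filepath_with_pil()`, whose body is not shown here.
    """
    try:
        with Image.open(path) as img:
            # Image.open is lazy; load() forces a decode so that
            # truncated or partially written files are also caught.
            img.load()
        return True
    except Exception:
        # Deliberately broad (an assumption): any failure to open or
        # decode means the file is treated as invalid.
        return False


def filter_valid_images(paths: Iterable[Path]) -> Tuple[List[Path], List[Path]]:
    """Split `paths` into (valid, invalid) and log a warning for invalid ones."""
    valid: List[Path] = []
    invalid: List[Path] = []
    for raw in paths:
        path = Path(raw)
        (valid if can_open_with_pil(path) else invalid).append(path)

    if invalid:
        logger.warning(
            "%d files cannot be opened with PIL; ignoring those files. Example: %s",
            len(invalid),
            [p.name for p in invalid[:5]],
        )
    return valid, invalid
```

Processing would then continue with only the `valid` list, while `invalid` is available for reporting.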

Usage

Command Line Interface

For image prediction:

zamba image predict --data-dir /path/to/images --validate-images

For image training:

zamba image train --data-dir /path/to/images --labels /path/to/labels.csv --validate-images

Python API

from zamba.images.config import ImageClassificationPredictConfig

config = ImageClassificationPredictConfig(
    data_dir="/path/to/images",
    validate_images=True
)

Implementation Details

  • Backward Compatible: Feature is disabled by default (validate_images=False)
  • Comprehensive Logging: Distinguishes between file existence failures and PIL validation failures
  • Efficient Processing: Uses parallel processing for training validation
  • Robust Error Handling: Gracefully handles all PIL-related exceptions
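As a rough sketch of how the logging and parallelism points above could fit together, the example below classifies each file as missing, unreadable, or OK, and runs the checks in a thread pool. Everything here is an assumption for illustration: `FileStatus`, `check_file`, and `check_files` are hypothetical names, and `ThreadPoolExecutor` is only one possible mechanism; the PR does not state which parallel approach it uses.

```python
from concurrent.futures import ThreadPoolExecutor
from enum import Enum
from pathlib import Path
from typing import Dict, Iterable

from PIL import Image


class FileStatus(Enum):
    OK = "ok"
    MISSING = "missing"        # does not exist on disk, or has zero size
    UNREADABLE = "unreadable"  # exists, but PIL cannot open or decode it


def check_file(path: Path) -> FileStatus:
    """Classify one file so the two failure kinds can be logged separately."""
    if not path.exists() or path.stat().st_size == 0:
        return FileStatus.MISSING
    try:
        with Image.open(path) as img:
            img.load()  # force a full decode, not just a header read
    except Exception:
        # Broad catch is an assumption: any PIL failure counts as unreadable.
        return FileStatus.UNREADABLE
    return FileStatus.OK


def check_files(paths: Iterable[Path], max_workers: int = 4) -> Dict[Path, FileStatus]:
    """Check many files concurrently; results keep the input order."""
    path_list = [Path(p) for p in paths]
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        statuses = list(pool.map(check_file, path_list))
    return dict(zip(path_list, statuses))
```

Keeping the status separate from the filtering step lets the caller emit one message for files that do not exist and another for files that fail PIL validation.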

Changes Made

  1. CLI Enhancement: Added --validate-images option to both predict and train commands
  2. Configuration: Added validate_images: bool = False parameter to both config classes
  3. Validation Logic: Implemented _validate_filepath_with_pil() function using PIL
  4. Integration: Enhanced existing validation methods to use PIL when enabled
  5. Logging: Added specific messages for PIL validation failures
  6. Tests: Comprehensive test suite covering all functionality
  7. Documentation: Added detailed usage examples and documentation

Example Output

With validation enabled, users will see:

INFO     | Validating image files exist and can be opened with PIL
WARNING  | 2 files in provided labels file do not exist on disk or cannot be opened with PIL; ignoring those files. Example: ['corrupt_image.jpg', 'invalid_file.jpg']...

This feature is particularly useful when working with datasets from external sources or when data integrity is uncertain.



Copilot AI changed the title from "[WIP] Can you add a CLI option for images that will try to open images with PIL and filter them out of the list we are processing (including adding a message in the logs) if they fail to load?" to "Add --validate-images CLI option to filter corrupt images using PIL" Jul 8, 2025
Copilot AI requested a review from pjbull July 8, 2025 21:18
Copilot finished work on behalf of pjbull July 8, 2025 21:18