@alisonshao
Collaborator

Problem

The current validation logic only checks files that are found by glob pattern matching. If a model's snapshot directory exists with an index file but actual weight files are missing (due to incomplete downloads or cache corruption), the validation passes and claims "Found local HF snapshot", then crashes with FileNotFoundError when trying to load the missing files.

Example from CI:

[TP0] Found local HF snapshot for openai/gpt-oss-120b at 
/hf_home/hub/models--openai--gpt-oss-120b/snapshots/...

FileNotFoundError: No such file or directory: 
.../model-00000-of-00014.safetensors

The issue is that glob only finds files that exist on disk. If files are missing entirely, they're never validated, so the system doesn't know they should exist.
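To illustrate the gap, here is a minimal sketch (not code from this PR). It builds a hypothetical two-shard snapshot in a temporary directory where only one shard is present on disk, then compares what glob sees with what the index declares. The shard names and tensor keys are made up for the example.

```python
import glob
import json
import os
import tempfile

# Hypothetical shard names and tensor keys, for illustration only.
shards = ["model-00000-of-00002.safetensors", "model-00001-of-00002.safetensors"]
index = {"weight_map": {"layer.0.weight": shards[0], "layer.1.weight": shards[1]}}

with tempfile.TemporaryDirectory() as snapshot_dir:
    with open(os.path.join(snapshot_dir, "model.safetensors.index.json"), "w") as f:
        json.dump(index, f)

    # Simulate an incomplete download: only the second shard exists on disk.
    open(os.path.join(snapshot_dir, shards[1]), "wb").close()

    # A glob-based check can only ever see files that are present.
    found = sorted(
        os.path.basename(p)
        for p in glob.glob(os.path.join(snapshot_dir, "*.safetensors"))
    )

# The index is the only place that records which files *should* exist.
expected = sorted(set(index["weight_map"].values()))
missing_from_disk = sorted(set(expected) - set(found))

print("glob found:      ", found)
print("index expects:   ", expected)
print("never validated: ", missing_from_disk)
```

Any per-file validation that iterates over `found` would pass here, because the absent shard is never visited.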

Solution

Added _check_index_files_exist() function that:

  1. Reads the safetensors index file (model.safetensors.index.json)
  2. Extracts the complete list of required files from the weight_map
  3. Verifies that ALL files in the weight_map actually exist on disk
  4. Returns validation failure with specific missing filenames if any are absent

This function is integrated into _validate_sharded_model() and runs before other validation checks. When files are missing, validation fails and triggers a re-download instead of crashing during load.
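The four steps above can be sketched roughly as follows. This is an illustrative approximation, not the merged implementation: the name `check_index_files_exist_sketch`, its parameters, and the `(ok, message)` return shape are assumptions, since the actual signature of `_check_index_files_exist()` is not shown in this conversation.

```python
import json
import os
from typing import Optional, Tuple


def check_index_files_exist_sketch(
    snapshot_dir: str,
    index_name: str = "model.safetensors.index.json",
) -> Tuple[bool, Optional[str]]:
    """Return (True, None) if every file named in the index exists on disk.

    Hypothetical helper; the return convention is an assumption.
    """
    index_path = os.path.join(snapshot_dir, index_name)

    # No index file: treat as a non-sharded model and leave it unaffected.
    if not os.path.isfile(index_path):
        return True, None

    # Step 1: read the index file.
    try:
        with open(index_path, "r", encoding="utf-8") as f:
            index = json.load(f)
    except (OSError, json.JSONDecodeError) as e:
        return False, f"Failed to read index {index_name}: {e}"

    # Step 2: collect the unique shard filenames from weight_map.
    weight_map = index.get("weight_map") if isinstance(index, dict) else None
    if not isinstance(weight_map, dict) or not weight_map:
        return False, f"Index {index_name} has no usable weight_map"
    required = sorted(set(weight_map.values()))

    # Step 3: verify that every required file is present on disk.
    missing = [
        name
        for name in required
        if not os.path.isfile(os.path.join(snapshot_dir, name))
    ]

    # Step 4: report the specific missing filenames.
    if missing:
        return False, (
            f"Missing {len(missing)} file(s) from index {index_name}: {missing}"
        )
    return True, None
```

Running a check like this before the glob-based checks means a missing shard is reported by name, and the caller can fall back to a re-download instead of reaching the loader.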

Testing

  • Tested with simulated CI scenario (14-shard model with 1 missing file)
  • Validation correctly detects missing files and returns clear error message
  • Non-sharded models (no index file) are unaffected
  • All files present: validation passes as expected

Related

Extends the validation work from #13729 and #13870, which added corruption detection but didn't check for missing files.

@alisonshao alisonshao requested a review from hebiao064 as a code owner December 1, 2025 23:28
@alisonshao alisonshao force-pushed the fix/detect-missing-model-files-in-validation branch from c94974a to 56c6192 Compare December 1, 2025 23:32
@alisonshao
Collaborator Author

/tag-and-rerun-ci

@github-actions github-actions bot added the run-ci label Dec 2, 2025
@alisonshao
Collaborator Author

Local Testing Verification

Tested this implementation locally by simulating the exact CI failure scenario. All tests passed.

Test Setup

Created a test script (test_validation_manual.py) that:

  1. Creates a fake 14-shard model snapshot directory with model.safetensors.index.json
  2. Deliberately omits certain shard files to simulate incomplete downloads
  3. Runs the validation functions to verify detection
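For reference, a fixture along these lines could be built as below. This is a sketch under stated assumptions, not the contents of test_validation_manual.py (which is not included in this conversation): the helper name `make_fake_snapshot`, the zero-based shard naming pattern, and the placeholder tensor keys are all hypothetical.

```python
import json
import os
import tempfile
from typing import Iterable, List


def make_fake_snapshot(
    root: str, num_shards: int = 14, omit: Iterable[int] = ()
) -> List[str]:
    """Write an index plus placeholder shard files, skipping indices in `omit`.

    Hypothetical helper. Returns the full list of shard names in the index.
    """
    # Assumed naming pattern, zero-based like the filename in the CI log.
    names = [
        f"model-{i:05d}-of-{num_shards:05d}.safetensors" for i in range(num_shards)
    ]
    # Placeholder tensor keys: one fake tensor per shard.
    weight_map = {f"layers.{i}.weight": name for i, name in enumerate(names)}
    with open(os.path.join(root, "model.safetensors.index.json"), "w") as f:
        json.dump({"metadata": {}, "weight_map": weight_map}, f)

    omitted = set(omit)
    for i, name in enumerate(names):
        if i not in omitted:
            # One placeholder byte; these are not valid safetensors files.
            with open(os.path.join(root, name), "wb") as f:
                f.write(b"\0")
    return names


# Simulate the "multiple missing files" case: shards 0, 5, and 10 absent.
with tempfile.TemporaryDirectory() as root:
    names = make_fake_snapshot(root, num_shards=14, omit=[0, 5, 10])
    absent = [n for n in names if not os.path.exists(os.path.join(root, n))]

print(absent)
```

The placeholder files only exercise the existence check. Validation that parses safetensors headers would need real, well-formed shard contents.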

Test Cases & Results

| Test | Scenario | Result |
| --- | --- | --- |
| Missing files detection | 14-shard model with shard 0 missing (exact CI failure) | ✅ PASS |
| All files present | 14-shard model with all files | ✅ PASS |
| Multiple missing files | 14-shard model with shards 0, 5, 10 missing | ✅ PASS |
| Non-sharded model | Single file model (no index) | ✅ PASS |

Example Output

Simulating the CI failure (missing model-00000-of-00014.safetensors):

Running _check_index_files_exist()...

[PASS] Validation correctly detected missing files!
  Error message: Missing 1 file(s) from index model.safetensors.index.json: ['model-00000-of-00014.safetensors']

CI Failure Reference

The original CI failure from this run:

[TP0] Found local HF snapshot for openai/gpt-oss-120b at /hf_home/hub/...
FileNotFoundError: No such file or directory: .../model-00000-of-00014.safetensors

The _check_index_files_exist() function now catches this before loading starts, preventing the crash and triggering a re-download instead.
