@ANAMASGARD
Contributor

FIXES #267

  • LFW dataset downloads are now more reliable thanks to a fallback URL mechanism and retry logic.

Changes

  • Added download_with_fallback() helper function in R/utils.R that:

    • Tries multiple mirror URLs in sequence
    • Implements configurable retry logic (default: 2 retries per URL)
    • Verifies MD5 checksums when provided
    • Provides informative error messages with manual download instructions when all mirrors fail
  • Updated lfw_people_dataset and lfw_pairs_dataset in R/dataset-lfw.R:

    • Restructured URL configuration to support multiple mirrors (base_urls list)
    • Updated resource definitions with structured format (file_ids, md5)
    • Modified download() method to use the new fallback mechanism
  • Added tests/testthat/test-lfw-reliability.R with tests for:

    • Function existence and behavior
    • Error message quality
    • Dataset constructor validation
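For readers skimming the change list, the fallback-and-retry behaviour described above can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the code in R/utils.R: the name download_with_fallback_sketch and the arguments n_attempts, delay, and download_fun are hypothetical, and the real helper's signature and messages may differ.

```r
# Hypothetical sketch of a fallback downloader; names and arguments are
# assumptions for illustration, not the helper added in this PR.
download_with_fallback_sketch <- function(urls, destfile, md5 = NULL,
                                          n_attempts = 2, delay = 1,
                                          download_fun = utils::download.file) {
  errors <- character()
  for (url in urls) {                        # try each mirror in order
    for (attempt in seq_len(n_attempts)) {   # several attempts per mirror
      result <- tryCatch({
        status <- download_fun(url, destfile, mode = "wb", quiet = TRUE)
        if (!identical(as.integer(status), 0L)) {
          stop("download returned status ", status)
        }
        if (!is.null(md5)) {
          # Compare the checksum of the downloaded file with the expected one
          actual <- unname(tools::md5sum(destfile))
          if (!identical(actual, md5)) {
            stop("MD5 mismatch: expected ", md5, ", got ", actual)
          }
        }
        TRUE
      }, error = function(e) conditionMessage(e))

      if (isTRUE(result)) return(invisible(destfile))

      # Record the failure, drop any partial file, and wait before retrying
      errors <- c(errors, sprintf("%s (attempt %d): %s", url, attempt, result))
      unlink(destfile)
      if (attempt < n_attempts) Sys.sleep(delay)
    }
  }
  stop(
    "All download attempts failed.\n",
    "Attempts:\n", paste0("  - ", errors, collapse = "\n"), "\n",
    "You can download the file manually and place it at: ", destfile,
    call. = FALSE
  )
}
```

The download_fun argument exists only so the sketch can be exercised without network access, for example by passing a function that copies a local file and returns 0.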

Testing

  • All existing LFW tests continue to pass
  • New reliability tests pass
  • Package builds and checks successfully

Notes

  • Currently uses Figshare as the primary mirror
  • Architecture supports adding further mirrors in the future, once reliable ones are identified
  • Maintains backward compatibility with existing dataset API

@cregouby
Collaborator

Hello @ANAMASGARD,

Thanks for the proposal.

The alternative URL you provided cannot be opened; I get the following error when trying to open it:

<Error>
<Code>AccessDenied</Code>
<Message>Access denied.</Message>
<Details>Anonymous caller does not have storage.objects.get access to the Google Cloud Storage object. Permission 'storage.objects.get' denied on resource (or it may not exist).</Details>
</Error>

Which means switching to download_with_fallback() might be of no use...

@ANAMASGARD
Contributor Author

ANAMASGARD commented Dec 11, 2025

Thanks for the feedback @cregouby! You're right - I've removed the non-functional Kaggle URL in the latest commit.

The main value of this PR is now focused on improving download reliability for the existing Figshare URL:

  • Retry logic - Each URL is tried 2 times with a configurable delay before failing (handles transient network issues like timeouts that were causing CI failures)
  • Better error messages - When downloads fail, users get:
    1. List of URLs that were attempted
    2. Clear instructions for manual download with the cache directory path
    3. The actual error message for debugging
  • MD5 verification - Validates downloaded files aren't corrupted before extraction
  • Future-proof infrastructure - The base_urls list makes it easy to add additional mirrors when reliable ones are identified

Even with a single mirror, the retry mechanism should help reduce the intermittent download failures that were causing test flakiness. Would you like me to make any adjustments to this approach?
