Closed
Description
See Issue #1216, where I first reported the crashes.
To reproduce:
- Run the nvcr.io/nvidia/nemo-curator:25.09 Docker image.
- Inside the container, run the code example from the documentation: https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html. You have to start the Ray client before running the workflow. The docs don't mention this step, but it is consistent with the semantic dedup tutorial.
That's it. I can't share the data I'm using because it's proprietary, but any data should do. I stepped through the code in the debugger and inspected the debug prints; the bugs are unrelated to my data.
See Issue #1216 for details about everything that goes wrong when you attempt the above. It fails immediately, and if you work around one problem, you immediately hit another. I can't even get the first stage (MinHash) to succeed, much less the full workflow.
I'd consider this issue resolved if someone can produce a single example that actually works with FuzzyDeduplicationWorkflow. That's all I want at this point.