FuzzyDeduplicationWorkflow crashes due to multiple bugs #1228

@jmcmanus15

See Issue #1216, where I first reported the crashes.

To reproduce:

  1. Run the nvcr.io/nvidia/nemo-curator:25.09 Docker image.
  2. Inside the container, execute the code example in the documentation: https://docs.nvidia.com/nemo/curator/latest/curate-text/process-data/deduplication/index.html. You have to start the Ray client before running the workflow; the docs don't mention this step, but the semantic dedup tutorial does it the same way.

That's it. I can't share the data I'm using because it's proprietary, but any data should do. I stepped through the code in the debugger and inspected the debug prints; the bugs are unrelated to my data.
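Since the failures don't depend on the data, a throwaway dataset along these lines could stand in for the proprietary one. This is only a sketch using the Python standard library: the JSONL layout, the `id` and `text` field names, the `synthetic.jsonl` file name, and the `write_synthetic_jsonl` helper are all assumptions made for illustration, not something taken from the docs or from Issue #1216.

```python
import json
import random
import tempfile
from pathlib import Path


def write_synthetic_jsonl(out_dir, n_docs=100, dup_fraction=0.3, seed=0):
    """Write one JSONL file with unique documents plus near-duplicates.

    The "id" and "text" field names are assumptions for illustration.
    """
    rng = random.Random(seed)
    vocab = [f"word{i}" for i in range(500)]
    out_dir = Path(out_dir)
    out_dir.mkdir(parents=True, exist_ok=True)

    n_dups = int(n_docs * dup_fraction)
    n_unique = n_docs - n_dups
    # Each base document is 200 words sampled at random from the vocabulary.
    base_docs = [" ".join(rng.choices(vocab, k=200)) for _ in range(n_unique)]

    docs = list(base_docs)
    for _ in range(n_dups):
        words = rng.choice(base_docs).split()
        # Change a single word so the copy is a near-duplicate, not an exact one.
        words[rng.randrange(len(words))] = "edited"
        docs.append(" ".join(words))

    path = out_dir / "synthetic.jsonl"
    with path.open("w", encoding="utf-8") as f:
        for i, text in enumerate(docs):
            f.write(json.dumps({"id": i, "text": text}) + "\n")
    return path


if __name__ == "__main__":
    # Writes 70 unique documents and 30 near-duplicates to a temp directory.
    print(write_synthetic_jsonl(tempfile.mkdtemp(), n_docs=100))
```

Pointing the documentation example at the resulting directory should be enough to check whether the crashes reproduce without any real data.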

See Issue #1216 for details about all the things that go wrong when you attempt the above. It fails immediately. If you circumvent one problem, you immediately find another. I can't even get the first stage (MinHash) to succeed, much less the full workflow.

I'd consider this issue resolved if someone can produce a single example that actually works with FuzzyDeduplicationWorkflow. That's all I want at this point.
