Implement FineWeb quality filter and update helpers #3

Merged
kris927b merged 4 commits into main from feat/fineweb-quality-filter
Jun 12, 2025
@kris927b kris927b commented Jun 9, 2025

Owner

This commit introduces the FineWebQualityFilterImpl, a Rust implementation of the FineWeb quality filtering algorithm, replacing a previous placeholder filter that checked a metadata score.

Key changes:

  • Added FineWebQualityFilterImpl in src/pipeline/filters/fineweb_quality.rs with configurable parameters (line punctuation, short lines, character duplicates, newline ratio) and logic based on the reference Python version.
  • Updated src/utils/text.rs:
    • Corrected the split_into_words function for accurate word segmentation using ICU.
    • Implemented the find_duplicates function to calculate character duplication ratios.
  • Added comprehensive unit tests for FineWebQualityFilterImpl, covering various scenarios and edge cases for each filtering criterion.
  • Updated filter registration in src/pipeline/filters/mod.rs.
  • Updated src/config.rs to define FineWebQualityFilterImplParams for YAML configuration and updated the StepConfig enum.
  • Updated src/bin/worker.rs to correctly instantiate and integrate the new filter into the pipeline using the new parameters.
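
To make the four criteria listed above concrete, here is a minimal sketch of how such a filter could be structured. It is not the code from this PR: `QualityParams`, `illustrative_params`, `check`, the stop-character set, and the rejection labels are hypothetical names chosen for illustration, the threshold values are illustrative rather than the PR's defaults, and words are counted with `split_whitespace` instead of ICU segmentation to keep the example dependency-free. `find_duplicates` here is assumed to count repeated lines and the characters they contain.

```rust
use std::collections::HashSet;

/// Hypothetical parameter struct; field names are assumptions,
/// not the definitions in `src/config.rs`.
#[derive(Debug, Clone)]
pub struct QualityParams {
    pub line_punct_thr: f64,        // minimum share of lines ending in punctuation
    pub short_line_thr: f64,        // maximum share of short lines
    pub short_line_length: usize,   // a line is "short" at or below this many chars
    pub char_duplicates_ratio: f64, // maximum share of chars in repeated lines
    pub new_line_ratio: f64,        // maximum newlines per word
}

/// Illustrative thresholds only.
pub fn illustrative_params() -> QualityParams {
    QualityParams {
        line_punct_thr: 0.12,
        short_line_thr: 0.67,
        short_line_length: 30,
        char_duplicates_ratio: 0.01,
        new_line_ratio: 0.3,
    }
}

/// Returns (number of repeated lines, total characters in those repeats).
/// The first occurrence of a line is not counted as a duplicate.
pub fn find_duplicates(lines: &[&str]) -> (usize, usize) {
    let mut seen: HashSet<&str> = HashSet::new();
    let (mut dup_lines, mut dup_chars) = (0usize, 0usize);
    for &line in lines {
        if !seen.insert(line) {
            dup_lines += 1;
            dup_chars += line.chars().count();
        }
    }
    (dup_lines, dup_chars)
}

/// Returns Ok(()) if the text passes every check, or the name of the
/// first failed criterion.
pub fn check(text: &str, p: &QualityParams) -> Result<(), &'static str> {
    // Assumed stop characters; the real filter may use a different set.
    const STOP_CHARS: [char; 5] = ['.', '!', '?', '"', '\''];

    let lines: Vec<&str> = text
        .split('\n')
        .filter(|l| !l.trim().is_empty())
        .collect();
    if lines.is_empty() {
        return Err("empty");
    }
    let n = lines.len() as f64;

    // 1. Line punctuation: share of lines whose last char is a stop char.
    let punct = lines
        .iter()
        .filter(|l| l.chars().last().map_or(false, |c| STOP_CHARS.contains(&c)))
        .count() as f64;
    if punct / n < p.line_punct_thr {
        return Err("line_punct_ratio");
    }

    // 2. Short lines: share of lines at or below the length limit.
    let short = lines
        .iter()
        .filter(|l| l.chars().count() <= p.short_line_length)
        .count() as f64;
    if short / n > p.short_line_thr {
        return Err("short_line_ratio");
    }

    // 3. Character duplicates: chars in repeated lines over all non-newline chars.
    let total_chars = text.chars().filter(|c| *c != '\n').count().max(1) as f64;
    let (_, dup_chars) = find_duplicates(&lines);
    if dup_chars as f64 / total_chars > p.char_duplicates_ratio {
        return Err("char_dup_ratio");
    }

    // 4. Newline ratio: newlines per word (whitespace split stands in for ICU).
    let words = text.split_whitespace().count().max(1) as f64;
    let newlines = text.matches('\n').count() as f64;
    if newlines / words > p.new_line_ratio {
        return Err("newline_ratio");
    }
    Ok(())
}
```

Under these assumptions, a navigation-style snippet such as `"Home\nAbout\nContact\nHome"` is rejected at the first check because no line ends in punctuation, while two long, distinct sentences pass all four checks. Returning the name of the failed criterion keeps the rejection reason available for logging.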

All existing and new tests pass, indicating that the filter behaves as expected and integrates correctly with the existing pipeline infrastructure.

google-labs-jules Bot and others added 4 commits June 9, 2025 17:34
@kris927b kris927b merged commit 9ef07b6 into main Jun 12, 2025
1 check passed
@kris927b kris927b deleted the feat/fineweb-quality-filter branch June 12, 2025 06:42