Probe/zero width bad char injection #1489

base: main
Conversation
DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅

I have read the DCO Document and I hereby sign the DCO

recheck
Wow, thank you! Will take a look. NB There are failing tests - could you address these?

I'll dig into the failing tests and push a fix shortly.
Failure 1 (docstring assertion - fixed): `test_probes.py` expects every probe docstring to have at least two paragraphs (summary + detail) separated by a blank line. I've fixed this by expanding the docstring into a proper two-paragraph form.

Failure 2 (langprovider call count mismatch): @leondz, I need some help here. `BadCharacters` is the only probe in this test whose prompts are stored as `garak.attempt.Conversation` objects instead of raw strings. Because of that, `Probe.probe()` takes the "conversation" branch, which calls `langprovider.get_text` once per prompt/turn instead of batching all prompts into a single call. We still do one reverse-translation per attempt, and the test's mock is attached to the same `Passthru` instance for both directions. With 256 prompts, this yields 256 forward + 256 reverse calls (512 total), while the test assumes the string/batched path and expects `len(prompts) + 1` (257) calls.
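For reference, the arithmetic behind the mismatch (numbers taken from the analysis above; a back-of-the-envelope check, not a measurement):

```python
# Call counts for the two Probe.probe() paths, per the analysis above.
n_prompts = 256

# String/batched path the test assumes: one batched forward
# langprovider.get_text call, plus one reverse-translation per attempt.
batched_path_calls = 1 + n_prompts        # 257 == len(prompts) + 1

# Conversation path BadCharacters actually takes: one forward call
# per prompt/turn, plus one reverse-translation per attempt.
conversation_path_calls = n_prompts + n_prompts   # 512

print(batched_path_calls, conversation_path_calls)  # 257 512
```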
This is the reason the test is failing. The test can be updated to account for exercising this new calculation, which can be accomplished by simply adding detection for the
erickgalinkin
left a comment
Overall looks great, thanks! A few places where I need some clarification. Beyond that, `self.prompts` should be `list[str]` and the conversation creation is handled in `_mint_attempt`. This should save us some work. I'll need to do some more local testing as well.
```python
return "".join(rendered)
```

```python
def _load_homoglyph_map() -> dict[str, List[str]]:
```
Does it make more sense to just turn `intentional.txt` into a JSON file so we can load it from disk without all the extra file parsing?
I don't see anywhere that the .txt file is used directly.
`intentional.txt` is the upstream Unicode Security format (https://www.unicode.org/Public/security/latest/intentional.txt), so we can drop in updates directly without maintaining a parallel generated artifact. Parsing is minimal (split on `;` / `#`) and only happens once at init, so there isn't much overhead to avoid. Converting this to JSON would add a regeneration step, but I can switch to JSON if that's what's expected/preferred.
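For illustration, a minimal parsing sketch under the assumption that `intentional.txt` keeps its upstream `source ; target # comment` layout with fields as hex code points; the function name and return shape mirror `_load_homoglyph_map`, but this is not garak's actual code:

```python
from collections import defaultdict
from typing import Dict, List

def load_homoglyph_map(path: str) -> Dict[str, List[str]]:
    """Parse Unicode Security intentional.txt into source -> homoglyphs."""
    homoglyphs: Dict[str, List[str]] = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.split("#", 1)[0].strip()  # drop trailing comment
            if not line:
                continue  # skip blank / comment-only lines
            fields = [field.strip() for field in line.split(";")]
            if len(fields) < 2:
                continue
            # Fields are space-separated hex code points, e.g. "0621"
            source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
            target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
            homoglyphs[source].append(target)
    return dict(homoglyphs)
```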
```python
probe_instance.probe(generator_instance)
```

```python
expected_provision_calls = len(probe_instance.prompts) + 1
```
Why are we modifying this?
```python
if text in self._seen_prompts:
    return
self._seen_prompts.add(text)
```
Should this filter when the same text is offered with different metadata?
If not, then this could also be simplified by using a set() for self.prompts and just creating the Conversation from the param, letting the set logic use comparison to deduplicate fully unique Conversation objects.
While slightly more runtime cost would be incurred, the long-term memory savings from not holding onto self._seen_prompts for the entire lifecycle of the probe might be a fair trade.
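A rough sketch of this alternative, under the loud assumption that Conversation objects are hashable and comparable (a frozen stand-in is used here; garak's real `garak.attempt.Conversation` may not support this directly):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Conversation:
    # Hashable stand-in for garak.attempt.Conversation, illustration only.
    text: str
    notes: tuple = field(default_factory=tuple)

prompts: set[Conversation] = set()
prompts.add(Conversation("payload", (("category", "invisible"),)))
prompts.add(Conversation("payload", (("category", "invisible"),)))   # deduplicated
prompts.add(Conversation("payload", (("category", "reordering"),)))  # kept: metadata differs
assert len(prompts) == 2  # only fully identical Conversations collapse
```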
This guard is intentionally text-only: multiple variant generators can emit the same string (e.g., different reorder ops landing on identical payloads), and we only want to send each unique prompt once. In that case we keep the first metadata/notes and drop subsequent duplicates.
If that is not the behaviour you were expecting, I can refactor to use a set() of Conversation and follow the approach you suggested instead.
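For contrast, a minimal self-contained sketch of the text-only guard described here (names are illustrative, not the probe's actual structures):

```python
# Text-only dedup: the first prompt carrying a given text wins;
# later duplicates with different notes/metadata are dropped.
seen_prompts: set[str] = set()
prompts: list[dict] = []

def register_prompt(text: str, notes: dict) -> None:
    if text in seen_prompts:
        return  # same payload already queued; first notes seen are kept
    seen_prompts.add(text)
    prompts.append({"text": text, "notes": notes})

register_prompt("pay\u200bload", {"category": "invisible"})
register_prompt("pay\u200bload", {"category": "reordering"})  # dropped
assert len(prompts) == 1
```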
leondz
left a comment
This is great. A few questions and comments but in good shape.
This PR adds a new `BadCharacters` probe that exercises models with imperceptible and structurally tricky Unicode “bad character” perturbations reported in #233.

The probe:
- Enumerates prompt variants that include invisible-character, homoglyph, reordering, and deletion perturbations (the `enabled_categories` listed below).
- Pre-generates the full Cartesian set of prompt variations and then downsamples the pool using `run.soft_probe_prompt_cap`, mirroring the existing `probes.phrasing.*` pattern so runs stay inference-friendly and reproducible.

Key parameters (with defaults):

- `payload_name`: `"harmful_behaviors"`
- `perturbation_budget`: `1`
- `enabled_categories`: `["invisible", "homoglyph", "reordering", "deletion"]`
- `max_position_candidates`: `24`
- `max_reorder_candidates`: `24`
- `max_ascii_variants`: `len(ASCII_PRINTABLE)`
- `follow_prompt_cap`: `True` (honors `soft_probe_prompt_cap`)

These can be tuned per run.
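As a purely illustrative example of per-run tuning, here is the kind of override mapping one might supply for these parameters (how overrides reach the probe, via CLI or config file, is garak-specific and not shown here):

```python
# Hypothetical per-run overrides for the parameters listed above.
badcharacters_overrides = {
    "payload_name": "harmful_behaviors",
    "perturbation_budget": 2,                          # two perturbations per variant
    "enabled_categories": ["invisible", "homoglyph"],  # skip reordering/deletion
    "max_position_candidates": 8,                      # fewer insertion points
    "follow_prompt_cap": True,                         # still honor soft_probe_prompt_cap
}
```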
“Just Try Everything” Strategy
The probe explicitly implements the “just try everything” approach: for a given payload and category, it creates all combinations within the configured budget:
- `DEFAULT_INVISIBLE` characters.
- `_Swap` objects rendered via `_render_swaps`.

This means the full search space is enumerated up front.

Downsampling with `soft_probe_prompt_cap`
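A minimal sketch of this enumerate-then-downsample pattern (assumed shape only; the probe's actual variant generators and sampling live in garak):

```python
import itertools
import random

def enumerate_variants(payloads, perturbations, positions):
    # Full Cartesian product: every payload x perturbation x insertion point.
    return list(itertools.product(payloads, perturbations, positions))

def downsample(variants, cap, seed=0):
    # Seeded sampling keeps the pool reproducible while honoring the cap.
    if cap is None or len(variants) <= cap:
        return variants
    return random.Random(seed).sample(variants, cap)

pool = enumerate_variants(["ignore prior instructions"], ["\u200b", "\u200c"], range(4))
print(len(pool), len(downsample(pool, cap=5)))  # 8 5
```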
Verification

List the steps needed to make sure this thing works:

```
python -m pytest tests/probes/test_probes_badcharacters.py
```