Probe/zero width bad char injection #1489

base: main
Conversation
DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅

I have read the DCO Document and I hereby sign the DCO

recheck
Wow, thank you! Will take a look. NB There are failing tests - could you address these?

I'll dig into the failing tests and push a fix shortly.
Failure 1 (docstring assertion - fixed): `test_probes.py` expects every probe docstring to have at least two paragraphs (summary + detail) separated by a blank line. I've fixed this by expanding the docstring into a proper two-paragraph form.

Failure 2 (langprovider call count mismatch): @leondz, I need some help here. `BadCharacters` is the only probe in this test whose prompts are stored as `garak.attempt.Conversation` objects instead of raw strings. Because of that, `Probe.probe()` takes the "conversation" branch, which calls `langprovider.get_text` once per prompt/turn instead of batching all prompts into a single call. We still do one reverse-translation per attempt, and the test's mock is attached to the same `Passthru` instance for both directions. With 256 prompts, this yields 256 forward + 256 reverse calls (512 total), while the test assumes the string/batched path and expects `len(prompts) + 1` (257) calls.
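For reference, the arithmetic behind the mismatch (numbers taken from the analysis above; a back-of-the-envelope check, not a measurement):

```python
# Call counts for the two Probe.probe() paths, per the analysis above.
n_prompts = 256

# String/batched path the test assumes: one batched forward
# langprovider.get_text call, plus one reverse-translation per attempt.
batched_path_calls = 1 + n_prompts        # 257 == len(prompts) + 1

# Conversation path BadCharacters actually takes: one forward call
# per prompt/turn, plus one reverse-translation per attempt.
conversation_path_calls = n_prompts + n_prompts   # 512

print(batched_path_calls, conversation_path_calls)  # 257 512
```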
This is the reason the test is failing. The test can be updated to account for exercising this new calculation, which can be accomplished by simply adding detection for the
erickgalinkin
left a comment
Overall looks great, thanks! A few places where I need some clarification. Beyond that, `self.prompts` should be `list[str]` and the conversation creation is handled in `_mint_attempt`. This should save us some work. I'll need to do some more local testing as well.
```python
return "".join(rendered)
```

```python
def _load_homoglyph_map() -> dict[str, List[str]]:
```
Does it make more sense to just turn `intentional.txt` into a JSON file so we can load it from disk without all the extra file parsing?
I don't see anywhere that the .txt file is used directly.
`intentional.txt` is the upstream Unicode Security format (https://www.unicode.org/Public/security/latest/intentional.txt), so we can drop in updates directly without maintaining a parallel generated artifact. Parsing is minimal (split on `;` / `#`) and only happens once at init, so there isn't much overhead to avoid. Converting this to JSON would add a regeneration step, but I can switch to JSON if that's what's expected/preferred.
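For illustration, a minimal parsing sketch under the assumption that `intentional.txt` keeps its upstream `source ; target # comment` layout with fields as hex code points; the function name and return shape mirror `_load_homoglyph_map`, but this is not garak's actual code:

```python
from collections import defaultdict
from typing import Dict, List

def load_homoglyph_map(path: str) -> Dict[str, List[str]]:
    """Parse Unicode Security intentional.txt into source -> homoglyphs."""
    homoglyphs: Dict[str, List[str]] = defaultdict(list)
    with open(path, encoding="utf-8") as f:
        for raw in f:
            line = raw.split("#", 1)[0].strip()  # drop trailing comment
            if not line:
                continue  # skip blank / comment-only lines
            fields = [field.strip() for field in line.split(";")]
            if len(fields) < 2:
                continue
            # Fields are space-separated hex code points, e.g. "0621"
            source = "".join(chr(int(cp, 16)) for cp in fields[0].split())
            target = "".join(chr(int(cp, 16)) for cp in fields[1].split())
            homoglyphs[source].append(target)
    return dict(homoglyphs)
```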
```python
probe_instance.probe(generator_instance)
```

```python
expected_provision_calls = len(probe_instance.prompts) + 1
```
Why are we modifying this?
```python
if text in self._seen_prompts:
    return
self._seen_prompts.add(text)
```
Should this filter when the same text is offered with different metadata?
If not, then this could also be simplified by using a set() for self.prompts and just creating the Conversation from the param, letting the set logic use comparison to deduplicate fully unique Conversation objects.
While slightly more runtime cost would be incurred, the long-term memory savings from not holding onto self._seen_prompts for the entire lifecycle of the probe might be a fair trade.
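A rough sketch of this alternative, under the loud assumption that Conversation objects are hashable and comparable (a frozen stand-in is used here; garak's real `garak.attempt.Conversation` may not support this directly):

```python
from dataclasses import dataclass, field

@dataclass(frozen=True)
class Conversation:
    # Hashable stand-in for garak.attempt.Conversation, illustration only.
    text: str
    notes: tuple = field(default_factory=tuple)

prompts: set[Conversation] = set()
prompts.add(Conversation("payload", (("category", "invisible"),)))
prompts.add(Conversation("payload", (("category", "invisible"),)))   # deduplicated
prompts.add(Conversation("payload", (("category", "reordering"),)))  # kept: metadata differs
assert len(prompts) == 2  # only fully identical Conversations collapse
```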
This guard is intentionally text-only: multiple variant generators can emit the same string (e.g., different reorder ops landing on identical payloads), and we only want to send each unique prompt once. In that case we keep the first metadata/notes and drop subsequent duplicates.
If that is not the behaviour you were expecting, I can refactor to use a set() of Conversation and follow the approach you suggested instead.
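For contrast, a minimal self-contained sketch of the text-only guard described here (names are illustrative, not the probe's actual structures):

```python
# Text-only dedup: the first prompt carrying a given text wins;
# later duplicates with different notes/metadata are dropped.
seen_prompts: set[str] = set()
prompts: list[dict] = []

def register_prompt(text: str, notes: dict) -> None:
    if text in seen_prompts:
        return  # same payload already queued; first notes seen are kept
    seen_prompts.add(text)
    prompts.append({"text": text, "notes": notes})

register_prompt("pay\u200bload", {"category": "invisible"})
register_prompt("pay\u200bload", {"category": "reordering"})  # dropped
assert len(prompts) == 1
```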
leondz
left a comment
This is great. A few questions and comments but in good shape.
This PR adds a new `BadCharacters` probe that exercises models with imperceptible and structurally tricky Unicode “bad character” perturbations reported in #233.

The probe:
- Enumerates prompt variants that include invisible-character, homoglyph, reordering, and deletion perturbations (the `enabled_categories` listed below).
- Pre-generates the full Cartesian set of prompt variations and then downsamples the pool using `run.soft_probe_prompt_cap`, mirroring the existing `probes.phrasing.*` pattern so runs stay inference-friendly and reproducible.

Key parameters (with defaults):

- `payload_name`: `"harmful_behaviors"`
- `perturbation_budget`: `1`
- `enabled_categories`: `["invisible", "homoglyph", "reordering", "deletion"]`
- `max_position_candidates`: `24`
- `max_reorder_candidates`: `24`
- `max_ascii_variants`: `len(ASCII_PRINTABLE)`
- `follow_prompt_cap`: `True` (honors `soft_probe_prompt_cap`)

These can be tuned per run.
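As a purely illustrative example of per-run tuning, here is the kind of override mapping one might supply for these parameters (how overrides reach the probe, via CLI or config file, is garak-specific and not shown here):

```python
# Hypothetical per-run overrides for the parameters listed above.
badcharacters_overrides = {
    "payload_name": "harmful_behaviors",
    "perturbation_budget": 2,                          # two perturbations per variant
    "enabled_categories": ["invisible", "homoglyph"],  # skip reordering/deletion
    "max_position_candidates": 8,                      # fewer insertion points
    "follow_prompt_cap": True,                         # still honor soft_probe_prompt_cap
}
```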
“Just Try Everything” Strategy
The probe explicitly implements the “just try everything” approach: for a given payload and category, it creates all combinations within the configured budget:
- `DEFAULT_INVISIBLE` characters.
- `_Swap` objects rendered via `_render_swaps`.

This means the full search space is enumerated up front.

Downsampling with `soft_probe_prompt_cap`
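A minimal sketch of this enumerate-then-downsample pattern (assumed shape only; the probe's actual variant generators and sampling live in garak):

```python
import itertools
import random

def enumerate_variants(payloads, perturbations, positions):
    # Full Cartesian product: every payload x perturbation x insertion point.
    return list(itertools.product(payloads, perturbations, positions))

def downsample(variants, cap, seed=0):
    # Seeded sampling keeps the pool reproducible while honoring the cap.
    if cap is None or len(variants) <= cap:
        return variants
    return random.Random(seed).sample(variants, cap)

pool = enumerate_variants(["ignore prior instructions"], ["\u200b", "\u200c"], range(4))
print(len(pool), len(downsample(pool, cap=5)))  # 8 5
```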
Verification

List the steps needed to make sure this thing works:

```
python -m pytest tests/probes/test_probes_badcharacters.py
```