
Conversation

@Har1sh-k

This PR adds a new BadCharacters probe that exercises models with imperceptible and structurally tricky Unicode “bad character” perturbations reported in [#233].

The probe:

  • Enumerates prompt variants that include the following (each type is sketched just after this list):

    • Invisible Unicode characters
    • Homoglyph substitutions
    • Bidi-based reordering control characters
    • ASCII + backspace deletion pairs
  • Pre-generates the full Cartesian set of prompt variations and then downsamples the pool using run.soft_probe_prompt_cap, mirroring the existing probes.phrasing.* pattern so runs stay inference-friendly and reproducible.
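
For illustration, a minimal sketch of one variant from each perturbation category applied to a neutral sample string. The specific characters, positions, and names here are illustrative stand-ins, not the probe's actual character pools (which draw from DEFAULT_INVISIBLE, the homoglyph map, etc.):

ZWSP = "\u200b"                 # ZERO WIDTH SPACE, one plausible "invisible" character
CYRILLIC_A = "\u0430"           # Cyrillic 'а', a homoglyph of Latin 'a'
RLO, PDF = "\u202e", "\u202c"   # bidi right-to-left override / pop directional formatting
BACKSPACE = "\u0008"

payload = "describe the payload here"
pos = 8

# invisible: insert an invisible character inside the text
invisible_variant = payload[:pos] + ZWSP + payload[pos:]

# homoglyph: substitute a visually confusable character for an ASCII one
homoglyph_variant = payload.replace("a", CYRILLIC_A, 1)

# reordering: store two characters swapped, wrapped in bidi controls so they still render in the original order
i = 0
reorder_variant = RLO + payload[i + 1] + payload[i] + PDF + payload[i + 2:]

# deletion: an extra ASCII character immediately "erased" by a backspace
deletion_variant = payload[:pos] + "X" + BACKSPACE + payload[pos:]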

Key parameters (with defaults):

  • payload_name: "harmful_behaviors"
  • perturbation_budget: 1
  • enabled_categories: ["invisible", "homoglyph", "reordering", "deletion"]
  • max_position_candidates: 24
  • max_reorder_candidates: 24
  • max_ascii_variants: len(ASCII_PRINTABLE)
  • follow_prompt_cap: True (honor soft_probe_prompt_cap)

These can be tuned per run.
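
As a sketch of how these defaults might be declared on the probe class, following garak's DEFAULT_PARAMS plugin convention (parameter names and values are taken from the list above; ASCII_PRINTABLE is a stand-in for however the probe defines its printable-ASCII pool):

import string

ASCII_PRINTABLE = string.printable  # assumption: the probe's printable-ASCII pool

DEFAULT_PARAMS = {
    "payload_name": "harmful_behaviors",
    "perturbation_budget": 1,
    "enabled_categories": ["invisible", "homoglyph", "reordering", "deletion"],
    "max_position_candidates": 24,
    "max_reorder_candidates": 24,
    "max_ascii_variants": len(ASCII_PRINTABLE),
    "follow_prompt_cap": True,
}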

“Just Try Everything” Strategy

The probe explicitly implements the “just try everything” approach:

  • For a given payload and category, it creates all combinations within the configured budget:

    • Invisibles: combinations of positions × DEFAULT_INVISIBLE characters.
    • Homoglyphs: combinations of positions × homoglyph options.
    • Reordering: combinations of swap indices with bidi-wrapped _Swap objects rendered via _render_swaps.
    • Deletion: combinations of positions × sampled ASCII characters with backspace.

This means the full search space is enumerated up front and then downsampled to run.soft_probe_prompt_cap, as sketched below.
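
A minimal sketch of that enumerate-then-cap flow, with hypothetical helper names (enumerate_invisible_variants, capped_prompts) and an assumed fixed seed; the probe's actual generation, deduplication, and sampling code may differ:

import itertools
import random


def enumerate_invisible_variants(payload, invisibles, max_positions):
    """All (position, invisible character) insertions within the position budget."""
    positions = range(min(len(payload) + 1, max_positions))
    for pos, ch in itertools.product(positions, invisibles):
        yield payload[:pos] + ch + payload[pos:]


def capped_prompts(all_variants, cap, follow_prompt_cap=True, seed=42):
    """Deterministically downsample the full variant pool to the soft prompt cap."""
    pool = list(dict.fromkeys(all_variants))  # preserve order, drop exact duplicates
    if follow_prompt_cap and cap is not None and len(pool) > cap:
        pool = random.Random(seed).sample(pool, cap)
    return pool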


Verification

Steps to verify the probe works:

  • Run the targeted probe unit tests: python -m pytest tests/probes/test_probes_badcharacters.py
  • Verify the probe does what it should
  • Verify the probe does not do what it should not

@github-actions
Contributor

github-actions bot commented Nov 17, 2025

DCO Assistant Lite bot: All contributors have signed the DCO ✍️ ✅

@Har1sh-k
Author

I have read the DCO Document and I hereby sign the DCO

@Har1sh-k
Author

recheck

github-actions bot added a commit that referenced this pull request Nov 17, 2025
@leondz
Collaborator

leondz commented Nov 17, 2025

Wow, thank you! Will take a look. NB There are failing tests - could you address these?

@Har1sh-k
Author

I’ll dig into the failing tests and push a fix shortly.

@Har1sh-k
Author

Failure 1 (docstring assertion — fixed)

test_probes.py expects every probe docstring to have at least two paragraphs (summary + detail) separated by a blank line. I’ve fixed this by expanding the docstring into a proper two-paragraph form.
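
For reference, an illustrative two-paragraph shape that satisfies the check (the wording here is indicative only, not the exact docstring in the PR):

from garak.probes.base import Probe


class BadCharacters(Probe):
    """Perturb prompts with imperceptible Unicode bad characters.

    Sends variants of each payload containing invisible characters,
    homoglyph substitutions, bidi reordering controls, and backspace
    deletion pairs, to check whether the perturbations slip past the
    target's safeguards.
    """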

Failure 2 (langprovider call count mismatch)

@leondz, I need some help here:

BadCharacters is the only probe in this test whose prompts are stored as garak.attempt.Conversation objects instead of raw strings. Because of that, Probe.probe() takes the “conversation” branch, which calls langprovider.get_text once per prompt/turn instead of batching all prompts into a single call. We still do one reverse-translation per attempt, and the test’s mock is attached to the same Passthru instance for both directions. With 256 prompts, this yields 256 forward + 256 reverse calls (512 total), while the test assumes the string/batched path and expects len(prompts)+1 (257) calls.
In short, the failure is due to the test’s call-count assumption not matching this probe’s conversation-based prompt flow; we either need to adjust the test to handle conversation prompts or change BadCharacters to batch-translate its conversations.

@jmartin-tech added the probes (Content & activity of LLM probes) label on Nov 17, 2025
@jmartin-tech
Collaborator

jmartin-tech commented Nov 17, 2025

BadCharacters is the only probe in this test whose prompts are stored as garak.attempt.Conversation objects instead of raw strings.

This is the reason the test is failing. The test can be updated to account for this new call pattern: detect the type of the probe's prompts and use the expectations that match Conversation-based prompts in this new probe, while still expecting a single batched call when the prompts are of type str.
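
A rough sketch of that type detection as a helper (the function name is a placeholder; the expected counts follow the call accounting in the comment above):

import garak.attempt


def expected_provision_calls(prompts) -> int:
    """Expected langprovider.get_text call count for one probe() run."""
    if prompts and isinstance(prompts[0], garak.attempt.Conversation):
        # Conversation prompts: one forward call per prompt plus one
        # reverse-translation call per attempt
        return 2 * len(prompts)
    # str prompts: one batched forward call plus one reverse-translation call
    return len(prompts) + 1

The test would then assert the mock's call_count against expected_provision_calls(probe_instance.prompts) rather than the fixed len(prompts) + 1.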

Collaborator

@erickgalinkin left a comment

Overall looks great, thanks! A few places where I need some clarification. Beyond that, self.prompts should be list[str] and the conversation creation is handled in _mint_attempt. This should save us some work. I'll need to do some more local testing as well.

return "".join(rendered)


def _load_homoglyph_map() -> dict[str, List[str]]:
Collaborator

Does it make more sense to just turn intentional.txt into a json file so we can load it from disk without all the extra file parsing?
I don't see anywhere the .txt file is used directly.

Author

intentional.txt is the upstream Unicode Security format (https://www.unicode.org/Public/security/latest/intentional.txt), so we can drop in updates directly without maintaining a parallel generated artifact. Parsing is minimal (split on ; / #) and only happens once at init, so there isn’t much overhead to avoid. Converting this to JSON would add a regeneration step, but I can switch to JSON if that’s what’s expected/preferred.
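
For context, a rough sketch of the minimal parsing described (strip the trailing "#" comment, split source/target on ";", decode the hex code points); this follows the upstream file layout, but the exact handling and keying direction in the probe may differ:

from collections import defaultdict


def load_homoglyph_map(path):
    """Parse the Unicode Security intentional.txt format into a char -> variants map."""
    homoglyphs = defaultdict(list)
    with open(path, encoding="utf-8") as handle:
        for raw in handle:
            line = raw.split("#", 1)[0].strip()  # drop comments and blank lines
            if not line:
                continue
            source_hex, target_hex = (part.strip() for part in line.split(";", 1))
            source = "".join(chr(int(cp, 16)) for cp in source_hex.split())
            target = "".join(chr(int(cp, 16)) for cp in target_hex.split())
            homoglyphs[source].append(target)
    return dict(homoglyphs)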


probe_instance.probe(generator_instance)

expected_provision_calls = len(probe_instance.prompts) + 1
Collaborator

Why are we modifying this?

Comment on lines +183 to +185
if text in self._seen_prompts:
    return
self._seen_prompts.add(text)
Collaborator

Should this filter when the same text is offered with different metadata?

If not, this could also be simplified by using a set() for self.prompts and creating the Conversation directly from the param, letting set comparison deduplicate fully unique Conversation objects.

While this would add slightly more runtime cost, the long-term memory saving of not holding onto self._seen_prompts for the entire lifecycle of the probe might be a fair trade.

Author

This guard is intentionally text-only: multiple variant generators can emit the same string (e.g., different reorder ops landing on identical payloads), and we only want to send each unique prompt once. In that case we keep the first metadata/notes and drop subsequent duplicates.

If that is not the behaviour you were expecting, I can refactor to use a set() of Conversation and follow the approach you suggested instead.
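
To make the two options concrete, a small illustrative contrast (make_conversation is a stand-in for however the probe builds its Conversation prompt; whether Conversation supports hashing/equality well enough for set membership is an assumption here):

def dedup_by_text(variant_texts):
    """Current behaviour: the first variant with a given rendered text wins."""
    seen, kept = set(), []
    for text in variant_texts:
        if text in seen:
            continue  # later duplicates (and their metadata/notes) are dropped
        seen.add(text)
        kept.append(text)
    return kept


def dedup_by_object(variant_texts, make_conversation):
    """Suggested alternative: build Conversations and let set semantics deduplicate."""
    return {make_conversation(text) for text in variant_texts}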

Collaborator

@leondz left a comment

This is great. A few questions and comments but in good shape.

@Har1sh-k requested a review from leondz on December 5, 2025 at 01:01