fix(prompts): keep semantic summaries in source language #1521

Open
yeyitech wants to merge 2 commits into volcengine:main from yeyitech:fix/issue-1067-same-language-semantic

Conversation

@yeyitech
Contributor

Summary

  • strengthen semantic summary and overview prompts so output follows the dominant source language instead of drifting to English
  • keep file names, API names, and identifiers in their original form while requiring explanatory prose to follow output_language
  • add focused prompt-contract tests for file/document summary and overview generation templates

Closes #1067
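
A prompt-contract test of the kind described in the summary could look roughly like the sketch below. It is not taken from the PR: the template text, the `render` helper, and `check_language_contract` are hypothetical stand-ins, and the real tests in tests/storage/test_semantic_processor_language.py may be structured differently.

```python
import re

# Hypothetical stand-in for one semantic prompt template (not the real file).
TEMPLATE = """Summarize the following file.

Language requirements:
- Write the summary in {{ output_language }}, matching the dominant natural language used in the source content
- Do not default to English unless the source content is predominantly English
- Keep unavoidable file names, code identifiers, API names, and quoted literals in their original form, but keep all explanatory prose in {{ output_language }}

Content:
{{ content }}
"""


def render(template: str, **variables: str) -> str:
    """Minimal {{ name }} substitution, standing in for the real template engine."""
    def substitute(match: re.Match) -> str:
        return str(variables[match.group(1)])

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


def check_language_contract(prompt: str, output_language: str) -> list[str]:
    """Return the contract phrases that are missing from a rendered prompt."""
    required = [
        f"Write the summary in {output_language}",
        "Do not default to English",
        f"explanatory prose in {output_language}",
    ]
    return [phrase for phrase in required if phrase not in prompt]


prompt = render(TEMPLATE, output_language="zh-CN", content="def main(): ...")
missing = check_language_contract(prompt, "zh-CN")
assert missing == [], f"prompt is missing language rules: {missing}"
```

Checking the rendered prompt rather than model output keeps such a test deterministic: it verifies that the instructions are present, not that a model obeys them.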

Testing

  • pytest tests/storage/test_semantic_processor_language.py (blocked by repo-side pytest-asyncio collection issue in this environment)
  • PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest -o addopts='' tests/storage/test_semantic_processor_language.py::TestSemanticPromptLanguageContract tests/storage/test_semantic_processor_language.py::TestOverviewGenerationFlow

@github-actions

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

🎫 Ticket compliance analysis 🔶

1067 - Partially compliant

Compliant requirements:

  • All three affected semantic prompt templates are updated with language requirements
  • Added language instructions to keep summaries in the source content's dominant language
  • Added tests to verify the language contract in the prompts
  • Preserves file names, API names, and identifiers while requiring prose to follow the output language

Non-compliant requirements:

  • No requirements are left unfulfilled

Requires further human verification:

  • No items require further human verification
⏱️ Estimated effort to review: 2 🔵🔵⚪⚪⚪
🏅 Score: 95
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 No multiple PR themes
⚡ No major issues detected

@github-actions

PR Code Suggestions ✨

Explore these optional code suggestions:

Category: General
Extract repeated prompt section to shared snippet

The language requirements section is repeated almost verbatim across multiple prompt
templates. Extract this into a shared Jinja2 template snippet to avoid duplication
and ensure consistency.

openviking/prompts/templates/semantic/document_summary.yaml [36-40]

-Language requirements:
-- Write the summary in {{ output_language }}, matching the dominant natural language used in the source content
-- Do not default to English unless the source content is predominantly English
-- If the content mixes languages, follow the dominant human-language prose and use it consistently for the whole summary
-- Keep unavoidable file names, code identifiers, API names, and quoted literals in their original form, but keep all explanatory prose in {{ output_language }}
+{% include 'semantic/language_requirements_shared.yaml' %}
Suggestion importance [1-10]: 5

Why: The suggestion correctly identifies repeated language requirements sections across multiple prompt templates. Extracting them to a shared Jinja2 snippet would improve maintainability and consistency, making this a moderate-impact improvement.
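
To illustrate the suggestion, here is a minimal sketch of sharing one language-requirements snippet between two templates through Jinja2's `{% include %}`. The template names and the in-memory `DictLoader` setup are assumptions made for the example; whether the project's own prompt loader resolves includes from YAML-based templates is not confirmed by this PR.

```python
from jinja2 import DictLoader, Environment

# Shared snippet, using wording from the diff above.
LANGUAGE_REQUIREMENTS = (
    "Language requirements:\n"
    "- Write the summary in {{ output_language }}, matching the dominant "
    "natural language used in the source content\n"
    "- Do not default to English unless the source content is predominantly English\n"
)

# Hypothetical template names, held in memory so the example is self-contained.
templates = {
    "semantic/language_requirements.j2": LANGUAGE_REQUIREMENTS,
    "semantic/document_summary.j2": (
        "Summarize the document below.\n\n"
        "{% include 'semantic/language_requirements.j2' %}\n\n"
        "Document:\n{{ content }}\n"
    ),
    "semantic/file_summary.j2": (
        "Summarize the file below.\n\n"
        "{% include 'semantic/language_requirements.j2' %}\n\n"
        "File:\n{{ content }}\n"
    ),
}

env = Environment(loader=DictLoader(templates))


def render_prompt(name: str, **variables: str) -> str:
    """Render a named template; included snippets see the same variables."""
    return env.get_template(name).render(**variables)


doc_prompt = render_prompt(
    "semantic/document_summary.j2", output_language="French", content="Bonjour"
)
file_prompt = render_prompt(
    "semantic/file_summary.j2", output_language="French", content="print('salut')"
)
print(doc_prompt)
```

Because included templates receive the caller's context by default, `output_language` only has to be passed once per render, and a wording change in the shared snippet reaches every template that includes it.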

Impact: Low

@CLAassistant

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you all sign our Contributor License Agreement before we can accept your contribution.
1 out of 2 committers have signed the CLA.

✅ yeyitech
❌ Codex


Codex does not appear to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you already have a GitHub account, please add the email address used for this commit to your account.

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

Bug: VLM prompt templates force English output regardless of content language

2 participants