Skip to content

feat(parse): add multimodal parser semantic artifact MVP#1555

Open
yeyitech wants to merge 1 commit intovolcengine:mainfrom
yeyitech:feat/issue-372-multimodal-parser-mvp
Open

feat(parse): add multimodal parser semantic artifact MVP#1555
yeyitech wants to merge 1 commit intovolcengine:mainfrom
yeyitech:feat/issue-372-multimodal-parser-mvp

Conversation

@yeyitech
Copy link
Copy Markdown
Contributor

Summary

  • upgrade image/audio/video parsers to write semantic sidecars instead of metadata-only placeholder roots
  • reuse lightweight OCR/VLM/ASR-style helpers when available and emit graceful fallback text when they are not
  • add focused parser tests covering semantic artifacts and fallback outputs for multimodal media

Testing

  • python -m py_compile openviking/parse/parsers/media/image.py openviking/parse/parsers/media/audio.py openviking/parse/parsers/media/video.py tests/parse/test_media_parser_semantics.py
  • PYTEST_DISABLE_PLUGIN_AUTOLOAD=1 pytest --noconftest -q -o addopts= tests/parse/test_media_parser_semantics.py

@CLAassistant
Copy link
Copy Markdown

CLA assistant check
Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Codex seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@github-actions
Copy link
Copy Markdown

PR Reviewer Guide 🔍

Here are some key observations to aid the review process:

⏱️ Estimated effort to review: 3 🔵🔵🔵⚪⚪
🏅 Score: 88
🧪 PR contains tests
🔒 No security concerns identified
✅ No TODO sections
🔀 Multiple PR themes

Sub-PR theme: Add semantic artifacts for image parser

Relevant files:

  • openviking/parse/parsers/media/image.py
  • tests/parse/test_media_parser_semantics.py

Sub-PR theme: Add semantic artifacts for audio parser

Relevant files:

  • openviking/parse/parsers/media/audio.py
  • tests/parse/test_media_parser_semantics.py

Sub-PR theme: Add semantic artifacts for video parser

Relevant files:

  • openviking/parse/parsers/media/video.py
  • tests/parse/test_media_parser_semantics.py

⚡ Recommended focus areas for review

Private API Usage

VideoParser calls private methods (_asr_transcribe_with_timestamps, _asr_transcribe, _vlm_describe) from AudioParser and ImageParser. This creates tight coupling and may break if those internal implementations change.

    audio_parser = AudioParser()
    timestamped = await audio_parser._asr_transcribe_with_timestamps(audio_bytes, None)
    if timestamped:
        return "timestamped", timestamped

    plain = await audio_parser._asr_transcribe(audio_bytes, None)
    if plain:
        return ("plain" if "unavailable" not in plain.lower() and "failed" not in plain.lower() else "unavailable"), plain
    return "unavailable", "Audio transcription unavailable: extracted audio track produced no transcript."

async def _extract_audio_track(self, file_path: Path) -> Optional[bytes]:
    """Extract a WAV audio track using ffmpeg when available."""
    ffmpeg_path = shutil.which("ffmpeg")
    if not ffmpeg_path:
        return None

    def _extract() -> Optional[bytes]:
        result = subprocess.run(
            [
                ffmpeg_path,
                "-y",
                "-i",
                str(file_path),
                "-vn",
                "-acodec",
                "pcm_s16le",
                "-ar",
                "16000",
                "-ac",
                "1",
                "-f",
                "wav",
                "pipe:1",
            ],
            capture_output=True,
            check=True,
        )
        return result.stdout or None

    try:
        return await asyncio.to_thread(_extract)
    except Exception as exc:
        logger.debug("ffmpeg audio extraction failed for %s: %s", file_path, exc)
        return None

async def _describe_preview_frame(
    self,
    preview_bytes: bytes,
    instruction: str = "",
) -> Optional[str]:
    """Describe the preview frame through the existing image VLM helper."""
    image_parser = ImageParser()
    if not image_parser._is_vlm_available():
        return None
    description = await image_parser._vlm_describe(
        preview_bytes,
        model=None,
        instruction=instruction,
    )
Unvalidated Magic Bytes

Audio signature validation uses b"\x00\x00\x00" as a magic byte for .m4a, which will incorrectly match any file starting with three null bytes, regardless of actual format.

    ".m4a": [b"\x00\x00\x00", b"ftypM4A", b"ftypisom"],
    ".opus": [b"OggS"],
}

ext_lower = file_path.suffix.lower()
magic_list = audio_magic_bytes.get(ext_lower, [])
if not any(
    len(audio_bytes) >= len(magic) and audio_bytes.startswith(magic) for magic in magic_list
):
    raise ValueError(
        f"Invalid audio file: {file_path}. "
        f"File signature does not match expected format {ext_lower}"
    )
Unvalidated Magic Bytes

Video signature validation uses b"\x00\x00\x00" as a magic byte for .mp4/.mov, which will incorrectly match any file starting with three null bytes, regardless of actual format.

video_magic_bytes = {
    ".mp4": [b"\x00\x00\x00", b"ftyp"],
    ".avi": [b"RIFF"],
    ".mov": [b"\x00\x00\x00", b"ftyp"],
    ".mkv": [b"\x1a\x45\xdf\xa3"],
    ".webm": [b"\x1a\x45\xdf\xa3"],
    ".flv": [b"FLV"],
    ".wmv": [b"\x30\x26\xb2\x75\x8e\x66\xcf\x11"],
}

ext_lower = file_path.suffix.lower()
magic_list = video_magic_bytes.get(ext_lower, [])
if not any(
    len(video_bytes) >= len(magic) and video_bytes.startswith(magic) for magic in magic_list
):
    raise ValueError(
        f"Invalid video file: {file_path}. "
        f"File signature does not match expected format {ext_lower}"
    )

@github-actions
Copy link
Copy Markdown

PR Code Suggestions ✨

Explore these optional code suggestions:

CategorySuggestion                                                                                                                                    Impact
Api resilience
Add timeouts to OpenAI ASR requests

Add explicit timeout configuration to the OpenAI client and transcription requests
to avoid hanging indefinitely. The OpenAI Python client supports timeouts at both
the client and request levels.

openviking/parse/parsers/media/audio.py [310-319]

+client_kwargs["timeout"] = 60.0  # Add default timeout for all client requests
 client = openai.OpenAI(**client_kwargs)
 with tempfile.NamedTemporaryFile(mode="wb", suffix=".wav", delete=False) as temp_file:
     temp_file.write(audio_bytes)
     temp_file_path = temp_file.name
 
 with open(temp_file_path, "rb") as audio_file:
     response = client.audio.transcriptions.create(
         model=model_name,
         file=audio_file,
         language=self.config.language,
+        timeout=120.0,  # Add explicit request-level timeout for transcription
     )
Suggestion importance[1-10]: 5

__

Why: Adding timeouts to OpenAI client and requests improves resilience by avoiding indefinite hangs. This is a moderate robustness improvement with clear value.

Low

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

Status: Backlog

Development

Successfully merging this pull request may close these issues.

2 participants