-
Notifications
You must be signed in to change notification settings - Fork 3.2k
feat: add option to disable timing metadata in ASR transcription #2622
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
base: main
Are you sure you want to change the base?
Conversation
|
✅ DCO Check Passed Thanks @akanshajain231999, all your commits are properly signed off. 🎉 |
Merge ProtectionsYour pull request matches the following merge protections and will not be merged until they are valid. 🟢 Enforce conventional commitWonderful, this rule succeeded.Make sure that we follow https://www.conventionalcommits.org/en/v1.0.0/
|
….com> I, akanshajain231999 <[email protected]>, hereby add my Signed-off-by to this commit: b0f6e50 Signed-off-by: akanshajain231999 <[email protected]>
|
Hey @ceberam , Can you please review this PR? |
Codecov Report✅ All modified and coverable lines are covered by tests. 📢 Thoughts on this report? Let us know! |
ceberam
left a comment
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
@akanshajain231999 Thanks a lot for your contribution and detailed documentation and testing of this PR.
In terms of design:
- This PR is intended to address the issue #2564 , Would be nice to have an argument to disable printing the timing metadata.. It looks to me as a reasonable request to customize the text serialization of an audio file parsed as
DoclingDocument. One may want to extract the text from theDoclingDocumentwith the time metadata or without it. Note that your implementation is not addressing a serialization option but rather the parsing of the raw file intoDoclingDocument. If a user sets the pipeline optioninclude_time_metadata = False, the time metadata information will be lost and the convertedDoclingDocumentwill not hold this information. I think it would be more flexible thatDoclingDocumentkeeps the information and the user can then decide whether to extract/export all the information or just the text (without time metadata). Some alternatives could be:- keep everything in
TextItem.orig(untreated representation) and only the text intext(sanitized representation). Note that text export formats like markdown will usetextto serialize. This option would remove the hassle of dealing with pipeline options. - we are currently working in a new data model for audio provenance items in
docling-core. The metadata information would be stored there and it would help manage metadata information from ASR and WebVTT files, as well as unlocking time-dependent chunking. Please see my final remark later on.
- keep everything in
- I have the impression that issue #2564 was about any type of metadata in ASR, i.e., both the time and the speaker metadata. In this PR, the option
include_time_metadata = Falsewould still keep the speaker annotation, which may also be annoying for those who just want to process text.
More technically:
- Name of the CLI option: I find a bit confusing
--asr-no-timing, because of the double negation of the explicit False option--no-asr-no-timing. I would have simply used--asr-metadataand--no-asr-metadata(and included the speaker annotation like explained above). - In test files, please try to avoid converting the same file multiple times across the same test module (e.g., like in
test_asr_pipeline_with_time_metadata_default), not to make the test suite unnecessary longer. You can use fixtures with amodulescope.
Since we do not want to introduce features that may be deprecated in the short term, please allow us some few days (expected next week) and we will get back with further suggestions on this PR.
|
@ceberam Thanks for the detailed review. I will wait for your response until next week. |
@akanshajain231999 you are very welcome to contribute to Docling, this is an open-source collaborative project 🙂 |
Feature: Option to disable timing metadata during ASR transcription
This PR adds a new feature that allows users to disable the printing of timing metadata in ASR (Automatic Speech Recognition) transcription output. By default, timing information like
[time: 0.0-2.5]is included in the transcribed text, but users can now opt out of this behavior.Changes Made:
Core Implementation:
include_time_metadata: bool = Truefield toInlineAsrOptionsclass indocling/datamodel/pipeline_options_asr_model.py_ConversationItem.to_string()method to accept aninclude_time_metadataparameter_NativeWhisperModeland_MlxWhisperModelto respect the new settingCLI Integration:
--asr-no-timingflag to disable timing metadata via CLITests:
test_asr_pipeline_without_time_metadata()- verifies timing metadata can be disabledtest_asr_pipeline_with_time_metadata_default()- verifies timing metadata is enabled by defaulttest_conversation_item_to_string_with_and_without_time()- unit tests for theto_string()methodBackward Compatibility:
Usage Examples
Programmatic:
CLI:
Issue resolved by this Pull Request: Resolves #2564
Screenshot:
Checklist: