Open
Labels
enhancement, priority:medium
Overview
Add support for legacy Microsoft Office formats (DOC, XLS, PPT), which are stored in the OLE2 Compound File Binary Format.
Parent Epic
Part of #91 - Document & Office Format Awareness
Description
Parse OLE2 structured storage to extract document properties, embedded text, and metadata from legacy Office files.
Implementation Details
- Use the `cfb` crate for OLE2 parsing
- Extract Document Summary Information
- Parse format-specific streams (WordDocument, Workbook, PowerPoint Document)
- Handle embedded objects
- Extract VBA macro code as strings (never execute it)
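The stream-identification step above can be sketched as follows. This is a minimal illustration, not the planned implementation: `LegacyFormat`, `classify_streams`, and `is_property_stream` are hypothetical names, and the function takes stream names as plain strings so that it assumes nothing about the OLE2 parser's API. The "Book" alias for older Excel files is an addition not mentioned in this issue.

```rust
/// Legacy Office format inferred from stream names in an OLE2 container.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum LegacyFormat {
    Doc,
    Xls,
    Ppt,
    Unknown,
}

/// Classify a container by the names of its top-level streams.
/// The caller is expected to obtain the names from whatever OLE2
/// parser is in use; none is assumed here.
pub fn classify_streams<'a, I>(names: I) -> LegacyFormat
where
    I: IntoIterator<Item = &'a str>,
{
    for name in names {
        match name {
            "WordDocument" => return LegacyFormat::Doc,
            // Assumption: also accept "Book", used by Excel 5.0/95 files.
            "Workbook" | "Book" => return LegacyFormat::Xls,
            "PowerPoint Document" => return LegacyFormat::Ppt,
            _ => {}
        }
    }
    LegacyFormat::Unknown
}

/// True if the name is one of the two standard property-set streams.
/// Both names begin with the control character U+0005.
pub fn is_property_stream(name: &str) -> bool {
    matches!(
        name,
        "\u{5}SummaryInformation" | "\u{5}DocumentSummaryInformation"
    )
}
```

Keeping the classification separate from the parser makes it easy to unit-test with plain string lists before any Office 97-2003 fixtures are available.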
String Sources
- Document properties (title, author, company, keywords)
- Summary information
- Embedded text (where accessible)
- VBA macro source code
- Hyperlinks
- Embedded object metadata
Acceptance Criteria
- Parse OLE2 structure
- Extract document properties
- Identify DOC, XLS, PPT streams
- Extract accessible text
- Handle VBA macro storage
- Skip binary data sections
- Tests with Office 97-2003 files
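One possible way to meet the "extract accessible text" and "skip binary data sections" criteria is a printable-run scan over raw stream bytes. The sketch below is an assumption about approach, not something this issue specifies: `ascii_runs` and `utf16le_runs` are hypothetical helpers, and the UTF-16LE scan only recognises ASCII-range code units at even offsets.

```rust
/// Collect runs of printable ASCII bytes that are at least `min_len` long.
/// Any other byte ends the current run, so binary sections are skipped.
pub fn ascii_runs(data: &[u8], min_len: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    for &b in data {
        if (0x20..=0x7e).contains(&b) {
            current.push(b as char);
        } else {
            if current.len() >= min_len {
                out.push(std::mem::take(&mut current));
            }
            current.clear();
        }
    }
    if current.len() >= min_len {
        out.push(current);
    }
    out
}

/// Collect runs of UTF-16LE code units in the printable ASCII range.
/// Limitation of this sketch: only even byte offsets are scanned and
/// non-ASCII characters end a run.
pub fn utf16le_runs(data: &[u8], min_len: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();
    for pair in data.chunks_exact(2) {
        let unit = u16::from_le_bytes([pair[0], pair[1]]);
        if (0x20..=0x7e).contains(&unit) {
            // Safe narrowing: the range check guarantees unit fits in a u8.
            current.push(unit as u8 as char);
        } else {
            if current.len() >= min_len {
                out.push(std::mem::take(&mut current));
            }
            current.clear();
        }
    }
    if current.len() >= min_len {
        out.push(current);
    }
    out
}
```

A minimum length of around 4 is a common choice for filtering noise, but the right threshold for this project is an open question.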
Note
Modern Office files (DOCX, XLSX, PPTX) are ZIP-based and covered in Phase 2.
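Because the two generations of formats use different containers, a file can be routed to the right parser from its first bytes. The following is a minimal sketch under that assumption; `Container` and `sniff_container` are hypothetical names. The signatures themselves are the standard ones: an 8-byte header signature for OLE2 compound files and `PK\x03\x04` for a ZIP local file header.

```rust
/// Header signature of an OLE2 compound file.
pub const OLE2_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];
/// Signature of a ZIP local file header ("PK\x03\x04").
pub const ZIP_MAGIC: [u8; 4] = [b'P', b'K', 0x03, 0x04];

/// Container kind guessed from the leading bytes of a file.
/// (Hypothetical type, introduced only for this sketch.)
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Container {
    Ole2,
    Zip,
    Other,
}

/// Inspect the start of a file and report which container it looks like.
/// Inputs shorter than a signature never match that signature.
pub fn sniff_container(header: &[u8]) -> Container {
    if header.starts_with(&OLE2_MAGIC) {
        Container::Ole2
    } else if header.starts_with(&ZIP_MAGIC) {
        Container::Zip
    } else {
        Container::Other
    }
}
```

A ZIP match only indicates a ZIP archive, not necessarily a DOCX, XLSX, or PPTX file; confirming that would require inspecting the archive contents.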
Related
Project: #76