Feature: Microsoft Office (OLE2) format support #93

@coderabbitai

Overview

Add support for legacy Microsoft Office formats (DOC, XLS, PPT), which are stored in the OLE2 Compound File Binary (CFB) format.

Parent Epic

Part of #91 - Document & Office Format Awareness

Description

Parse OLE2 structured storage to extract document properties, embedded text, and metadata from legacy Office files.
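Before any structured-storage parsing, a cheap first step is to check for the 8-byte signature that starts every OLE2 compound file. The following is a minimal sketch using only the standard library; `is_ole2` is a hypothetical helper name, not part of the cfb crate or of this project.

```rust
/// The 8-byte signature at offset 0 of every OLE2 / Compound File Binary file.
const OLE2_MAGIC: [u8; 8] = [0xD0, 0xCF, 0x11, 0xE0, 0xA1, 0xB1, 0x1A, 0xE1];

/// Returns true when `data` begins with the OLE2 signature.
/// Hypothetical helper: the name and placement are assumptions.
pub fn is_ole2(data: &[u8]) -> bool {
    // `starts_with` also returns false for inputs shorter than 8 bytes.
    data.starts_with(&OLE2_MAGIC)
}
```

A check like this only says the container is OLE2. It does not distinguish DOC from XLS or PPT, and ZIP-based files (which start with `PK`) are rejected.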

Implementation Details

  • Use cfb crate for OLE2 parsing
  • Extract Document Summary Information
  • Parse format-specific streams (WordDocument, Workbook, PowerPoint Document)
  • Handle embedded objects
  • Extract VBA macro source code as strings (never execute it)
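Stream identification could be sketched as below. This assumes the stream names have already been listed by the OLE2 parser (for example, collected from the cfb crate's directory entries) and are passed in as plain strings. `OfficeKind` and `classify_streams` are hypothetical names introduced for illustration. The `Book` alias for older Excel workbooks is an addition beyond what the issue lists.

```rust
/// Which legacy Office format a compound file appears to contain.
/// Hypothetical type, introduced only for this sketch.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum OfficeKind {
    Doc,
    Xls,
    Ppt,
    Unknown,
}

/// Classifies a compound file by the names of its top-level streams.
/// The first recognized stream name wins.
pub fn classify_streams<'a, I>(names: I) -> OfficeKind
where
    I: IntoIterator<Item = &'a str>,
{
    for name in names {
        match name {
            // Main stream of a Word 97-2003 document.
            "WordDocument" => return OfficeKind::Doc,
            // "Workbook" for Excel 97-2003; "Book" for older Excel files.
            "Workbook" | "Book" => return OfficeKind::Xls,
            // Main stream of a PowerPoint 97-2003 presentation.
            "PowerPoint Document" => return OfficeKind::Ppt,
            // Property sets, VBA storage, and other entries are ignored here.
            _ => {}
        }
    }
    OfficeKind::Unknown
}
```

Matching on exact names keeps the sketch simple. A real implementation may need to decide how to treat files that contain more than one recognized stream, such as documents with embedded objects.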

String Sources

  • Document properties (title, author, company, keywords)
  • Summary information
  • Embedded text (where accessible)
  • VBA macro source code
  • Hyperlinks
  • Embedded object metadata

Acceptance Criteria

  • Parse OLE2 structure
  • Extract document properties
  • Identify DOC, XLS, PPT streams
  • Extract accessible text
  • Handle VBA macro storage
  • Skip binary data sections
  • Tests with Office 97-2003 files
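For the "extract accessible text" and "skip binary data sections" criteria, one possible approach is a strings-style scan over raw stream bytes that keeps runs of printable characters and drops everything else. The sketch below is an assumption about how that might look, not the planned implementation. It handles only printable ASCII encoded as UTF-16LE, and only on 2-byte boundaries from the start of the buffer. Text stored in 8-bit code pages is not covered. `utf16le_strings` and `min_len` are hypothetical names.

```rust
/// Scans `data` as little-endian UTF-16 code units and returns every run of
/// printable ASCII characters that is at least `min_len` characters long.
/// Anything else (binary data, control characters, non-ASCII) ends a run.
pub fn utf16le_strings(data: &[u8], min_len: usize) -> Vec<String> {
    let mut out = Vec::new();
    let mut current = String::new();

    // A trailing odd byte is ignored by `chunks_exact`.
    for pair in data.chunks_exact(2) {
        let unit = u16::from_le_bytes([pair[0], pair[1]]);
        if (0x20..0x7F).contains(&unit) {
            // Safe narrowing: the range check guarantees a printable ASCII value.
            current.push(unit as u8 as char);
        } else if !current.is_empty() && current.len() >= min_len {
            // Run ended and is long enough: keep it and start a fresh buffer.
            out.push(std::mem::take(&mut current));
        } else {
            // Run ended but is too short: treat it as noise.
            current.clear();
        }
    }

    // Flush a run that reaches the end of the buffer.
    if !current.is_empty() && current.len() >= min_len {
        out.push(current);
    }
    out
}
```

The minimum length is the main tuning knob: a small value yields more noise from binary sections, while a large value may drop short but meaningful strings such as author initials.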

Note

Modern Office files (DOCX, XLSX, PPTX) are ZIP-based and covered in Phase 2.

Related

Project: #76
