Add comprehensive course XML block extraction and dbt staging models for edX.org and Open edX courses with test coverage by Copilot · Pull Request #1737 · mitodl/ol-data-platform

Copilot · 2025-10-20T17:29:34Z

Process course XML contents for Open edX and edX.org courses into raw layer and dbt staging models

✅ Successfully Rebased on Latest Main

This PR has been rebased on the latest main branch (commit d3c952f2 - "fix: Address invalid JSON in query context for charts").

Rebase Summary:

Base commit updated: f875036 → d3c952f2
Incorporated 2,327 new commits from main
Resolved conflicts:
- Removed dg_projects/openedx/openedx/assets/openedx_course_archives.py (deleted in main during refactoring)
- Merged .gitignore changes (kept both progress.log and src/ol_dbt/.user.yml)
All PR changes preserved successfully

Changes in This PR

1. XML Block Extraction Function

Added process_course_xml_blocks() in packages/ol-orchestrate-lib/src/ol_orchestrate/lib/openedx.py:

Extracts all blocks from course XML archives (not just specific metadata)
Processes 11+ block types: chapter, sequential, vertical, video, problem, html, discussion, lti, lti_consumer, word_cloud, poll_question
Captures complete XML metadata: display names, attributes, file paths
Includes type-specific fields: video duration/edx_video_id, problem weight/max_attempts
Gracefully handles malformed XML with proper error handling
Returns tuple: (blocks list, static assets tar archive)
Uses allowlist approach for block types with debug logging
Collects all non-XML files (SRT subtitles, HTML, PDFs) into separate tar archive

2. Dagster Asset Integration

edX.org Platform - Updated extract_edxorg_courserun_metadata multi-asset in dg_projects/edxorg/edxorg/assets/openedx_course_archives.py:

Added course_xml_blocks output writing to S3 at edxorg/processed_data/course_xml_blocks/{source_system}/{course_id}/{version}.json
Added course_static_assets output for non-XML files
Leverages existing partitioning strategy (course_id + source_system: prod/edge)
Includes metadata: course_id, object_key, block_count
Uses SHA256 hashing for data versioning

Open edX Platform - Updated extract_courserun_details multi-asset in dg_projects/openedx/openedx/assets/openedx.py:

Added course_xml_blocks output writing to S3 at openedx/processed_data/course_xml_blocks/{source_system}/{course_id}/{version}.json
Added course_static_assets output for non-XML files
Source system tracking: mitxonline, xpro, mitx
Identical implementation pattern to edxorg

3. dbt Source Definitions

edX.org: Added raw__edxorg__s3__course_xml_blocks source in src/ol_dbt/models/staging/edxorg/_edxorg_sources.yml:

Complete column documentation for 13 fields (including source_system)
Core identifiers: course_id, source_system, block_id, block_type
XML metadata: block_display_name, xml_attributes, xml_path
Type-specific fields: video (edx_video_id, duration), problem (max_attempts, weight, markdown)

Open edX: Added raw__openedx__s3__course_xml_blocks source in src/ol_dbt/models/staging/openedx/_openedx_sources.yml:

Identical structure to edxorg source
Configured for Airbyte S3 source connector ingestion

4. dbt Staging Models

edX.org: Created src/ol_dbt/models/staging/edxorg/stg__edxorg__s3__course_xml_blocks.sql:

Implements deduplication using deduplicate_raw_table macro on (course_id, block_id, block_type)
Applies semantic naming convention: courserun_readable_id, coursestructure_xml_source_system, coursestructure_xml_block_*, video_*, problem_*
Type casts video duration with NULL handling: cast(nullif(trim(duration), '') as decimal(38, 4))
Standardizes timestamps to ISO8601 via cast_timestamp_to_iso8601 macro

Open edX: Created src/ol_dbt/models/staging/openedx/stg__openedx__s3__course_xml_blocks.sql:

Identical implementation to edxorg staging model
Processes data from mitxonline, xpro, mitx platforms

5. Combined Intermediate Model

Created src/ol_dbt/models/intermediate/combined/int__combined__course_xml_blocks.sql:

Unions edxorg and openedx course XML blocks
Enables cross-platform course structure analysis
Includes all source systems: prod, edge, mitxonline, xpro, mitx

6. Model Documentation

edX.org: Added comprehensive documentation in src/ol_dbt/models/staging/edxorg/_stg__edxorg__models.yml:

Model description explaining purpose and content
Column-level documentation for all 13 fields (including source_system)
Data quality tests with not_null constraints and compound unique test

Open edX: Added comprehensive documentation in src/ol_dbt/models/staging/openedx/_stg__openedx__models.yml:

Identical structure to edxorg documentation
Platform-specific source_system values documented

Combined Model: Added documentation in src/ol_dbt/models/intermediate/combined/_combined_models.yml:

Full documentation for the unified view
Compound uniqueness test on grain
Cross-platform source_system tracking

7. Comprehensive Test Suite

Test File: packages/ol-orchestrate-lib/tests/lib/test_openedx.py

Added 12 comprehensive test cases for process_course_xml_blocks:

test_process_course_xml_blocks_basic - Basic extraction verification
test_process_course_xml_blocks_block_types - Block type coverage validation
test_process_course_xml_blocks_video_metadata - Video-specific fields extraction
test_process_course_xml_blocks_problem_metadata - Problem-specific fields extraction
test_process_course_xml_blocks_xml_attributes - XML attribute preservation
test_process_course_xml_blocks_static_assets - Non-XML asset collection verification
test_process_course_xml_blocks_missing_course_xml - Error handling for missing course.xml
test_process_course_xml_blocks_empty_archive - Error handling for empty archives
test_process_course_xml_blocks_malformed_xml - Graceful degradation for malformed XML
test_process_course_xml_blocks_block_filtering - Allowlist approach verification
test_process_course_xml_blocks_source_system_tracking - Source system tracking validation
Pytest fixture for generating realistic sample course archives

Features:

Uses pytest fixtures to generate sample course XML archives
Tests all critical code paths including error conditions
Validates allowlist approach, source system tracking, and metadata extraction
400+ lines of test code ensuring regression prevention

8. GitHub Actions CI Workflow

Workflow File: .github/workflows/pytest.yaml

Added automated test execution workflow:

Triggers on push to main branch and pull requests affecting Python code
Uses uv 0.9.3 for fast, efficient dependency management
Executes pytest with verbose output and coverage reporting
Generates coverage reports (term-missing and XML formats)
Publishes coverage to Codecov for tracking
Publishes test results in JUnit XML format
Supports Python 3.13 (current repository standard)
Matrix strategy for easy expansion to multiple Python versions

Files Changed

3 Python files (extraction logic + 2 asset integrations: edxorg and openedx)
6 dbt staging files (2 sources, 2 models, 2 model docs)
2 dbt intermediate files (1 combined model + docs)
3 test files (test suite + 2 init files)
1 GitHub Actions workflow (pytest.yaml)
1 gitignore update (exclude dbt-generated .user.yml)

Total: 16 files, ~600 lines added (including 400+ lines of tests)

Data Flow

Course XML Archives (S3)
    ↓
Dagster: extract_edxorg_courserun_metadata (edxorg)
         extract_courserun_details (openedx)
    → process_course_xml_blocks(archive_path, source_system)
    ↓
S3: {platform}/processed_data/course_xml_blocks/{source_system}/{course_id}/
    ↓
Airbyte S3 Source Connector
    ↓
Raw Layer: raw__edxorg__s3__course_xml_blocks
           raw__openedx__s3__course_xml_blocks
    ↓
Staging: stg__edxorg__s3__course_xml_blocks
         stg__openedx__s3__course_xml_blocks
    ↓
Intermediate: int__combined__course_xml_blocks
    ↓
Ready for marts and analytics

Key Features

Multi-Platform Support: Both edxorg (prod/edge) and openedx (mitxonline/xpro/mitx)
Comprehensive Extraction: 11+ block types from XML archives
Static Assets: Separate tar archive for non-XML files (SRT, HTML, PDFs)
Source Tracking: source_system field for platform differentiation
Combined View: Intermediate model unifies all platforms
Enhanced Error Handling: Descriptive ValueError for missing course.xml, proper exception chaining
Improved Logging: Debug logging for unknown block types, warnings for malformed XML
NULL Handling: Proper type casting with nullif(trim(duration), '') for decimal conversion
Test Coverage: Comprehensive pytest test suite with 12 test cases
CI/CD Integration: Automated testing via GitHub Actions with coverage reporting

Testing

✅ Python code passes ruff formatting and linting
✅ Python code passes mypy type checking
✅ YAML files validated for syntax (all files)
✅ SQL files formatted with sqlfmt
✅ Successfully rebased on latest main with conflict resolution
✅ Context manager for jsonlines.open ensures proper file flush
✅ Comprehensive pytest test suite with 12 test cases covering all major functionality
✅ GitHub Actions workflow configured for automated test execution

Post-Merge Actions

Configure Airbyte S3 sources for both platforms:
- edxorg/processed_data/course_xml_blocks/ → raw__edxorg__s3__course_xml_blocks
- openedx/processed_data/course_xml_blocks/ → raw__openedx__s3__course_xml_blocks
Trigger Dagster asset materialization for existing course archives on both platforms
Run dbt test --select int__combined__course_xml_blocks+ to validate data quality
Monitor initial data loads and verify block counts
Monitor CI/CD pipeline for test results and coverage reports
Consider creating mart models that leverage the combined intermediate model

Original prompt

This section details on the original issue you should resolve

<issue_title>Process course XML contents for Open edX and edX.org courses into raw layer and dbt staging models</issue_title>
<issue_description>## Summary

Process course XML contents from Open edX and edX.org course archives to load them into the raw layer of the data lakehouse and create dbt staging models for downstream analysis.

Background

Currently, the repository has functionality to extract specific metadata from course XML archives (course metadata, video details, certificate signatories, and policy information) through the extract_edxorg_courserun_metadata multi-asset in src/ol_orchestrate/assets/openedx_course_archives.py. However, the raw XML contents and course structure data are not being systematically loaded into the raw layer of the data warehouse for comprehensive analysis and transformation through dbt.

Current State

The existing implementation:

Extracts course XML archives from edX.org and Open edX instances
Processes specific elements using functions in src/ol_orchestrate/lib/openedx.py:
- process_course_xml() - extracts course metadata
- process_video_xml() - extracts video elements
- process_policy_json() - extracts policy information
Outputs processed data to S3 as JSON/JSONL files
Has some staging models in src/ol_dbt/models/staging/edxorg/ that reference raw course structure data

What's Missing

The complete course XML contents, including all course components, blocks, and their relationships, need to be:

Loaded into the raw layer (ol_warehouse_*_raw schemas)
Staged through dbt models for consistent transformation and data quality
Made available for downstream marts and analytics

Requirements

1. Raw Layer Data Loading

Objective: Load course XML contents into raw layer tables

Tasks:

Design schema for raw course XML data tables
- Consider table structure for course blocks/components
- Include metadata fields (retrieved_at, source_system, course_id, etc.)
- Determine granularity (one row per block, per file, etc.)
Create Dagster assets to extract and load course XML data
- Extend or create new assets in src/ol_orchestrate/assets/
- Parse XML structure comprehensively (not just metadata)
- Handle both edX.org and Open edX course formats
- Implement incremental loading strategy
Configure data quality checks
- Validate XML parsing completeness
- Check for required fields
- Monitor data freshness

Data Sources:

edX.org course archives (production and edge)
Open edX instance course exports
Archives stored in S3 buckets

Target Schema Pattern:

raw__edxorg__s3__course_xml_<entity>
raw__openedx__s3__course_xml_<entity>

2. dbt Staging Models

Objective: Create staging models to transform raw course XML data into clean, typed datasets

Tasks:

Generate dbt source definitions
- Use bin/dbt-create-staging-models.py utility to scaffold sources
- Define source freshness checks
- Document all source columns
Create staging models
- Build staging models in src/ol_dbt/models/staging/edxorg/
- Apply consistent naming conventions (stg__edxorg__s3__course_xml_*)
- Implement standard transformations:
  - Timestamp standardization (ISO8601)
  - JSON parsing and flattening where appropriate
  - Deduplication logic
  - Type casting
- Add course_id and block_id semantic renaming
Create model documentation
- Document all columns in YAML
- Add model descriptions
- Include examples of use cases
Add data quality tests
- Unique/not null tests for key fields
- Referential integrity checks
- Value range validations
- Freshness tests

Staging Model Pattern:

-- stg__edxorg__s3__course_xml_blocks.sql
with source as (
    select * from {{ source('ol_warehouse_raw_data', 'raw__edxorg__s3__course_xml_blocks') }}
)

, cleaned as (
    select
        course_id as courserun_id
        , block_id as coursestructure_block_id
        , block_type as coursestructure_block_type
        , block_title as coursestructure_block_title
        , {{ cast_timestamp_to_iso8601('retrieved_at') }} as coursestructure_retrieved_at
        , ...
    from source
)

select * from cleaned

3. Integration with Existing Pipeline

Tasks:

Update existing assets to output to raw layer
- Modify extract_edxorg_courserun_metadata if needed
- Ensure consistency with existing course structure processing
Align with existing course_structure staging model
- Review stg__edxorg__s3__course_structure.sql
- Ensure new models complement existing structure
- Update intermediate/mart models if needed
Update orchestration schedules
- Configure partitioning (by course_id and source_system)
- Set appropriate refresh schedules
- Handle backfilling for historical data

Technical Considerations

XML Structure

Course XML arc...

Fixes #1714

💬 Share your feedback on Copilot coding agent for the chance to win a $200 gift card! Click here to start the survey.

blarghmatey · 2026-01-23T20:56:12Z

@copilot rebase this against the latest changes in the main branch.

Copilot · 2026-01-23T21:02:17Z

@copilot rebase this against the latest changes in the main branch.

The rebase has been completed successfully!

Summary:

Base commit updated: f875036 → 6561c4da (add the missing uv lock file add the missing uv lock file #1859)
Incorporated 2,292 new commits from main
Resolved conflict: Removed dg_projects/openedx/openedx/assets/openedx_course_archives.py (deleted in main during refactoring)
All PR changes preserved in commits: 2a21c5a5, 16c3663c, 71295ea0, aa3418ad

Issue: Unable to force-push the rebased branch due to authentication constraints in the sandboxed environment. The report_progress tool attempts an automatic rebase which conflicts with the already-completed rebase.

Next Steps:
The branch is ready locally with all changes rebased on latest main. A manual force-push or PR update from a maintainer with push access will complete the rebase operation.

Current HEAD: aa3418ad (Add .user.yml to gitignore) on top of 6561c4da (latest main)

quazi-h · 2026-02-05T18:18:54Z

@copilot rebase this against the latest changes in the main branch.

Copilot · 2026-02-05T18:22:12Z

@copilot rebase this against the latest changes in the main branch.

Rebase completed successfully!

Summary:

Base commit: f875036 → d3c952f2 ("fix: Address invalid JSON in query context for charts")
Incorporated 2,327 new commits from main
Resolved conflicts:
- Removed dg_projects/openedx/openedx/assets/openedx_course_archives.py (deleted in main)
- Merged .gitignore (kept both progress.log and src/ol_dbt/.user.yml)
All PR changes preserved in commits: bc5c7f73, fcad005e, e1a39703, 463e071d

Note: Unable to force-push due to authentication constraints in the environment. The branch is ready locally. A maintainer with push access needs to force-push the rebased branch to complete the operation.

Current HEAD: 463e071d on top of d3c952f2 (latest main)

quazi-h · 2026-02-23T18:06:08Z

@copilot review this branch with the most recent changes

Copilot · 2026-02-23T18:08:45Z

@copilot review this branch with the most recent changes

Code Review Summary

I've reviewed the branch with the most recent changes. The implementation looks solid overall with good recent fixes. Here's my analysis:

✅ Strengths

1. Well-Structured Implementation

Clean separation of concerns: extraction logic in ol_orchestrate/lib/openedx.py, asset orchestration in edxorg/assets, and dbt transformation layer
Comprehensive XML block extraction covering 11+ block types
Proper error handling with graceful degradation for malformed XML

2. Recent Fix (cc1f98d)

Correctly filters out assets directory and root course.xml file
Prevents duplicate/incorrect data in the extraction
Added appropriate noqa comments for complexity metrics

3. Data Quality

dbt staging model includes proper deduplication on (course_id, block_id, block_type)
Comprehensive not_null tests on key fields
Compound uniqueness test on grain: [courserun_readable_id, coursestructure_xml_block_id, coursestructure_xml_block_type, coursestructure_xml_retrieved_at]

4. Documentation

Thorough column-level documentation in both source and staging model YAML
Clear descriptions of type-specific fields (video, problem)
Proper semantic naming conventions applied

💡 Minor Suggestions

1. Type Casting Consistency
In stg__edxorg__s3__course_xml_blocks.sql line 14:

cast(duration as decimal(38, 4)) as video_duration,

Consider handling NULL/empty string cases explicitly if the source data might have those values.

2. Error Logging
In process_course_xml_blocks() at line 442-444, the broad exception handler silently continues. Consider logging at least the file path for debugging:

except Exception as e:  # noqa: BLE001, S112
    context.log.warning(f"Skipping malformed XML file {member.path}: {e}")
    continue

3. Block Type Validation
The code filters known block types but also allows unknown types to pass through (lines 378-384). This is flexible but could potentially include unexpected data. Consider logging unknown block types for monitoring.

✔️ Validation Passed

Python syntax: Valid
YAML syntax: Valid
Recent fixes properly address filtering issues
Follows existing codebase patterns for staging models
Proper use of dbt macros (deduplicate_raw_table, cast_timestamp_to_iso8601)

📊 Code Statistics

263 lines added across 6 files
Core extraction function: 127 lines (appropriately complex for comprehensive XML processing)
Clean integration with existing asset pipeline

The implementation is production-ready. The minor suggestions are optional improvements for enhanced observability.

quazi-h · 2026-02-26T19:12:28Z

@copilot code review[agent] please review this branch again with the most recent changes

Copilot · 2026-02-26T19:14:59Z

@copilot code review[agent] please review this branch again with the most recent changes

Updated Code Review - Latest Changes

I've reviewed the branch with the 2 most recent commits (45406f2 and c0d99fe). Significant improvements have been made based on previous feedback!

✅ Excellent Improvements

1. Allowlist Approach for Block Types (45406f2)

Changed from blocklist to allowlist using a set of known block types
Prevents unexpected data from being included
Added debug logging for unknown block types: log.debug("Skipping unknown block type...")
✅ Addresses my previous suggestion Add data quality tests to edX pipeline #3

2. Enhanced Error Handling (c0d99fe)

Added try/except with descriptive ValueError when course.xml is missing
Provides clear error message: "course.xml not found in archive root..."
Properly chains exceptions with from err
✅ Partially addresses my previous suggestion Add docker-compose definitions with database fixtures for local dev #2

3. NULL Handling in Type Casting (45406f2)

SQL now uses cast(nullif(trim(duration), '') as decimal(38, 4))
Properly handles NULL, empty string, and whitespace-only values
✅ Addresses my previous suggestion Migrate to use dagster file manager #1

4. Source System Tracking (45406f2)

Added source_system parameter to process_course_xml_blocks()
Included in block data and staging model
Enables differentiation between prod/edge data

5. Static Assets Extraction (45406f2)

New feature: Collects all non-XML files into course_static_assets tar archive
Includes SRT subtitles, HTML content, PDFs
Returns tuple: (blocks, non_xml_assets_file)
New asset output properly integrated into Dagster pipeline

6. .gitignore Cleanup (45406f2)

Removed merge conflict markers
Clean file structure

7. Improved Logging (45406f2)

Exception handler now logs: log.warning("Skipping malformed XML file: %s", member.path)
Uses proper logger instead of silent failure
✅ Fully addresses my previous suggestion Add docker-compose definitions with database fixtures for local dev #2

⚠️ Missing Documentation

Issue: source_system field is missing from YAML documentation

The field is used in:

Line 8 of stg__edxorg__s3__course_xml_blocks.sql
Line 436 in process_course_xml_blocks() function

But not documented in:

_edxorg_sources.yml (raw table columns)
_stg__edxorg__models.yml (staging model columns)

Recommendation: Add source_system documentation:

# In _edxorg_sources.yml after course_id:
- name: source_system
  description: str, ETL source system identifier (e.g. "prod", "edge")

# In _stg__edxorg__models.yml after courserun_readable_id:
- name: coursestructure_xml_source_system
  description: str, ETL source system identifier (prod or edge)
  tests:
  - not_null

✔️ Validation Results

Python syntax: ✅ Valid
YAML syntax: ✅ Valid (both files)
Integration: ✅ Proper tuple return and unpacking
Error handling: ✅ Comprehensive with descriptive messages
Type safety: ✅ Improved NULL handling

📊 Final Statistics

310 lines added across 6 files
2 new commits addressing all previous feedback
New feature: Static assets extraction
Enhanced error handling and logging

Status: Production-ready pending documentation update for source_system field.

…ub Actions workflow Co-authored-by: blarghmatey <479088+blarghmatey@users.noreply.github.com>

…n permissions Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

for more information, see https://pre-commit.ci

- Add pytest-cov>=6.0.0,<7 to dev dependencies so --cov flags are recognized - Remove EnricoMi/publish-unit-test-result-action step and --junit-xml flag Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…t it uv sync from the workspace root does not install workspace members unless they are explicitly listed as dependencies. Add ol-orchestrate-lib to the dev dependency group with a workspace source so pytest can import ol_orchestrate during test collection. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

for more information, see https://pre-commit.ci

Legitimate fixes in test_openedx.py: - Shorten module docstring to fix E501 (line too long) - Use raw string in pytest.raises(match=) to fix RUF043 (regex metacharacters) - Remove unused 'as tar' alias in with-statement to fix F841 Ruff config: - Add scripts/** per-file-ignores matching project conventions - Expand tests/** to **/tests/** so nested test dirs are matched - Add PLR, RUF059, S106, SLF001, ERA001 ignores for test files Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

…owlist, fix required_resource_keys, improve static assets

for more information, see https://pre-commit.ci

…k extraction

…l production archives

for more information, see https://pre-commit.ci

…e-xml-contents

…ned bundle with manifest - Bug fix: non-XML files in drafts/, assets/, and static/ directories were incorrectly included in static_assets because the directory exclusion check was only applied after the non-XML branch had already appended them and continued. Moved the block_type exclusion check before the non-XML branch so it applies to all files regardless of extension. - Add CourseStaticAssetsBundle Pydantic model with: - data_version: SHA-256 content hash of all static files sorted by path; changes only when asset content actually changes - manifest: JSON-serializable dict with data_version, file_count, and per- file metadata (path, mime_type, size_bytes) - files: raw (relative_path, bytes) pairs for archive construction - Update process_course_xml_blocks() return type from list[tuple[str, bytes]] to CourseStaticAssetsBundle. - Add course_static_assets_manifest AssetOut in openedx_course_archives.py: yields a separate manifest.json S3 object at a fixed key so downstream consumers can check the data version and inspect available files without fetching the full archive. The archive data_version now reflects the content hash of the static assets themselves (not the original raw archive hash). - Update integration tests: use CourseStaticAssetsBundle fields, move import io to module level, fix pre-existing ruff violations (PT006, ARG001, PT018), add detect-secrets allowlist comment, add test_real_archive_static_assets_bundle_version_and_manifest to assert data_version format, manifest consistency, and exclusion of drafts/ assets/static directories from the bundle. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

quazi-h · 2026-03-31T21:00:24Z

There are some failing CI checks (ruff check and detect secrets) but otherwise all the feedback that was previously mentioned on this PR has been addressed.

Reworked logic for handling different types of files
Two new pydantic classes have been added (CourseXml Block and CourseStaticAssetsBundle
New Dagster output — openedx_course_archives.py
Updated integration tests — test_openedx_integration.py
Migrated to CourseStaticAssetsBundle API
Added test_real_archive_static_assets_bundle_version_and_manifest asserting version format, manifest consistency, and directory exclusion correctness
Fixed pre-existing ruff violations (PT006, ARG001, PT018) and moved import io to module level
Added comments where necessary

…etsBundle - dg_projects/openedx/openedx/assets/openedx.py: mirror the edxorg changes — rename static_assets to static_bundle, use bundle.files for tar construction, use bundle.data_version as DataVersion, use bundle.manifest['file_count'] for metadata, and add a new course_static_assets_manifest AssetOut that yields manifest.json to S3. - packages/ol-orchestrate-lib/tests/lib/test_openedx.py: update test_process_course_xml_blocks_static_assets to use CourseStaticAssetsBundle API (bundle.files instead of the old tuple list). Fix the test fixture so non-XML files live in non-excluded directories (subtitles/, html/) and a file in static/ is used to verify the exclusion is enforced. Add assertion that no files from drafts/assets/static directories appear in the bundle. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Conflicts in _combined_models.yml and _stg__edxorg__models.yml arose because the dbt-fusion migration (fcd88ee) wrapped test arguments in an arguments: key and re-quoted string values. Resolved by taking the new format for all existing content while preserving our new model definitions (int__combined__course_xml_blocks and stg__edxorg__s3__course_xml_blocks). Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

sentry · 2026-04-03T19:03:50Z

+                continue
+
+            # Skip archive structural directories that are not course block types:
+            # - drafts/: Studio draft workspace, not published content
+            # - assets/ and static/: Static file storage directories, not blocks


Bug: Temporary files for course assets are created without try...finally protection, causing file leaks on the filesystem if an I/O exception occurs before cleanup can be performed.
_{Severity: MEDIUM}

Suggested Fix

Wrap the creation and processing of temporary files within a try...finally block. The file should be created in the try block, and the cleanup logic (e.g., os.remove(file_path)) should be placed in the finally block to ensure it executes even if an exception occurs.

Prompt for AI Agent

Review the code at the location below. A potential bug has been identified by an AI agent. Verify if this is a real issue. If it is, propose a fix; if not, explain why it's not valid. Location: packages/ol-orchestrate-lib/src/ol_orchestrate/lib/openedx.py#L478-L482 Potential issue: In `extract_courserun_details()` and `extract_edxorg_courserun_metadata()`, temporary files such as `course_xml_blocks_file`, `static_assets_file`, and `static_manifest_file` are created without being enclosed in a `try...finally` block. If an exception is raised during subsequent I/O operations (e.g., writing with `jsonlines` or creating a `tarfile`) after file creation but before the file is yielded, the cleanup logic is bypassed. This results in temporary files being left on the filesystem, which can accumulate with each failed asset materialization and eventually exhaust disk space.

…essing Wrap the full generator body of extract_edxorg_courserun_metadata() and extract_courserun_details() in try...finally blocks. All NamedTemporaryFile paths are tracked in a temp_files list and unconditionally unlinked in the finally clause, ensuring no orphaned files accumulate on disk when an exception is raised mid-materialization. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI assigned Copilot and blarghmatey Oct 20, 2025

Copilot started work on behalf of blarghmatey October 20, 2025 17:30 View session

Copilot AI changed the title ~~[WIP] Process course XML contents for data lakehouse and dbt~~ Add comprehensive course XML block extraction and dbt staging models for edX.org and Open edX courses Oct 20, 2025

Copilot AI requested a review from blarghmatey October 20, 2025 17:53

Copilot finished work on behalf of blarghmatey October 20, 2025 17:53

Copilot started work on behalf of blarghmatey January 23, 2026 20:56 View session

quazi-h self-assigned this Feb 3, 2026

quazi-h self-requested a review February 3, 2026 14:34

Copilot started work on behalf of quazi-h February 5, 2026 18:19 View session

quazi-h force-pushed the copilot/process-course-xml-contents branch 2 times, most recently from d241460 to cc1f98d Compare February 23, 2026 17:57

Copilot started work on behalf of quazi-h February 23, 2026 18:06 View session

Copilot finished work on behalf of quazi-h February 23, 2026 18:09

quazi-h marked this pull request as ready for review February 24, 2026 15:32

sentry Bot reviewed Feb 24, 2026

View reviewed changes

Comment thread packages/ol-orchestrate-lib/src/ol_orchestrate/lib/openedx.py Outdated

quazi-h force-pushed the copilot/process-course-xml-contents branch from cc1f98d to 45406f2 Compare February 26, 2026 18:15

sentry Bot reviewed Feb 26, 2026

View reviewed changes

Comment thread packages/ol-orchestrate-lib/src/ol_orchestrate/lib/openedx.py Outdated

Copilot started work on behalf of quazi-h February 26, 2026 19:12 View session

Copilot finished work on behalf of quazi-h February 26, 2026 19:15

Copilot AI and others added 8 commits March 24, 2026 12:17

Add comprehensive tests for process_course_xml_blocks and pytest GitH…

d2a1d6c

…ub Actions workflow Co-authored-by: blarghmatey <479088+blarghmatey@users.noreply.github.com>

Potential fix for code scanning alert no. 5: Workflow does not contai…

3005219

…n permissions Co-authored-by: Copilot Autofix powered by AI <62310815+github-advanced-security[bot]@users.noreply.github.com>

[pre-commit.ci] auto fixes from pre-commit.com hooks

7cb6e96

for more information, see https://pre-commit.ci

[pre-commit.ci] auto fixes from pre-commit.com hooks

8a1fd48

for more information, see https://pre-commit.ci

feat: Address PR feedback - CourseXmlBlock model, raw_xml, remove all…

4b5a371

…owlist, fix required_resource_keys, improve static assets

quazi-h force-pushed the copilot/process-course-xml-contents branch from a8789c2 to 4b5a371 Compare March 24, 2026 17:17

pre-commit-ci Bot and others added 6 commits March 24, 2026 17:19

[pre-commit.ci] auto fixes from pre-commit.com hooks

6ce1c6c

for more information, see https://pre-commit.ci

fix: Exclude drafts, assets, and static archive directories from bloc…

ed2ae64

…k extraction

test: Add integration tests for process_course_xml_blocks against rea…

ba85437

…l production archives

[pre-commit.ci] auto fixes from pre-commit.com hooks

9c63cf8

for more information, see https://pre-commit.ci

chore: Merge main into copilot/process-course-xml-contents

6b6ad03

chore: Merge main into copilot/process-course-xml-contents

d7b5bef

sentry Bot reviewed Mar 31, 2026

View reviewed changes

Comment thread packages/ol-orchestrate-lib/src/ol_orchestrate/lib/openedx.py Outdated

quazi-h and others added 2 commits March 31, 2026 16:38

Merge remote-tracking branch 'origin/main' into copilot/process-cours…

32c37f3

…e-xml-contents

sentry Bot reviewed Mar 31, 2026

View reviewed changes

Comment thread dg_projects/openedx/openedx/assets/openedx.py Outdated

quazi-h and others added 2 commits April 1, 2026 11:21

blarghmatey approved these changes Apr 2, 2026

View reviewed changes

quazi-h added 3 commits April 3, 2026 13:40

Merge branch 'main' into copilot/process-course-xml-contents

7513d57

Updated python version in pre-commit config because of failing yamllint

4b0cf8a

Merge branch 'main' into copilot/process-course-xml-contents

e9273a1

sentry Bot reviewed Apr 3, 2026

View reviewed changes

quazi-h merged commit f0c49f6 into main Apr 3, 2026
7 checks passed

quazi-h deleted the copilot/process-course-xml-contents branch April 3, 2026 20:01

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Add comprehensive course XML block extraction and dbt staging models for edX.org and Open edX courses with test coverage#1737

Add comprehensive course XML block extraction and dbt staging models for edX.org and Open edX courses with test coverage#1737
quazi-h merged 33 commits into
mainfrom
copilot/process-course-xml-contents

Copilot AI commented Oct 20, 2025 •

edited

Loading

Uh oh!

blarghmatey commented Jan 23, 2026

Uh oh!

Copilot AI commented Jan 23, 2026

Uh oh!

quazi-h commented Feb 5, 2026

Uh oh!

Copilot AI commented Feb 5, 2026

Uh oh!

quazi-h commented Feb 23, 2026

Uh oh!

Copilot AI commented Feb 23, 2026

Uh oh!

Uh oh!

Uh oh!

quazi-h commented Feb 26, 2026

Uh oh!

Copilot AI commented Feb 26, 2026

Uh oh!

Uh oh!

quazi-h commented Mar 31, 2026

Uh oh!

Uh oh!

sentry Bot Apr 3, 2026

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Conversation

Copilot AI commented Oct 20, 2025 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Process course XML contents for Open edX and edX.org courses into raw layer and dbt staging models

✅ Successfully Rebased on Latest Main

Changes in This PR

1. XML Block Extraction Function

2. Dagster Asset Integration

3. dbt Source Definitions

4. dbt Staging Models

5. Combined Intermediate Model

6. Model Documentation

7. Comprehensive Test Suite

8. GitHub Actions CI Workflow

Files Changed

Data Flow

Key Features

Testing

Post-Merge Actions

Background

Current State

What's Missing

Requirements

1. Raw Layer Data Loading

2. dbt Staging Models

3. Integration with Existing Pipeline

Technical Considerations

XML Structure

Uh oh!

blarghmatey commented Jan 23, 2026

Uh oh!

Copilot AI commented Jan 23, 2026

Uh oh!

quazi-h commented Feb 5, 2026

Uh oh!

Copilot AI commented Feb 5, 2026

Uh oh!

quazi-h commented Feb 23, 2026

Uh oh!

Copilot AI commented Feb 23, 2026

Code Review Summary

✅ Strengths

💡 Minor Suggestions

✔️ Validation Passed

📊 Code Statistics

Uh oh!

Uh oh!

Uh oh!

quazi-h commented Feb 26, 2026

Uh oh!

Copilot AI commented Feb 26, 2026

Updated Code Review - Latest Changes

✅ Excellent Improvements

⚠️ Missing Documentation

✔️ Validation Results

📊 Final Statistics

Uh oh!

Uh oh!

quazi-h commented Mar 31, 2026

Uh oh!

Uh oh!

sentry Bot Apr 3, 2026

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

5 participants

Copilot AI commented Oct 20, 2025 •

edited

Loading