Expand coverage of dimensional schemas #1868
Conversation
Pull request overview
This PR expands the dimensional modeling coverage by adding 11 dimension tables, 4 transaction fact tables, 4 bridge tables, and 2 utility macros following Kimball best practices. The changes adapt the data warehouse to support comprehensive business intelligence reporting across MIT Open Learning platforms.
Changes:
- Added transaction fact tables for enrollments, orders, payments, and certificates with incremental loading
- Created dimension tables for courses, course runs, programs, products, instructors, topics, departments, dates, and reference data
- Implemented bridge tables to handle many-to-many relationships between courses and instructors/topics/departments, and programs and courses
- Added utility macros for ISO8601 date parsing and resource type inference
- Fixed Python version requirement from 3.14 (non-existent) to 3.13
Reviewed changes
Copilot reviewed 31 out of 32 changed files in this pull request and generated 29 comments.
| File | Description |
|---|---|
| tfact_payment.sql | Transaction fact for payment events across platforms with incremental loading |
| tfact_order.sql | Transaction fact for e-commerce orders with incremental loading |
| tfact_enrollment.sql | Transaction fact for course/program enrollments from all platforms |
| tfact_certificate.sql | Transaction fact for certificate issuance events |
| dim_topic.sql | Hierarchical topic taxonomy dimension consolidated from all platforms |
| dim_program.sql | Program catalog dimension with SCD Type 2 tracking |
| dim_product.sql | Product catalog dimension with SCD Type 2 for price changes |
| dim_payment_method.sql | Static reference dimension for payment methods |
| dim_instructor.sql | Instructor dimension consolidated from all platforms |
| dim_discount_type.sql | Static reference dimension for discount types |
| dim_department.sql | MIT department dimension |
| dim_date.sql | Standard date dimension with MIT fiscal and academic calendars |
| dim_course_run.sql | Specific course offering dimension with SCD Type 2 |
| dim_course.sql | Master course catalog dimension with SCD Type 2 |
| dim_certificate_type.sql | Static reference dimension for certificate types |
| bridge_program_course.sql | Many-to-many bridge between programs and courses |
| bridge_course_topic.sql | Many-to-many bridge between courses and topics |
| bridge_course_instructor.sql | Many-to-many bridge between courses and instructors |
| bridge_course_department.sql | Many-to-many bridge between courses and departments |
| _reference_dimensions.yml | Schema documentation for static reference dimensions |
| _phase3_content_catalog.yml | Schema documentation for instructor, topic, and department dimensions |
| _fact_tables.yml | Schema documentation and tests for transaction fact tables |
| _dim_program.yml | Schema documentation and tests for program dimension |
| _dim_product.yml | Schema documentation and tests for product dimension |
| _dim_date.yml | Schema documentation for date dimension |
| _dim_course_run.yml | Schema documentation and tests for course run dimension |
| _dim_course.yml | Schema documentation and tests for course dimension |
| _bridge_program_course.yml | Schema documentation and tests for program-course bridge |
| safe_parse_iso8601_date.sql | Utility macro for safely parsing ISO8601 date strings |
| infer_resource_type.sql | Utility macro for inferring resource type from URL |
| pyproject.toml | Fixed Python version requirement from 3.14 to 3.13 |
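The safe_parse_iso8601_date macro is only named in the table above; its exact SQL is not shown in this thread. A minimal Python analog of the behavior its name implies (assumed: return NULL rather than raising on malformed or missing input) is:

```python
from datetime import date

def safe_parse_iso8601_date(value):
    """Illustrative Python analog of the SQL macro (assumed behavior:
    yield None instead of raising on malformed or missing input)."""
    if value is None:
        return None
    try:
        # accept a plain ISO-8601 calendar date, ignoring any time suffix
        return date.fromisoformat(str(value)[:10])
    except ValueError:
        return None
```

The macro's actual signature and null handling may differ; this only shows the "never raise" contract suggested by its name.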
- Remove tests for Phase 3+ dimensions (dim_platform, dim_user) that don't exist yet
- Fix incremental logic by adding timestamp columns to final SELECT statements
- Add platform checks to joins to prevent incorrect FK matches across platforms
- Fix Trino SQL syntax: change FORMAT_DATETIME to date_format, CONCAT to ||
- Fix fiscal_quarter calculation logic in dim_date
- Fix column name typo in dim_program (certification_type -> program_certification_type)
- Use synthetic enrollment_id for edxorg to avoid null key collisions
- Replace unique tests using null platform_fk with platform string field
- Add platform column to dimension outputs
- Update comments for clarity
Pull request overview
Copilot reviewed 31 out of 32 changed files in this pull request and generated 9 comments.
The organization (a key business concept) dimension appears to be missing. Should we add it?
Force-pushed from a41f83c to 646a8e9
Pull request overview
Copilot reviewed 31 out of 31 changed files in this pull request and generated 19 comments.
- fix(macro): fix format_datetime token replacement order (MMMM/MMM before MM)
- fix(macro): make generate_micromasters_program_readable_id cross-db compatible with DuckDB dispatch using list_transform/string_split/list_aggregate
- fix(dim_program): add platform_code field and use it in surrogate key to prevent pk collision between mitxonline and micromasters programs
- fix(dim_program): use generate_micromasters_program_readable_id macro instead of cast(program_id as varchar)
- fix(dim_course_run): implement SCD Type 2 expiration (records_to_expire CTE)
- fix(dim_product): implement SCD Type 2 expiration (records_to_expire CTE)
- fix(bridge_program_course): add platform to source CTEs and join on it
- fix(tfact_enrollment): change user_fk type from varchar to integer
- fix(tfact_*): change incremental filter from > to >= to avoid missing same-ts rows
- fix(dim_topic): compute parent_topic_fk via self-join instead of always NULL
- test(dim_course_run,dim_product): add SCD2 uniqueness tests for is_current rows
- docs(dim_program): document platform_readable_id vs platform_code distinction

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
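The format_datetime fix hinges on replacement order: substituting MM before MMM/MMMM would corrupt the longer tokens. A hypothetical Python sketch of the principle (only the tokens named in the commit; the strftime mapping is an illustrative assumption):

```python
# Longer tokens must be replaced before their prefixes: doing 'MM' first
# would turn 'MMMM' into two month-number substitutions instead of a
# full month name.
TOKEN_MAP = [
    ("MMMM", "%B"),  # full month name
    ("MMM", "%b"),   # abbreviated month name
    ("MM", "%m"),    # zero-padded month number
]

def translate_format(fmt):
    for token, replacement in TOKEN_MAP:
        fmt = fmt.replace(token, replacement)
    return fmt
```

With the reversed order, "MMMM" would become "%m%m" rather than "%B".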
Pull request overview
Copilot reviewed 47 out of 47 changed files in this pull request and generated 32 comments.
Platform codes (readable names → lowercase codes):
- dim_course.sql, dim_course_run.sql, dim_product.sql: replace
var('mitxonline')/'mitxpro'/'edxorg' with literal codes
- bridge_course_instructor, bridge_course_topic, bridge_course_department:
same fix; joins to dim_course.primary_platform now match
- bridge_program_course: fix join from platform_readable_id → platform_code
to align with dim_course.primary_platform (lowercase code)
- tfact_enrollment, tfact_certificate, tfact_payment, tfact_order: platform
column now uses lowercase codes; CASE expressions updated accordingly
platform_fk integer → varchar:
- dim_course_run, dim_program, dim_product, tfact_enrollment,
tfact_certificate, tfact_payment, tfact_order: cast(null as integer)
→ cast(null as varchar) to match dim_platform.platform_pk type (surrogate key)
Add incremental_strategy='delete+insert':
- dim_course, dim_course_run, dim_product, tfact_enrollment,
tfact_certificate, tfact_payment, tfact_order: explicit strategy for
predictable cross-adapter incremental behavior
dim_course_run grain includes platform:
- Surrogate key inputs now include 'platform' to prevent collisions when
the same courserun_readable_id exists on multiple platforms (3 confirmed)
- not-exists incremental check now includes platform predicate
- records_to_expire join now includes platform predicate
- _dim_course_run.yml unique combination test now includes platform
Logic bugs fixed:
- tfact_order: discount_type_code CASE moved 'Fixed Price: 0%' (free)
check before trailing '%' (percentage) so free orders are correctly
classified instead of falling through to 'percentage'
- tfact_enrollment: edxorg enrollment ID changed from non-deterministic
row_number() to stable surrogate_key(user_id, courserun_readable_id)
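The discount_type_code fix above is purely about predicate order; a Python sketch of the same CASE logic (return labels and the fallback branch are illustrative):

```python
def classify_discount(code):
    # The free check must run before the generic trailing-'%' check;
    # 'Fixed Price: 0%' also ends in '%', so with the original order it
    # fell through and was misclassified as a percentage discount.
    if code == "Fixed Price: 0%":
        return "free"
    if code.endswith("%"):
        return "percentage"
    return "other"
```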
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…udentmodule_problems
The upstream model no longer has event_type (always 'problem_check') or event_json
(replaced by correct_map + answers). The combined_studentmodule CTE used select *
which no longer includes these columns.
Fix: project explicitly in combined_studentmodule, supplying:
- 'problem_check' as event_type (literal constant — was always this value)
- correct_map as event_json (per-submission correctness map is the equivalent
structured data for consumers that expect event_json from this model)
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…watermarks
- Change materialization from 'table' to 'incremental' to avoid full scans of
~2TB of tracking logs on every run
- Add per-platform watermarks (mitxonline, residential, mitxpro, edxorg) so each
incremental run only processes new events
- Add generate_surrogate_key('platform','openedx_user_id','courserun_readable_id',
'problem_block_id','attempt','event_type','event_timestamp') as event_id for
idempotent upserts
- Pre-aggregate combined_studentmodule to per-attempt grain (max_by on event_timestamp)
before the union, reducing ~57.5M rows to ~28M before the dedup window
- Full-refresh path skips watermark CTEs for backfill
- Update _dim__models.yml: add event_id column description + unique test, replace
compound column test, expand model description
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
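The idempotent-upsert property rests on the surrogate key being a pure function of the grain columns. A rough Python analog of dbt_utils.generate_surrogate_key (the real macro's separator and null sentinel differ; this only demonstrates the determinism):

```python
import hashlib

def generate_surrogate_key(*fields):
    # Null-safe, order-sensitive concatenation hashed to a stable id:
    # the same grain always yields the same event_id, so re-running an
    # incremental load upserts existing rows instead of duplicating them.
    joined = "-".join("" if f is None else str(f) for f in fields)
    return hashlib.md5(joined.encode("utf-8")).hexdigest()
```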
Resolved all 10 undocumented-column warnings found by ol-dbt validate:
- _dim_course_run.yml: add courserun_start_on, courserun_end_on, enrollment_start, enrollment_end, source_id
- _course_catalog_dimensions.yml (dim_topic): add parent_topic_name
- _dim__models.yml (dim_user): add micromasters_user_id
- _dim__models.yml (tfact_discussion_events): add courserun_fk
- _dim__models.yml (tfact_chatbot_events): add platform_fk
- _fact_tables.yml (tfact_order): add discount_type_fk
- _marts_micromasters__models.yml (marts__micromasters_program_certificates): add program_is_dedp
- _combined_models.yml (int__combined__courserun_enrollments): add courseruncertificate_created_on

Remaining warnings are on dim_program (SELECT * cannot be statically resolved by sqlglot; requires model rewrite to fix).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…imary_platform

The model aliases 'platform' as 'primary_platform' in the final SELECT. The model-level uniqueness test was referencing the non-existent 'platform' column, causing COLUMN_NOT_FOUND errors at test runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dim_course: unique test on course_pk must be scoped to is_current = true. SCD2 intentionally produces multiple rows per course_pk (current + historical versions). The bare unique test failed with 138 results. dim_course_run (same SCD2 pattern) correctly omits the bare unique test; aligned dim_course.

tfact_studentmodule_problems: dedup dim_user joins per platform to prevent fan-out when the same openedx_user_id maps to multiple dim_user rows (merged identities). Added per-platform CTEs (mitxonline_users, mitxpro_users, residential_users) that GROUP BY the joining key, ensuring a 1:1 join. This eliminates 15 duplicate rows that caused the compound unique test to fail with mitxpro data.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Watermark scalar subqueries return NULL for platforms with no existing rows in the target table (e.g. a new platform added after the initial backfill, or the very first incremental run for a platform with no events). The condition 'event_timestamp > NULL' evaluates to NULL (falsy), silently filtering ALL rows for that platform with no error.

tfact_problem_events: Replace the per-platform watermark CTEs + scalar subquery pattern with a LEFT JOIN directly on the shared watermarks CTE per platform, adding (watermarks.max_ts IS NULL OR timestamp > watermarks.max_ts). This matches the established pattern in tfact_certificate/tfact_enrollment/tfact_order.

transform_studentmodule_data macro: Wrap both the watermark_expr and the inline fallback subquery paths with the same IS NULL OR > guard.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
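The failure mode is SQL's three-valued logic, illustrated here in Python with None standing in for SQL NULL:

```python
def unguarded_filter(event_ts, watermark):
    # mirrors: WHERE event_timestamp > (SELECT max(ts) ...)
    # In SQL, '> NULL' evaluates to NULL, which is not TRUE, so every
    # row for a watermark-less platform is silently dropped.
    if watermark is None:
        return False
    return event_ts > watermark

def guarded_filter(event_ts, watermark):
    # mirrors: WHERE watermarks.max_ts IS NULL OR ts > watermarks.max_ts
    # A missing watermark now means "keep everything" (first load).
    return watermark is None or event_ts > watermark
```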
- dim_program.sql: MicroMasters platform_readable_id was 'MITx Online'
(copy/paste error). Replace with var("micromasters") which resolves to
'MicroMasters', consistent with platform_code = 'micromasters'.
- tfact_problem_events.sql: combined_studentmodule CTE used Trino-specific
max_by(col, event_timestamp) for per-attempt aggregation; not portable to
DuckDB. Replace with ROW_NUMBER() over (...order by event_timestamp desc)
windowing approach using two CTEs (combined_studentmodule_ranked + filter).
Also fix broken watermark references: the old per-platform watermark CTEs
(mitxonline_watermark etc.) were removed in the prior NULL-watermark fix but
the combined_studentmodule WHERE clause still referenced them. Switch to the
shared watermarks CTE with LEFT JOIN + IS NULL OR pattern, consistent with
all other CTEs in this model.
- transform_studentmodule_data.sql: Replace Trino-specific json_query() calls
with cross-db macros (json_query_string for scalar strings, json_extract_value
for JSON object/array values) so the macro runs on DuckDB and StarRocks.
- _fact_tables.yml: Update tfact_payment user_fk description — it is no longer
always null; tfact_payment.sql resolves it via the order→user bridge
(order_user_lookup + dim_user joins). Describe actual nullability conditions.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
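The max_by → ROW_NUMBER rewrite keeps exactly one row per grain, the one with the greatest event_timestamp. Equivalent logic in Python (field names are illustrative):

```python
def latest_per_key(rows, key_fields, ts_field):
    # Same effect as ROW_NUMBER() OVER (PARTITION BY key
    # ORDER BY ts DESC) ... WHERE rn = 1: keep only the most
    # recent row for each grain.
    best = {}
    for row in rows:
        key = tuple(row[f] for f in key_fields)
        if key not in best or row[ts_field] > best[key][ts_field]:
            best[key] = row
    return list(best.values())
```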
…latforms

MicroMasters: course_position_in_program is a first-class integer field on the courses_course table (0-based, e.g. SC0x=0, SC1x=1, SC2x=2). Thread it through int__micromasters__program_requirements and into the bridge as course_order. Both core and elective courses carry this value via a join back to the courses CTE.

MITx Online: programrequirement uses Django Treebeard materialized paths where sibling position is encoded lexicographically (e.g. ...0001, ...0002, ...0003). Capture the course node's own path as course_node_path, thread it through child_requirements and combined_requirements, then compute course_order_in_program as row_number() over (partition by program_id order by course_node_path).

xPro: cms_coursesinprogrampage.contents is an ordered JSON array of Wagtail page IDs whose element position = display order. The existing staging model preserves this as array(integer). Change UNNEST to UNNEST WITH ORDINALITY in int__mitxpro__coursesinprogram to recover the array position as course_order.

bridge_program_course: replace the hardcoded '1 as course_order' placeholder with the real course_order values from each platform's intermediate model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…atform fan-out

- dim_course_content: add platform column to all 4 source union branches; propagate through latest_course_structure (group by), window function PARTITION BYs, combined_course_content, and final SELECT
- dim_video: inherit platform from dim_course_content; fix dedup row_number() partition to (platform, block_id); expose platform in output
- dim_discussion_topic: inherit platform from dim_course_content; add to both union branches; include platform in surrogate key generation to prevent collisions when the same courserun_readable_id exists on multiple platforms; expose platform in output
- dim_problem: inherit platform from dim_course_content; fix dedup row_number() partition to (platform, block_id); expose platform in output
- afact_course_page_engagement: scope vertical_structure join by platform
- afact_discussion_engagement: scope discussion_topics and course_content joins by platform
- afact_problem_engagement: scope problem_structure join by platform
- afact_video_engagement: scope all 5 joins (d_video + 4 course_content self-joins) by platform via chain propagation
- tfact_certificate: scope edxorg/micromasters courserun_readable_id join to dim_course_run.platform = 'edxorg' (micromasters courses ran on edxorg infrastructure; same readable_id confirmed to exist on 2 platforms)

Fixes confirmed fan-out: 3 courserun_readable_ids (MITx/14_73x/1T2014, MITx/14.73x/2013_Spring, MITx/14.73x_1/1T2015) exist on both edxorg and mitxonline in dim_course_run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
_dim__models.yml:
- dim_course_content: add platform column with not_null + accepted_values tests; update compound uniqueness test to include platform (was courserun_readable_id+block_id, now platform+courserun_readable_id+block_id)
- dim_video: add platform column; replace bare unique test on video_block_pk with compound_columns_to_be_unique(platform, video_block_pk) since the same block_id can exist on multiple platforms
- dim_problem: same pattern as dim_video
- dim_discussion_topic: add platform column with not_null test

README.md (new):
- Documents that courserun_readable_id / course_readable_id are NOT globally unique identifiers -- they must always be paired with platform
- Lists 3 confirmed duplicate readable IDs (MITx/14_73x/1T2014 etc.)
- Shows correct and incorrect join patterns
- Documents which dimension tables require platform scoping and why
- Documents platform values and their meanings
- Documents the incremental watermark LEFT JOIN pattern

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ct_order

The staging model stg__mitxonline__app__postgres__ecommerce_order has order_updated_on (from the updated_on column), but int__mitxonline__ecommerce_order never selected it, causing tfact_order to fall back to order_created_on as a substitute with a misleading comment 'mitxonline has no updated_on'.

This caused a silent data correctness bug: the incremental watermark in tfact_order uses max(order_updated_on) per platform to detect changed rows. With order_updated_on = order_created_on for MITx Online, any order state change (e.g. pending -> fulfilled) after the initial run is invisible to the watermark and never re-processed, leaving stale order states in the fact table indefinitely.

Fix: add order_updated_on to the int__mitxonline__ecommerce_order SELECT; remove the order_created_on substitution and comment from tfact_order.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dim_course_content: replace Trino-specific 'last_value(expr) IGNORE NULLS OVER (...)' with a new last_value_ignore_nulls() dispatch macro. Trino places IGNORE NULLS after the closing paren; DuckDB requires it inside the function arguments. Using the macro ensures correct syntax on both adapters.

cross_db_functions.sql: add the last_value_ignore_nulls dispatch macro. Default (Trino): last_value(expr) ignore nulls; DuckDB: last_value(expr ignore nulls). Both forms are followed by the OVER clause written inline at the call site.

int__micromasters__program_requirements: qualify all column references in the elective branch of the combined_courses CTE to resolve the AMBIGUOUS_NAME error on 'program_id' and 'course_id' (both elective_courses and courses carry these columns after the inner join added for course_position_in_program).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…topic and dim_problem

- Add unnest_json_map dispatch macro (Trino: native MAP UNNEST, DuckDB: map_keys/map_values subquery)
- Add json_is_object dispatch macro (DuckDB: json_type='OBJECT', Trino: substr start check)
- Use substr() instead of left() to avoid LEFT JOIN keyword ambiguity in the Trino parser
- Add last_value_ignore_nulls dispatch macro (already used by dim_course_content)
- dim_discussion_topic: replace Trino-specific MAP UNNEST and json_extract_scalar with cross-db macros
- dim_problem: replace Trino-specific MAP UNNEST and json_extract_scalar with cross-db macros; pre-filter non-object JSON submissions via json_is_object before UNNEST; wrap source reads in a subquery casting event columns to VARCHAR to handle heterogeneous parquet column types in S3 tracking log files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…atibility

Trino's json_query() expects varchar input, but map_values(map(varchar, json)) produces the native json type. Use transform_values to cast each value to varchar before unnesting, so downstream json_query_string calls receive varchar. transform_values(NULL, ...) propagates NULL safely → UNNEST(NULL) = 0 rows.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…st_json_map

Trino does not support cast(json as varchar); the correct function is json_format(), which serializes the native json type back to its varchar text representation.
Trino does not support cast(json as varchar); json_format() is required to serialize native json type to varchar text. This affects the json_is_object pre-filter in dim_problem's problem_events CTE. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…before dim join

Cybersource Secure Acceptance sends req_payment_method='card' for credit card transactions, but dim_payment_method uses the canonical code 'credit_card'. The raw value flowed unchanged into tfact_payment's LEFT JOIN, silently producing NULL payment_method_fk for all credit card rows. No not_null test existed to detect this.

Fix:
- Normalize 'card' → 'credit_card' via CASE in both intermediate models (int__mitxonline__ecommerce_transaction, int__mitxpro__ecommerce_receipt) so all other consumers of these fields also get canonical values
- Add a not_null test to tfact_payment.payment_method_fk to prevent regression
- Document the raw→canonical mapping in a dim_payment_method comment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
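The normalization itself is a one-case mapping, sketched in Python (only the mapping named in the commit is shown):

```python
def normalize_payment_method(raw_method):
    # Cybersource reports 'card' for credit card transactions, while
    # dim_payment_method keys on the canonical 'credit_card'. Without
    # this mapping the dimension FK join silently yields NULL.
    if raw_method == "card":
        return "credit_card"
    return raw_method
```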
…nrollments

tfact_enrollment processes residential enrollments and joins them to dim_course_run using platform = 'residential'. Since dim_course_run had no residential rows, all residential enrollment facts had courserun_fk = NULL (a silent data integrity failure).

Add a residential_courseruns CTE sourced from int__mitxresidential__courseruns using the same pattern as edxorg: source_id/course_id as NULL (no integer PK available), courserun_is_live as NULL (residential does not expose a published/live flag). The 'residential' platform already exists in dim_platform (platforms.csv seed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Cybersource does not return req_payment_method for refund transactions and zero-dollar payments, resulting in ~17k legitimate NULL payment_method_fk values in tfact_payment. The not_null constraint was too strict. Replace with a relationships test scoped to non-null values (matching the pattern used by other nullable FKs like user_fk). Document the nullability in both the yml description and dim_payment_method. The 'card' → 'credit_card' normalization added in the prior commit remains correct and prevents NULLs for actual credit card charges. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…n Trino

Trino's from_iso8601_timestamp() uses the session timezone (America/New_York) to interpret timezone-less ISO-8601 strings. Timestamps that fall in the DST spring-forward gap (e.g. 2020-03-08T02:19:36) throw:

INVALID_FUNCTION_ARGUMENT: Illegal instant due to time zone offset transition

All source timestamps in this project are stored as UTC without a timezone suffix (Django default behavior). Replace from_iso8601_timestamp() with with_timezone(cast(x as timestamp), 'UTC'), which correctly interprets them as UTC.

Changes:
- from_iso8601_timestamp macro (default/Trino): use with_timezone(cast, 'UTC')
- _stg_mitxpro__models.yml recency test field: same
- _stg_mitxonline__models.yml recency test field: fix from_iso8601_timestamp_nanos
- _mitxpro__sources.yml loaded_at_field: same
- _mitxonline__sources.yml loaded_at_field: same

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…1_timestamp macro

The previous fix (with_timezone + cast) broke strings that already carry a timezone offset (e.g. '2025-05-13T11:38:39.923495000-04:00' from to_iso8601()) because Trino's cast(varchar as timestamp) cannot parse ISO-8601 strings with offsets or nanoseconds.

Correct approach: detect whether the string already has a timezone marker (ending in Z/z or +HH:MM/-HH:MM via regexp_like) and only append 'Z' when absent. Strings with an existing zone pass through to from_iso8601_timestamp() unchanged.

Also update YAML recency test fields and source freshness loaded_at_field to use from_iso8601_timestamp(col || 'Z') directly (these fields are known UTC-without-TZ).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
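The detection logic can be illustrated in Python; the macro's exact regexp is not quoted in this thread, so the pattern below is an assumption matching the description (trailing Z/z or ±HH:MM):

```python
import re

# A trailing 'Z'/'z' or +HH:MM/-HH:MM offset marks an explicit timezone;
# only zone-less strings get 'Z' appended so Trino interprets them as
# UTC instead of the session timezone.
_TZ_MARKER = re.compile(r"([Zz]|[+-]\d{2}:\d{2})$")

def ensure_utc_marker(ts_string):
    if _TZ_MARKER.search(ts_string):
        return ts_string  # already zoned: pass through unchanged
    return ts_string + "Z"
```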
…updated_on tracking

Addresses PR review feedback from @rachellougee:

1. Program enrollment null fields: int__mitxonline__programenrollments has enrollment_mode and enrollment_status; they were hardcoded as null in tfact_enrollment. Now populated from programenrollment_enrollment_mode/status.
2. Status change detection: the tfact_enrollment watermark was based solely on enrollment_created_on. Status/active flag changes (e.g. deactivation) would not advance the watermark and would be silently dropped on incremental runs. Fix: watermark on coalesce(enrollment_updated_on, enrollment_created_on) and use >= comparison so boundary-second updates are recaptured.
3. Propagate updated_on through intermediates:
   - int__mitxonline__courserunenrollments: add courserunenrollment_updated_on
   - int__mitxpro__courserunenrollments: add courserunenrollment_updated_on
   - int__mitxonline__programenrollments: add programenrollment_updated_on

These replace the cast(null as varchar) placeholders in tfact_enrollment. edxorg (S3 snapshot) and residential (OpenEdX student_courseenrollment) have no updated_on field in their source tables; those remain null as enrollment_updated_on.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…act_enrollment

courserunenrollment_enrollment_status was already present in both the xPro staging model (as change_status) and the intermediate model, but was hardcoded as null in tfact_enrollment's mitxpro_enrollments CTE. Now correctly mapped.

edxorg (S3 snapshot) and residential (OpenEdX) genuinely have no enrollment_status field in their source tables; those remain null.
edxorg courserun_readable_id uses old-style slash format (MITx/6.00.1x/1T2014),
converted from course-v1: format in int__edxorg__mitx_courseruns. The CASE statement
in the dim_course join only stripped the run segment for 'course-v1:' prefixed IDs;
the ELSE branch passed the full 3-segment slash ID as the join key, which never
matched dim_course.course_readable_id (2-segment: MITx/6.00.1x), resulting in
course_fk = NULL for all edxorg course runs.
Fix: add a second WHEN branch using regexp_like to detect 3-segment slash IDs and
strip the trailing /{run} segment before the join.
Note: the AI agent's stated hypothesis (that dim_course stores 3-segment IDs) was
incorrect — dim_course always stores 2-segment course IDs. The actual bug was in
the ELSE branch silently passing the run ID as the course ID for slash-format runs.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
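The added WHEN branch reduces to "strip the trailing /{run} segment from 3-segment slash IDs". A Python sketch of just that branch (the model's other CASE branches are left out):

```python
import re

# Three path segments separated by slashes, e.g. 'MITx/6.00.1x/1T2014'
_SLASH_3SEG = re.compile(r"^[^/]+/[^/]+/[^/]+$")

def course_id_for_join(courserun_readable_id):
    # Old-style edxorg run ids must lose their trailing run segment to
    # match dim_course's 2-segment course ids ('MITx/6.00.1x'). Other
    # formats are handled by the model's existing CASE branches and
    # simply pass through here.
    if _SLASH_3SEG.match(courserun_readable_id):
        return courserun_readable_id.rsplit("/", 1)[0]
    return courserun_readable_id
```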
course_number was selected in new_and_changed_courses but omitted from the incremental change-detection WHERE predicate. Changes to course_number alone (without a concurrent title/description/is_live change) would be silently dropped — no new historical record created and the existing row left with stale data. Fix: add coalesce(existing.course_number, '') = coalesce(current_courses.course_number, '') to the NOT EXISTS predicate, consistent with how course_description is handled. Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ection

In dim_course_run, residential course runs have courserun_is_live = NULL (set explicitly in the residential_courseruns CTE). The comparison existing.courserun_is_live = courseruns_with_all_fks.courserun_is_live evaluates to NULL (not TRUE) when both sides are NULL, so the NOT EXISTS predicate never matches existing residential rows. Every incremental run treats all residential course runs as changed, creating unbounded duplicate SCD2 records.

The same pattern was found and fixed in dim_course for course_is_live, which can be NULL for platforms without a native is_live flag.

Fix: wrap both sides with coalesce(..., false), consistent with how other nullable fields in the same predicate are handled (courserun_start_on etc. use coalesce with '').

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
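The root cause is again three-valued logic: in SQL, NULL = NULL is NULL, not TRUE. In Python terms (None standing in for NULL):

```python
def sql_eq(a, b):
    # plain SQL equality: NULL on either side yields NULL, which the
    # NOT EXISTS predicate treats as "no match", so two NULL is_live
    # values never compare as unchanged
    if a is None or b is None:
        return None
    return a == b

def coalesced_eq(a, b):
    # the fix: coalesce both sides to false first, so NULL-vs-NULL
    # compares equal and the row is recognized as unchanged
    return (a if a is not None else False) == (b if b is not None else False)
```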
I think this is good enough for the first pass since dbt runs successfully. However, there are some data quality issues we’ll need to address once we start using these models in reporting - e.g. dim_course_run, dim_course, dim_program
For example, dim_course_run still has a failing test that is currently only set to warn:
15:26:59 Warning in test not_null_dim_course_run_course_fk (models/dimensional/_dim_course_run.yml)
15:26:59 Got 2872 results, configured to warn if != 0
15:26:59
Looking into one case, the course_fk for source_id = 1803 in dim_course_run is null, but it should reference source_id = 255 in dim_course.
Since these models are not currently used in any downstream models, I’m okay leaving it as-is for now, but we will need to validate them once we start using them.
What are the relevant tickets?
Addresses suggestions in https://docs.google.com/document/d/109CCD7cUeaGs2W6TosCGcXdycQiyfKAd1bkXg8MxC7c/edit?tab=t.0#heading=h.3qkb2s5dnkij
Description (What does it do?)
Adds the proposed dimensional models to improve our coverage and adapt to Kimball modeling best practices
How can this be tested?
Run the dbt models and verify that they complete and that the data looks correct.