
Expand coverage of dimensional schemas #1868

Merged
blarghmatey merged 85 commits into main from feature/dimensional-model-expansion on Apr 6, 2026
Conversation

@blarghmatey

@blarghmatey blarghmatey commented Jan 28, 2026

What are the relevant tickets?

Addresses suggestions in https://docs.google.com/document/d/109CCD7cUeaGs2W6TosCGcXdycQiyfKAd1bkXg8MxC7c/edit?tab=t.0#heading=h.3qkb2s5dnkij

Description (What does it do?)

Adds the proposed dimensional models to improve our coverage and align with Kimball modeling best practices.

How can this be tested?

Run the dbt models and verify that they complete and that the data looks correct.

Copilot AI left a comment


Pull request overview

This PR expands the dimensional modeling coverage by adding 11 dimension tables, 4 transaction fact tables, 4 bridge tables, and 2 utility macros following Kimball best practices. The changes adapt the data warehouse to support comprehensive business intelligence reporting across MIT Open Learning platforms.

Changes:

  • Added transaction fact tables for enrollments, orders, payments, and certificates with incremental loading
  • Created dimension tables for courses, course runs, programs, products, instructors, topics, departments, dates, and reference data
  • Implemented bridge tables to handle many-to-many relationships between courses and instructors/topics/departments, and programs and courses
  • Added utility macros for ISO8601 date parsing and resource type inference
  • Fixed Python version requirement from 3.14 (non-existent) to 3.13

Reviewed changes

Copilot reviewed 31 out of 32 changed files in this pull request and generated 29 comments.

Summary per file:

  • tfact_payment.sql: Transaction fact for payment events across platforms with incremental loading
  • tfact_order.sql: Transaction fact for e-commerce orders with incremental loading
  • tfact_enrollment.sql: Transaction fact for course/program enrollments from all platforms
  • tfact_certificate.sql: Transaction fact for certificate issuance events
  • dim_topic.sql: Hierarchical topic taxonomy dimension consolidated from all platforms
  • dim_program.sql: Program catalog dimension with SCD Type 2 tracking
  • dim_product.sql: Product catalog dimension with SCD Type 2 for price changes
  • dim_payment_method.sql: Static reference dimension for payment methods
  • dim_instructor.sql: Instructor dimension consolidated from all platforms
  • dim_discount_type.sql: Static reference dimension for discount types
  • dim_department.sql: MIT department dimension
  • dim_date.sql: Standard date dimension with MIT fiscal and academic calendars
  • dim_course_run.sql: Specific course offering dimension with SCD Type 2
  • dim_course.sql: Master course catalog dimension with SCD Type 2
  • dim_certificate_type.sql: Static reference dimension for certificate types
  • bridge_program_course.sql: Many-to-many bridge between programs and courses
  • bridge_course_topic.sql: Many-to-many bridge between courses and topics
  • bridge_course_instructor.sql: Many-to-many bridge between courses and instructors
  • bridge_course_department.sql: Many-to-many bridge between courses and departments
  • _reference_dimensions.yml: Schema documentation for static reference dimensions
  • _phase3_content_catalog.yml: Schema documentation for instructor, topic, and department dimensions
  • _fact_tables.yml: Schema documentation and tests for transaction fact tables
  • _dim_program.yml: Schema documentation and tests for program dimension
  • _dim_product.yml: Schema documentation and tests for product dimension
  • _dim_date.yml: Schema documentation for date dimension
  • _dim_course_run.yml: Schema documentation and tests for course run dimension
  • _dim_course.yml: Schema documentation and tests for course dimension
  • _bridge_program_course.yml: Schema documentation and tests for program-course bridge
  • safe_parse_iso8601_date.sql: Utility macro for safely parsing ISO8601 date strings
  • infer_resource_type.sql: Utility macro for inferring resource type from URL
  • pyproject.toml: Fixed Python version requirement from 3.14 to 3.13


blarghmatey added a commit that referenced this pull request Jan 30, 2026
- Remove tests for Phase 3+ dimensions (dim_platform, dim_user) that don't exist yet
- Fix incremental logic by adding timestamp columns to final SELECT statements
- Add platform checks to joins to prevent incorrect FK matches across platforms
- Fix Trino SQL syntax: change FORMAT_DATETIME to date_format, CONCAT to ||
- Fix fiscal_quarter calculation logic in dim_date
- Fix column name typo in dim_program (certification_type -> program_certification_type)
- Use synthetic enrollment_id for edxorg to avoid null key collisions
- Replace unique tests using null platform_fk with platform string field
- Add platform column to dimension outputs
- Update comments for clarity
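The Trino syntax portion of the fixes above can be sketched as follows (table and column names are hypothetical, for illustration only):

```sql
-- Trino has no FORMAT_DATETIME; use date_format with MySQL-style tokens.
-- String concatenation uses the standard || operator rather than CONCAT.
select
    date_format(enrollment_created_on, '%Y-%m-%d') as enrollment_date,
    platform || '-' || cast(user_id as varchar) as synthetic_enrollment_id
from enrollments
```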
@blarghmatey blarghmatey requested a review from Copilot January 30, 2026 17:16

Copilot AI left a comment


Pull request overview

Copilot reviewed 31 out of 32 changed files in this pull request and generated 9 comments.



@KatelynGit

The organization (a key business concept) dimension appears to be missing. Should we add it?

blarghmatey added a commit that referenced this pull request Feb 19, 2026
@blarghmatey blarghmatey force-pushed the feature/dimensional-model-expansion branch from a41f83c to 646a8e9 on February 19, 2026 at 21:29

Copilot AI left a comment


Pull request overview

Copilot reviewed 31 out of 31 changed files in this pull request and generated 19 comments.



blarghmatey added a commit that referenced this pull request Feb 20, 2026
- fix(macro): fix format_datetime token replacement order (MMMM/MMM before MM)
- fix(macro): make generate_micromasters_program_readable_id cross-db compatible
  with DuckDB dispatch using list_transform/string_split/list_aggregate
- fix(dim_program): add platform_code field and use it in surrogate key to prevent
  pk collision between mitxonline and micromasters programs
- fix(dim_program): use generate_micromasters_program_readable_id macro instead
  of cast(program_id as varchar)
- fix(dim_course_run): implement SCD Type 2 expiration (records_to_expire CTE)
- fix(dim_product): implement SCD Type 2 expiration (records_to_expire CTE)
- fix(bridge_program_course): add platform to source CTEs and join on it
- fix(tfact_enrollment): change user_fk type from varchar to integer
- fix(tfact_*): change incremental filter from > to >= to avoid missing same-ts rows
- fix(dim_topic): compute parent_topic_fk via self-join instead of always NULL
- test(dim_course_run,dim_product): add SCD2 uniqueness tests for is_current rows
- docs(dim_program): document platform_readable_id vs platform_code distinction

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
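A minimal sketch of the records_to_expire SCD Type 2 pattern named above (a CTE fragment with assumed column names; the real models differ):

```sql
-- On incremental runs, find current rows whose tracked attributes changed,
-- so they can be closed out (is_current = false, end date set) before the
-- new versions are inserted.
{% if is_incremental() %}
records_to_expire as (
    select existing.product_pk
    from {{ this }} as existing
    inner join new_records
        on existing.product_id = new_records.product_id
    where existing.is_current = true
      and existing.product_price <> new_records.product_price
)
{% endif %}
```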

Copilot AI left a comment


Pull request overview

Copilot reviewed 47 out of 47 changed files in this pull request and generated 32 comments.



blarghmatey added a commit that referenced this pull request Feb 20, 2026
Platform codes (readable names → lowercase codes):
- dim_course.sql, dim_course_run.sql, dim_product.sql: replace
  var('mitxonline')/'mitxpro'/'edxorg' with literal codes
- bridge_course_instructor, bridge_course_topic, bridge_course_department:
  same fix; joins to dim_course.primary_platform now match
- bridge_program_course: fix join from platform_readable_id → platform_code
  to align with dim_course.primary_platform (lowercase code)
- tfact_enrollment, tfact_certificate, tfact_payment, tfact_order: platform
  column now uses lowercase codes; CASE expressions updated accordingly

platform_fk integer → varchar:
- dim_course_run, dim_program, dim_product, tfact_enrollment,
  tfact_certificate, tfact_payment, tfact_order: cast(null as integer)
  → cast(null as varchar) to match dim_platform.platform_pk type (surrogate key)

Add incremental_strategy='delete+insert':
- dim_course, dim_course_run, dim_product, tfact_enrollment,
  tfact_certificate, tfact_payment, tfact_order: explicit strategy for
  predictable cross-adapter incremental behavior

dim_course_run grain includes platform:
- Surrogate key inputs now include 'platform' to prevent collisions when
  the same courserun_readable_id exists on multiple platforms (3 confirmed)
- not-exists incremental check now includes platform predicate
- records_to_expire join now includes platform predicate
- _dim_course_run.yml unique combination test now includes platform

Logic bugs fixed:
- tfact_order: discount_type_code CASE moved 'Fixed Price: 0%' (free)
  check before trailing '%' (percentage) so free orders are correctly
  classified instead of falling through to 'percentage'
- tfact_enrollment: edxorg enrollment ID changed from non-deterministic
  row_number() to stable surrogate_key(user_id, courserun_readable_id)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
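The discount_type_code ordering fix can be illustrated as follows (column name and the else branch value are assumptions; escape syntax per Trino LIKE):

```sql
-- Branch order matters: 'Fixed Price: 0%' itself ends in '%', so the free
-- check must come before the generic trailing-'%' percentage check, or free
-- orders are misclassified as 'percentage'.
case
    when discount_amount_text = 'Fixed Price: 0%' then 'free'
    when discount_amount_text like '%!%' escape '!' then 'percentage'
    else 'fixed_price'
end as discount_type_code
```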
@KatelynGit KatelynGit assigned KatelynGit and unassigned KatelynGit Feb 23, 2026
@KatelynGit KatelynGit self-assigned this Feb 23, 2026
blarghmatey and others added 28 commits April 4, 2026 12:16
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…udentmodule_problems

The upstream model no longer has event_type (always 'problem_check') or event_json
(replaced by correct_map + answers). The combined_studentmodule CTE used select *
which no longer includes these columns.

Fix: project explicitly in combined_studentmodule, supplying:
  - 'problem_check' as event_type (literal constant — was always this value)
  - correct_map as event_json (per-submission correctness map is the equivalent
    structured data for consumers that expect event_json from this model)

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…watermarks

- Change materialization from 'table' to 'incremental' to avoid full scans of
  ~2TB of tracking logs on every run
- Add per-platform watermarks (mitxonline, residential, mitxpro, edxorg) so each
  incremental run only processes new events
- Add generate_surrogate_key('platform','openedx_user_id','courserun_readable_id',
  'problem_block_id','attempt','event_type','event_timestamp') as event_id for
  idempotent upserts
- Pre-aggregate combined_studentmodule to per-attempt grain (max_by on event_timestamp)
  before the union, reducing ~57.5M rows to ~28M before the dedup window
- Full-refresh path skips watermark CTEs for backfill
- Update _dim__models.yml: add event_id column description + unique test, replace
  compound column test, expand model description

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
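The incremental setup described above might look like this in dbt (a sketch; everything beyond the key columns named in the message is an assumption):

```sql
{{
    config(
        materialized='incremental',
        unique_key='event_id'
    )
}}

select
    -- Deterministic key over the full event grain enables idempotent upserts
    {{ dbt_utils.generate_surrogate_key([
        'platform',
        'openedx_user_id',
        'courserun_readable_id',
        'problem_block_id',
        'attempt',
        'event_type',
        'event_timestamp'
    ]) }} as event_id,
    events.*
from events  -- stand-in for the model's upstream CTEs
```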
Resolved all 10 undocumented-column warnings found by ol-dbt validate:

- _dim_course_run.yml: add courserun_start_on, courserun_end_on, enrollment_start,
  enrollment_end, source_id
- _course_catalog_dimensions.yml (dim_topic): add parent_topic_name
- _dim__models.yml (dim_user): add micromasters_user_id
- _dim__models.yml (tfact_discussion_events): add courserun_fk
- _dim__models.yml (tfact_chatbot_events): add platform_fk
- _fact_tables.yml (tfact_order): add discount_type_fk
- _marts_micromasters__models.yml (marts__micromasters_program_certificates): add
  program_is_dedp
- _combined_models.yml (int__combined__courserun_enrollments): add
  courseruncertificate_created_on

Remaining warnings are on dim_program (SELECT * cannot be statically resolved by
sqlglot — requires model rewrite to fix).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…imary_platform

The model aliases 'platform' as 'primary_platform' in the final SELECT.
The model-level uniqueness test was referencing the non-existent 'platform'
column, causing COLUMN_NOT_FOUND errors at test runtime.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dim_course: unique test on course_pk must be scoped to is_current = true.
SCD2 intentionally produces multiple rows per course_pk (current + historical
versions). The bare unique test failed with 138 results. dim_course_run
(same SCD2 pattern) correctly omits the bare unique test; aligned dim_course.

tfact_studentmodule_problems: dedup dim_user joins per platform to prevent
fan-out when the same openedx_user_id maps to multiple dim_user rows (merged
identities). Added per-platform CTEs (mitxonline_users, mitxpro_users,
residential_users) that GROUP BY the joining key, ensuring 1:1 join. This
eliminates 15 duplicate rows that caused the compound unique test to fail
with mitxpro data.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Watermark scalar subqueries return NULL for platforms with no existing rows
in the target table (e.g. a new platform added after the initial backfill,
or the very first incremental run for a platform with no events). The condition
'event_timestamp > NULL' evaluates to NULL (falsy), silently filtering ALL
rows for that platform with no error.

tfact_problem_events: Replace per-platform watermark CTEs + scalar subquery
pattern with LEFT JOIN directly on the shared watermarks CTE per platform,
adding (watermarks.max_ts IS NULL OR timestamp > watermarks.max_ts). This
matches the established pattern in tfact_certificate/tfact_enrollment/tfact_order.

transform_studentmodule_data macro: Wrap both the watermark_expr and the
inline fallback subquery paths with the same IS NULL OR > guard.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
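The guarded watermark pattern reads roughly as follows (a sketch with assumed names, mirroring the LEFT JOIN + IS NULL OR guard described above):

```sql
with watermarks as (
    select platform, max(event_timestamp) as max_ts
    from {{ this }}
    group by platform
)

select source_events.*
from source_events
left join watermarks
    on watermarks.platform = source_events.platform
-- max_ts is NULL for a platform with no existing rows; 'ts > NULL' evaluates
-- to NULL (falsy) and would silently drop every row, so guard it explicitly.
where watermarks.max_ts is null
   or source_events.event_timestamp > watermarks.max_ts
```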
- dim_program.sql: MicroMasters platform_readable_id was 'MITx Online'
  (copy/paste error). Replace with var("micromasters") which resolves to
  'MicroMasters', consistent with platform_code = 'micromasters'.

- tfact_problem_events.sql: combined_studentmodule CTE used Trino-specific
  max_by(col, event_timestamp) for per-attempt aggregation; not portable to
  DuckDB. Replace with ROW_NUMBER() over (...order by event_timestamp desc)
  windowing approach using two CTEs (combined_studentmodule_ranked + filter).
  Also fix broken watermark references: the old per-platform watermark CTEs
  (mitxonline_watermark etc.) were removed in the prior NULL-watermark fix but
  the combined_studentmodule WHERE clause still referenced them. Switch to the
  shared watermarks CTE with LEFT JOIN + IS NULL OR pattern, consistent with
  all other CTEs in this model.

- transform_studentmodule_data.sql: Replace Trino-specific json_query() calls
  with cross-db macros (json_query_string for scalar strings, json_extract_value
  for JSON object/array values) so the macro runs on DuckDB and StarRocks.

- _fact_tables.yml: Update tfact_payment user_fk description — it is no longer
  always null; tfact_payment.sql resolves it via the order→user bridge
  (order_user_lookup + dim_user joins). Describe actual nullability conditions.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…latforms

MicroMasters: course_position_in_program is a first-class integer field on the
courses_course table (0-based, e.g. SC0x=0, SC1x=1, SC2x=2). Thread it through
int__micromasters__program_requirements and into the bridge as course_order.
Both core and elective courses carry this value via a join back to the courses CTE.

MITx Online: programrequirement uses Django Treebeard materialized paths where
sibling position is encoded lexicographically (e.g. ...0001, ...0002, ...0003).
Capture the course node's own path as course_node_path, thread it through
child_requirements and combined_requirements, then compute course_order_in_program
as row_number() over (partition by program_id order by course_node_path).

xPro: cms_coursesinprogrampage.contents is an ordered JSON array of Wagtail page
IDs whose element position = display order. The existing staging model preserves
this as array(integer). Change UNNEST to UNNEST WITH ORDINALITY in
int__mitxpro__coursesinprogram to recover the array position as course_order.

bridge_program_course: replace hardcoded '1 as course_order' placeholder with
the real course_order values from each platform's intermediate model.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
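The xPro ordering recovery via WITH ORDINALITY can be sketched as (assumed table and column names):

```sql
-- Trino: WITH ORDINALITY exposes each array element's 1-based position,
-- which here encodes the course's display order within the program page.
select
    pages.program_id,
    contents.course_page_id,
    contents.course_order
from cms_coursesinprogrampage as pages
cross join unnest(pages.contents) with ordinality
    as contents (course_page_id, course_order)
```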
…atform fan-out

dim_course_content: add platform column to all 4 source union branches;
propagate through latest_course_structure (group by), window function
PARTITION BYs, combined_course_content, and final SELECT.

dim_video: inherit platform from dim_course_content; fix dedup
row_number() partition to (platform, block_id); expose platform in output.

dim_discussion_topic: inherit platform from dim_course_content; add to
both union branches; include platform in surrogate key generation to
prevent collisions when same courserun_readable_id exists on multiple
platforms; expose platform in output.

dim_problem: inherit platform from dim_course_content; fix dedup
row_number() partition to (platform, block_id); expose platform in output.

afact_course_page_engagement: scope vertical_structure join by platform.
afact_discussion_engagement: scope discussion_topics and course_content
joins by platform.
afact_problem_engagement: scope problem_structure join by platform.
afact_video_engagement: scope all 5 joins (d_video + 4 course_content
self-joins) by platform via chain propagation.

tfact_certificate: scope edxorg/micromasters courserun_readable_id join
to dim_course_run.platform = 'edxorg' (micromasters courses ran on edxorg
infrastructure; same readable_id confirmed to exist on 2 platforms).

Fixes confirmed fan-out: 3 courserun_readable_ids (MITx/14_73x/1T2014,
MITx/14.73x/2013_Spring, MITx/14.73x_1/1T2015) exist on both edxorg
and mitxonline in dim_course_run.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
_dim__models.yml:
- dim_course_content: add platform column with not_null + accepted_values
  tests; update compound uniqueness test to include platform
  (was courserun_readable_id+block_id, now platform+courserun_readable_id+block_id)
- dim_video: add platform column; replace bare unique test on video_block_pk
  with compound_columns_to_be_unique(platform, video_block_pk) since the
  same block_id can exist on multiple platforms
- dim_problem: same pattern as dim_video
- dim_discussion_topic: add platform column with not_null test

README.md (new):
- Documents that courserun_readable_id / course_readable_id are NOT globally
  unique identifiers -- they must always be paired with platform
- Lists 3 confirmed duplicate readable IDs (MITx/14_73x/1T2014 etc.)
- Shows correct and incorrect join patterns
- Documents which dimension tables require platform scoping and why
- Documents platform values and their meanings
- Documents the incremental watermark LEFT JOIN pattern

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
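The README's join guidance amounts to the following (illustrative fact/dimension names):

```sql
-- Incorrect: courserun_readable_id alone fans out when the same ID exists on
-- more than one platform (e.g. MITx/14_73x/1T2014 on both edxorg and mitxonline):
--   join dim_course_run
--       on dim_course_run.courserun_readable_id = facts.courserun_readable_id

-- Correct: always pair the readable ID with platform.
select facts.*, dim_course_run.courserun_pk
from facts
inner join dim_course_run
    on dim_course_run.courserun_readable_id = facts.courserun_readable_id
    and dim_course_run.platform = facts.platform
```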
…ct_order

The staging model stg__mitxonline__app__postgres__ecommerce_order has
order_updated_on (from the updated_on column), but int__mitxonline__ecommerce_order
never selected it, causing tfact_order to fall back to order_created_on as
a substitute with a misleading comment 'mitxonline has no updated_on'.

This caused a silent data correctness bug: the incremental watermark in
tfact_order uses max(order_updated_on) per platform to detect changed rows.
With order_updated_on = order_created_on for MITx Online, any order state
change (e.g. pending -> fulfilled) after the initial run is invisible to
the watermark and never re-processed, leaving stale order states in the
fact table indefinitely.

Fix: add order_updated_on to int__mitxonline__ecommerce_order SELECT;
remove the order_created_on substitution and comment from tfact_order.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
dim_course_content: replace Trino-specific 'last_value(expr) IGNORE NULLS
OVER (...)' with new last_value_ignore_nulls() dispatch macro. Trino places
IGNORE NULLS after the closing paren; DuckDB requires it inside the function
arguments. Using the macro ensures correct syntax on both adapters.

cross_db_functions.sql: add last_value_ignore_nulls dispatch macro.
  default (Trino): last_value(expr) ignore nulls
  duckdb:          last_value(expr ignore nulls)
Both forms are followed by the OVER clause written inline at the call site.

int__micromasters__program_requirements: qualify all column references in
the elective branch of combined_courses CTE to resolve AMBIGUOUS_NAME error
on 'program_id' and 'course_id' (both elective_courses and courses carry
these columns after the inner join added for course_position_in_program).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
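Based on the commit message, the dispatch macro would look roughly like this (a sketch, not the repository's exact source):

```sql
{% macro last_value_ignore_nulls(expr) %}
    {{ return(adapter.dispatch('last_value_ignore_nulls')(expr)) }}
{% endmacro %}

{# default (Trino): IGNORE NULLS sits after the closing paren #}
{% macro default__last_value_ignore_nulls(expr) %}
    last_value({{ expr }}) ignore nulls
{% endmacro %}

{# DuckDB: IGNORE NULLS goes inside the function arguments #}
{% macro duckdb__last_value_ignore_nulls(expr) %}
    last_value({{ expr }} ignore nulls)
{% endmacro %}
```

Callers append the OVER clause inline at the call site, as the commit notes.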
…topic and dim_problem

- Add unnest_json_map dispatch macro (Trino: native MAP UNNEST, DuckDB: map_keys/map_values subquery)
- Add json_is_object dispatch macro (DuckDB: json_type='OBJECT', Trino: substr start check)
  - Use substr() instead of left() to avoid LEFT JOIN keyword ambiguity in Trino parser
- Add last_value_ignore_nulls dispatch macro (already used by dim_course_content)
- dim_discussion_topic: replace Trino-specific MAP UNNEST and json_extract_scalar with cross-db macros
- dim_problem: replace Trino-specific MAP UNNEST and json_extract_scalar with cross-db macros;
  pre-filter non-object JSON submissions via json_is_object before UNNEST;
  wrap source reads in subquery casting event columns to VARCHAR to handle
  heterogeneous parquet column types in S3 tracking log files

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…atibility

Trino's json_query() expects varchar input but map_values(map(varchar, json))
produces the native json type. Use transform_values to cast each value to varchar
before unnesting, so downstream json_query_string calls receive varchar.
transform_values(NULL, ...) propagates NULL safely → UNNEST(NULL) = 0 rows.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…st_json_map

Trino does not support cast(json as varchar); the correct function is json_format()
which serializes the native json type back to its varchar text representation.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Trino does not support cast(json as varchar); json_format() is required
to serialize native json type to varchar text. This affects the json_is_object
pre-filter in dim_problem's problem_events CTE.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…before dim join

Cybersource Secure Acceptance sends req_payment_method='card' for credit card
transactions, but dim_payment_method uses the canonical code 'credit_card'.
The raw value flowed unchanged into tfact_payment's LEFT JOIN, silently producing
NULL payment_method_fk for all credit card rows. No not_null test existed to
detect this.

Fix:
- Normalize 'card' → 'credit_card' via CASE in both intermediate models
  (int__mitxonline__ecommerce_transaction, int__mitxpro__ecommerce_receipt)
  so all other consumers of these fields also get canonical values
- Add not_null test to tfact_payment.payment_method_fk to prevent regression
- Document the raw→canonical mapping in dim_payment_method comment

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…nrollments

tfact_enrollment processes residential enrollments and joins them to dim_course_run
using platform = 'residential'. Since dim_course_run had no residential rows,
all residential enrollment facts had courserun_fk = NULL (silent data integrity failure).

Add residential_courseruns CTE sourced from int__mitxresidential__courseruns using the
same pattern as edxorg: source_id/course_id as NULL (no integer PK available),
courserun_is_live as NULL (residential does not expose a published/live flag).
The 'residential' platform already exists in dim_platform (platforms.csv seed).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Cybersource does not return req_payment_method for refund transactions and
zero-dollar payments, resulting in ~17k legitimate NULL payment_method_fk values
in tfact_payment. The not_null constraint was too strict.

Replace with a relationships test scoped to non-null values (matching the pattern
used by other nullable FKs like user_fk). Document the nullability in both the
yml description and dim_payment_method.

The 'card' → 'credit_card' normalization added in the prior commit remains correct
and prevents NULLs for actual credit card charges.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
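A relationships test scoped to non-null values might take this shape in the yml (a sketch; key names such as payment_method_pk are assumptions):

```yaml
columns:
  - name: payment_method_fk
    description: >
      Nullable: Cybersource omits req_payment_method for refund and
      zero-dollar transactions.
    data_tests:
      - relationships:
          to: ref('dim_payment_method')
          field: payment_method_pk
          config:
            where: "payment_method_fk is not null"
```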
…n Trino

Trino's from_iso8601_timestamp() uses the session timezone (America/New_York) to
interpret timezone-less ISO-8601 strings. Timestamps that fall in the DST spring-
forward gap (e.g. 2020-03-08T02:19:36) throw:
  INVALID_FUNCTION_ARGUMENT: Illegal instant due to time zone offset transition

All source timestamps in this project are stored as UTC without a timezone suffix
(Django default behavior). Replace from_iso8601_timestamp() with
with_timezone(cast(x as timestamp), 'UTC') which correctly interprets them as UTC.

Changes:
- from_iso8601_timestamp macro (default/Trino): use with_timezone(cast, 'UTC')
- _stg_mitxpro__models.yml recency test field: same
- _stg_mitxonline__models.yml recency test field: fix from_iso8601_timestamp_nanos
- _mitxpro__sources.yml loaded_at_field: same
- _mitxonline__sources.yml loaded_at_field: same

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…1_timestamp macro

Previous fix (with_timezone + cast) broke strings that already carry a timezone offset
(e.g. '2025-05-13T11:38:39.923495000-04:00' from to_iso8601()) because Trino's
cast(varchar as timestamp) cannot parse ISO-8601 strings with offsets or nanoseconds.

Correct approach: detect whether the string already has a timezone marker
(ending in Z/z or +HH:MM/-HH:MM via regexp_like) and only append 'Z' when absent.
Strings with an existing zone pass through to from_iso8601_timestamp() unchanged.

Also update YAML recency test fields and source freshness loaded_at_field to use
from_iso8601_timestamp(col || 'Z') directly (these fields are known UTC-without-TZ).

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…updated_on tracking

Addresses PR review feedback from @rachellougee:

1. Program enrollment null fields: int__mitxonline__programenrollments has
   enrollment_mode and enrollment_status; they were hardcoded as null in tfact_enrollment.
   Now populated from programenrollment_enrollment_mode/status.

2. Status change detection: tfact_enrollment watermark was based solely on
   enrollment_created_on. Status/active flag changes (e.g. deactivation) would not
   advance the watermark and would be silently dropped on incremental runs.
   Fix: watermark on coalesce(enrollment_updated_on, enrollment_created_on) and use
   >= comparison so boundary-second updates are recaptured.

3. Propagate updated_on through intermediates:
   - int__mitxonline__courserunenrollments: add courserunenrollment_updated_on
   - int__mitxpro__courserunenrollments: add courserunenrollment_updated_on
   - int__mitxonline__programenrollments: add programenrollment_updated_on
   These replace the cast(null as varchar) placeholders in tfact_enrollment.

   edxorg (S3 snapshot) and residential (OpenEdX student_courseenrollment) have no
   updated_on field in their source tables; those remain null as enrollment_updated_on.
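The watermark change in point 2 can be illustrated with a minimal Python sketch (hypothetical row shapes; the actual logic lives in the tfact_enrollment incremental SQL):

```python
from datetime import datetime

def watermark_key(row):
    # coalesce(enrollment_updated_on, enrollment_created_on)
    return row["updated_on"] if row["updated_on"] is not None else row["created_on"]

rows = [
    {"id": 1, "created_on": datetime(2026, 1, 1), "updated_on": None},
    # deactivated after creation: only updated_on moved past the watermark
    {"id": 2, "created_on": datetime(2026, 1, 1), "updated_on": datetime(2026, 2, 1)},
]
max_watermark = datetime(2026, 1, 15)

# >= (not >) so updates landing exactly on the boundary second are recaptured
picked = [r["id"] for r in rows if watermark_key(r) >= max_watermark]
print(picked)  # [2]: the status change is no longer silently dropped
```

With the old `created_on`-only watermark, row 2 would have been filtered out and its deactivation lost on incremental runs.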

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…act_enrollment

courserunenrollment_enrollment_status was already present in both the xPro staging
model (as change_status) and intermediate model, but was hardcoded as null in
tfact_enrollment's mitxpro_enrollments CTE. Now correctly mapped.

edxorg (S3 snapshot) and residential (OpenEdX) genuinely have no enrollment_status
field in their source tables; those remain null.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
edxorg courserun_readable_id uses old-style slash format (MITx/6.00.1x/1T2014),
converted from course-v1: format in int__edxorg__mitx_courseruns. The CASE statement
in the dim_course join only stripped the run segment for 'course-v1:' prefixed IDs;
the ELSE branch passed the full 3-segment slash ID as the join key, which never
matched dim_course.course_readable_id (2-segment: MITx/6.00.1x), resulting in
course_fk = NULL for all edxorg course runs.

Fix: add a second WHEN branch using regexp_like to detect 3-segment slash IDs and
strip the trailing /{run} segment before the join.

Note: the AI agent's stated hypothesis (that dim_course stores 3-segment IDs) was
incorrect — dim_course always stores 2-segment course IDs. The actual bug was in
the ELSE branch silently passing the run ID as the course ID for slash-format runs.
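A hypothetical Python mirror of the join-key CASE makes the branch structure concrete (names and the assumed 2-segment `ORG/COURSE` layout are illustrative; the real Trino SQL may differ in detail):

```python
import re

# Old-style 3-segment slash run ID, e.g. MITx/6.00.1x/1T2014
THREE_SEGMENT_SLASH = re.compile(r"^[^/]+/[^/]+/[^/]+$")

def course_join_key(run_readable_id: str) -> str:
    if run_readable_id.startswith("course-v1:"):
        # course-v1:ORG+COURSE+RUN -> ORG/COURSE (assumed 2-segment form)
        org, course, _run = run_readable_id[len("course-v1:"):].split("+", 2)
        return f"{org}/{course}"
    if THREE_SEGMENT_SLASH.match(run_readable_id):
        # the new WHEN branch: strip the trailing /{run} segment
        return run_readable_id.rsplit("/", 1)[0]
    # the old ELSE branch passed slash-format run IDs through unmatched
    return run_readable_id

print(course_join_key("MITx/6.00.1x/1T2014"))  # MITx/6.00.1x
```

Before the fix, the slash-format ID fell into the final `return` and carried its run segment into the join, which is why every edxorg course run ended up with `course_fk = NULL`.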

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
course_number was selected in new_and_changed_courses but omitted from the
incremental change-detection WHERE predicate. Changes to course_number alone
(without a concurrent title/description/is_live change) would be silently dropped
— no new historical record created and the existing row left with stale data.

Fix: add coalesce(existing.course_number, '') = coalesce(current_courses.course_number, '')
to the NOT EXISTS predicate, consistent with how course_description is handled.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
…ection

In dim_course_run, residential course runs have courserun_is_live = NULL (set
explicitly in the residential_courseruns CTE). The comparison
  existing.courserun_is_live = courseruns_with_all_fks.courserun_is_live
evaluates to NULL (not TRUE) when both sides are NULL, so the NOT EXISTS predicate
never matches existing residential rows. Every incremental run treats all residential
course runs as changed, creating unbounded duplicate SCD2 records.

Same pattern found and fixed in dim_course for course_is_live, which can be NULL
for platforms without a native is_live flag.

Fix: wrap both sides with coalesce(..., false) — consistent with how other nullable
fields in the same predicate are handled (courserun_start_on etc. use coalesce with '').
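The NULL-comparison failure mode can be demonstrated by emulating SQL's three-valued logic in Python (helper names are hypothetical; the actual fix is the `coalesce` wrapper in the Trino predicate):

```python
def sql_eq(a, b):
    # SQL three-valued logic: NULL = NULL evaluates to NULL (unknown), not TRUE
    if a is None or b is None:
        return None
    return a == b

def coalesce(value, default):
    return default if value is None else value

# Residential rows: courserun_is_live is NULL on both sides of the predicate
existing, incoming = None, None

print(sql_eq(existing, incoming))
# -> None: the NOT EXISTS match never fires, so every run looks "changed"
print(coalesce(existing, False) == coalesce(incoming, False))
# -> True: with coalesce(..., false) the unchanged row matches correctly
```

Since a `NOT EXISTS` subquery only suppresses a row when its predicate is TRUE, the unknown result behaves like "no match" and the SCD2 logic re-inserts the same record on every incremental run.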

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Contributor

@rachellougee rachellougee left a comment


I think this is good enough for the first pass since dbt runs successfully. However, there are some data quality issues we’ll need to address once we start using these models in reporting - e.g. dim_course_run, dim_course, dim_program

For example, dim_course_run still has a failing test that is currently only set to warn:

15:26:59  Warning in test not_null_dim_course_run_course_fk (models/dimensional/_dim_course_run.yml)
15:26:59  Got 2872 results, configured to warn if != 0
15:26:59  

Looking into one case, the course_fk for source_id = 1803 in dim_course_run is null, but it should reference source_id = 255 in dim_course.

Since these models are not currently used in any downstream models, I’m okay leaving it as-is for now, but we will need to validate them once we start using them.
