RFC: Derived Columns #237

@brendancooley

Problem

Schemas often include columns that are deterministic functions of other columns. Today, users must compute these outside of dataframely before validation:

from datetime import date

import polars as pl

df = df.with_columns(
    age=(pl.lit(date.today()) - pl.col("birth_date")).dt.total_days() // 365
)
validated = PersonSchema.validate(df)

This scatters transformation logic across the codebase and breaks the "schema as source of truth" model.

Proposed API

1. @dy.derived() decorator

Define derived columns alongside rules, using the same pattern:

class PersonSchema(dy.Schema):
    birth_date = dy.Date(nullable=False)
    first_name = dy.String(nullable=False)
    last_name = dy.String(nullable=False)

    # Derived columns
    age = dy.Int64(nullable=False)
    full_name = dy.String(nullable=False)

    @dy.derived("age")
    def derive_age(cls) -> pl.Expr:
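        # Floor division by a float yields a float column, so cast back to match dy.Int64.
        return ((pl.lit(date.today()) - cls.birth_date.col).dt.total_days() // 365.25).cast(pl.Int64)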

    @dy.derived("full_name")
    def derive_full_name(cls) -> pl.Expr:
        return cls.first_name.col + pl.lit(" ") + cls.last_name.col

2. Schema.with_derived() method

Explicitly apply derivations to a dataframe:

# Input only needs source columns
df = pl.DataFrame({
    "birth_date": [date(1990, 5, 15), date(2000, 1, 1)],
    "first_name": ["Alice", "Bob"],
    "last_name": ["Smith", "Jones"],
})

# Add derived columns
df_with_derived = PersonSchema.with_derived(df)
# Now has: birth_date, first_name, last_name, age, full_name

# Then validate as usual
validated = PersonSchema.validate(df_with_derived)

Expected Behavior

  • Derived columns are optional in input. with_derived() adds them if missing, overwrites if present.
  • Lazy frames are preserved. with_derived() returns a LazyFrame if given a LazyFrame.
  • Invalid targets error at class definition time. @dy.derived("x") raises if x is not a column in the schema.
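
The definition-time check in the last bullet could be sketched in plain Python as follows. Everything here (Column, derived, Schema, the _derived_target attribute) is a hypothetical stand-in, not dataframely's actual machinery, and the sketch only inspects columns declared directly on the class, not inherited ones.

```python
class Column:
    """Stand-in for a column definition such as dy.String(...)."""


def derived(target: str):
    """Hypothetical decorator: tag a function with the column it derives."""
    def decorator(fn):
        fn._derived_target = target
        return fn
    return decorator


class Schema:
    def __init_subclass__(cls, **kwargs):
        super().__init_subclass__(**kwargs)
        # Columns declared directly in the class body.
        columns = {n for n, v in vars(cls).items() if isinstance(v, Column)}
        derivations = {}
        for name, value in vars(cls).items():
            target = getattr(value, "_derived_target", None)
            if target is None:
                continue
            if target not in columns:
                raise TypeError(
                    f"{cls.__name__}.{name} derives unknown column {target!r}"
                )
            derivations[target] = value
        cls._derivations = derivations


class Good(Schema):
    age = Column()

    @derived("age")
    def derive_age(cls):
        return "age expression"


try:
    class Bad(Schema):
        age = Column()

        @derived("agee")  # typo: no such column
        def derive_age(cls):
            return "age expression"
except TypeError as exc:
    error_message = str(exc)
```

The error surfaces when the class statement runs, so a typo in a target name is caught at import time rather than on the first call to with_derived().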

Open Questions

  1. Circular dependencies? Should we detect and raise an error when a is derived from b and b is derived from a? Detecting this requires a topological sort of the derivations. Is that feasible?

  2. Chained derivations? Should derived columns be allowed to depend on other derived columns? e.g., age derived from birth_date, then is_adult derived from age. This would require ordering derivations correctly (topological sort).

  3. Serialization? Should derived column expressions be included in Schema.serialize()? Unclear how this interacts with the existing serialization machinery.
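
On questions 1 and 2, ordering and cycle detection look tractable with the standard library. The sketch below assumes each derived column's dependencies are already known as a set of column names (in Polars these could plausibly be read from Expr.meta.root_names(), though that wiring is not shown); derivation_order is a hypothetical helper name.

```python
from graphlib import CycleError, TopologicalSorter


def derivation_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return derived columns in an order where dependencies come first.

    `dependencies` maps each derived column to the columns its expression reads.
    """
    derived = set(dependencies)
    # Only edges between derived columns matter; source columns already exist.
    graph = {col: deps & derived for col, deps in dependencies.items()}
    try:
        return list(TopologicalSorter(graph).static_order())
    except CycleError as exc:
        # The second element of args holds the nodes that form the cycle.
        cycle = " -> ".join(exc.args[1])
        raise ValueError(f"circular derivation: {cycle}") from exc


# Chained derivation: is_adult reads age, which reads birth_date.
order = derivation_order({"is_adult": {"age"}, "age": {"birth_date"}})

# Circular derivation: a reads b and b reads a.
try:
    derivation_order({"a": {"b"}, "b": {"a"}})
    cycle_detected = False
except ValueError:
    cycle_detected = True
```

with_derived() could then apply one with_columns call per derived column in this order, so that each expression sees the columns it depends on.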
