Problem
Schemas often include columns that are deterministic functions of other columns. Today, users must compute these outside of dataframely before validation:
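from datetime import date
import polars as pl

df = df.with_columns(
    # pl.date is a function, so today's date is passed in as a literal
    age=(pl.lit(date.today()) - pl.col("birth_date")).dt.total_days() // 365
)
validated = PersonSchema.validate(df)

This scatters transformation logic across the codebase and breaks the "schema as source of truth" model.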
Proposed API
1. @dy.derived() decorator
Define derived columns alongside rules, using the same pattern:
class PersonSchema(dy.Schema):
    birth_date = dy.Date(nullable=False)
    first_name = dy.String(nullable=False)
    last_name = dy.String(nullable=False)

    # Derived columns
    age = dy.Int64(nullable=False)
    full_name = dy.String(nullable=False)

    @dy.derived("age")
    def derive_age(cls) -> pl.Expr:
        # Floor division by a float yields a float, so cast back to Int64.
        days = (pl.lit(date.today()) - cls.birth_date.col).dt.total_days()
        return (days // 365.25).cast(pl.Int64)

    @dy.derived("full_name")
    def derive_full_name(cls) -> pl.Expr:
        return cls.first_name.col + pl.lit(" ") + cls.last_name.col

2. Schema.with_derived() method
Explicitly apply derivations to a dataframe:
# Input only needs source columns
df = pl.DataFrame({
    "birth_date": [date(1990, 5, 15), date(2000, 1, 1)],
    "first_name": ["Alice", "Bob"],
    "last_name": ["Smith", "Jones"],
})

# Add derived columns
df_with_derived = PersonSchema.with_derived(df)
# Now has: birth_date, first_name, last_name, age, full_name

# Then validate as usual
validated = PersonSchema.validate(df_with_derived)

Expected Behavior
- Derived columns are optional in input. with_derived() adds them if missing, overwrites if present.
- Lazy frames are preserved. with_derived() returns a LazyFrame if given a LazyFrame.
- Invalid targets error at class definition time. @dy.derived("x") raises if x is not a column in the schema.
Open Questions
- Circular dependencies? Should we detect/error on a derived from b derived from a? If so, this requires topological sorting of derivations; is this doable?
- Chained derivations? Should derived columns be allowed to depend on other derived columns? e.g., age derived from birth_date, then is_adult derived from age. This would require ordering derivations correctly (topological sort).
- Serialization? Should derived column expressions be included in Schema.serialize()? Unclear how this interacts with the existing serialization machinery.
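On the first two questions, ordering and cycle detection look doable with the standard library alone. The sketch below assumes the dependencies of each derived column are already known as a mapping from column name to the columns it reads (they could perhaps be collected from each expression with polars' Expr.meta.root_names()). The function name derivation_order and the shape of that mapping are hypothetical.

```python
from graphlib import CycleError, TopologicalSorter


def derivation_order(deps: dict[str, set[str]]) -> list[str]:
    """Return derived columns in an order where each one follows its inputs.

    `deps` maps each derived column to the columns its expression reads.
    Source (non-derived) columns appear only as dependencies and are dropped
    from the result. Raises ValueError on a circular dependency.
    """
    sorter = TopologicalSorter(deps)
    try:
        order = list(sorter.static_order())
    except CycleError as exc:
        # exc.args[1] holds the nodes that form the detected cycle.
        cycle = " -> ".join(exc.args[1])
        raise ValueError(f"circular derivation: {cycle}") from exc
    return [name for name in order if name in deps]


# Chained derivation: is_adult reads age, which reads birth_date.
order = derivation_order({"age": {"birth_date"}, "is_adult": {"age"}})
# `age` is listed before `is_adult`, so it can be computed first.
```

One caveat: polars evaluates all expressions in a single with_columns call against the input frame, so chained derivations would need one with_columns call per dependency level (or per column) in the computed order.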