RFC: Derived Columns

### Problem

Schemas often include columns that are deterministic functions of other columns. Today, users must compute these outside of dataframely before validation:

```python
from datetime import date
import polars as pl

df = df.with_columns(
    age=(pl.lit(date.today()) - pl.col("birth_date")).dt.total_days() // 365
)
validated = PersonSchema.validate(df)
```

This scatters transformation logic across the codebase and breaks the "schema as source of truth" model.

### Proposed API

#### 1. `@dy.derived()` decorator

Define derived columns alongside rules, using the same pattern:

```python
from datetime import date

import dataframely as dy
import polars as pl


class PersonSchema(dy.Schema):
    birth_date = dy.Date(nullable=False)
    first_name = dy.String(nullable=False)
    last_name = dy.String(nullable=False)

    # Derived columns
    age = dy.Int64(nullable=False)
    full_name = dy.String(nullable=False)

    @dy.derived("age")
    def derive_age(cls) -> pl.Expr:
        # Floor division by a float yields a float, so cast back to match dy.Int64.
        days = (pl.lit(date.today()) - cls.birth_date.col).dt.total_days()
        return (days // 365.25).cast(pl.Int64)

    @dy.derived("full_name")
    def derive_full_name(cls) -> pl.Expr:
        return cls.first_name.col + pl.lit(" ") + cls.last_name.col
```

#### 2. `Schema.with_derived()` method

Explicitly apply derivations to a dataframe:

```python
# Input only needs source columns
df = pl.DataFrame({
    "birth_date": [date(1990, 5, 15), date(2000, 1, 1)],
    "first_name": ["Alice", "Bob"],
    "last_name": ["Smith", "Jones"],
})

# Add derived columns
df_with_derived = PersonSchema.with_derived(df)
# Now has: birth_date, first_name, last_name, age, full_name

# Then validate as usual
validated = PersonSchema.validate(df_with_derived)
```

### Expected Behavior

- **Derived columns are optional in input.** `with_derived()` adds them if they are missing and overwrites them if they are present.
- **Lazy frames are preserved.** `with_derived()` returns a `LazyFrame` if given a `LazyFrame`.
- **Invalid targets error at class definition time.** `@dy.derived("x")` raises if `x` is not a column in the schema.
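
The definition-time check in the last bullet can be sketched in plain Python without touching dataframely internals. Everything below (`derived`, `MiniSchema`, `Column`, the `_derived_target` attribute) is a hypothetical stand-in, not dataframely's API or implementation. It only illustrates that a class-creation hook such as `__init_subclass__` is enough to reject an unknown target while the class body is being evaluated.

```python
from typing import Callable


class Column:
    """Hypothetical stand-in for a dataframely column definition."""


def derived(target: str) -> Callable:
    """Mark a method as the derivation for column `target` (illustrative only)."""

    def decorator(fn: Callable) -> Callable:
        fn._derived_target = target  # remembered until the class is created
        return fn

    return decorator


class MiniSchema:
    """Toy base class; a real schema would hook into its existing metaclass."""

    def __init_subclass__(cls, **kwargs) -> None:
        super().__init_subclass__(**kwargs)
        # Only attributes defined directly on the subclass are inspected here.
        columns = {n for n, v in vars(cls).items() if isinstance(v, Column)}
        cls._derivations = {}
        for name, value in vars(cls).items():
            target = getattr(value, "_derived_target", None)
            if target is None:
                continue
            if target not in columns:
                raise TypeError(
                    f"{cls.__name__}.{name}: derived target {target!r} "
                    "is not a column of the schema"
                )
            cls._derivations[target] = value


class Person(MiniSchema):
    birth_date = Column()
    age = Column()

    @derived("age")
    def derive_age(cls):
        return "placeholder for a pl.Expr"
```

Declaring `@derived("y")` on a subclass that has no column `y` raises `TypeError` as soon as the `class` statement runs, before any dataframe is involved. The toy ignores inherited columns, which a real implementation would need to handle.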

### Open Questions

1. **Circular dependencies?** Should we detect and raise an error when `a` is derived from `b` and `b` is derived from `a`? Detection would require a topological sort of the derivations. Is that feasible?

2. **Chained derivations?** Should derived columns be allowed to depend on other derived columns? e.g., `age` derived from `birth_date`, then `is_adult` derived from `age`. This would require ordering derivations correctly (topological sort).
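
On feasibility: the ordering needed for question 2 and the cycle detection needed for question 1 both fall out of the same topological sort, and Python's standard library ships one in `graphlib` (3.9+). The sketch below assumes each derivation can report the columns it reads; the `dependencies` mapping and the `derivation_order` helper are hypothetical names. With polars, `Expr.meta.root_names()` might supply the dependency sets, though that is not exercised here.

```python
from graphlib import CycleError, TopologicalSorter


def derivation_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return derived columns so that every dependency is computed first.

    `dependencies` maps each derived column to the columns its expression reads.
    """
    sorter = TopologicalSorter(dependencies)
    try:
        ordered = list(sorter.static_order())
    except CycleError as exc:
        # exc.args[1] holds the list of nodes that form the detected cycle.
        raise ValueError(f"circular derivation: {exc.args[1]}") from exc
    # static_order() also yields plain source columns; keep only derived ones.
    return [name for name in ordered if name in dependencies]


order = derivation_order({"age": {"birth_date"}, "is_adult": {"age"}})
print(order)
```

If chained derivations are allowed, each derived column would likely need its own `with_columns` call applied in this order, since expressions inside a single `with_columns` call are evaluated against the input frame and cannot see each other's output.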

3. **Serialization?** Should derived column expressions be included in `Schema.serialize()`? Unclear how this interacts with the existing serialization machinery.

