Skip to content

Forward schema_path/target_class to linkml's delimited file loader#235

Open
Sigfried wants to merge 3 commits into
linkml:mainfrom
Sigfried:forward-schema-path-to-delimited-loader
Open

Forward schema_path/target_class to linkml's delimited file loader#235
Sigfried wants to merge 3 commits into
linkml:mainfrom
Sigfried:forward-schema-path-to-delimited-loader

Conversation

@Sigfried
Copy link
Copy Markdown

Summary

Forward schema_path and target_class through TsvFileLoader, CsvFileLoader, DataLoader, and get_file_loader() so they reach linkml's underlying TsvLoader / CsvLoader. This enables schema-aware type coercion for delimited files, preventing string-ranged and enum-ranged columns from being silently coerced to int/float.

Without these parameters, a column like subject_id containing numeric-looking strings (e.g., "00123") gets loaded as 123 (int) by pandas' default inference, losing the leading zero and breaking downstream lookups.

Background

linkml/linkml#3289 ("Make delimited file loader schema-aware to preserve string/enum columns") added schema-awareness to linkml's _DelimitedFileLoader. It was released in linkml v1.11.0. This PR exposes those parameters through linkml-map's loader API so users of linkml-map can benefit from the same fix.

Changes

  • TsvFileLoader.__init__ / CsvFileLoader.__init__: accept optional schema_path and target_class
  • iter_instances(): forward both params to the underlying linkml loader
  • get_file_loader(): kwargs-only schema_path / target_class, passed only to TSV/CSV loader classes (other formats already do their own type handling)
  • DataLoader.__init__: accept and store both params
  • Tests: 11 new `TestSchemaAware*` tests in `tests/test_loaders/test_data_loader.py` covering string-range preservation, enum-range preservation, and the no-schema coercion fallback for each entry point (`TsvFileLoader`, `CsvFileLoader`, `get_file_loader`, `DataLoader`)

The change is additive — existing callers that don't pass the new params get current behavior. Annotations use PEP 604 `X | Y | None` syntax to match the rest of the file's style (no new `typing` imports needed).

Test plan

  • 42 tests pass in `tests/test_loaders/test_data_loader.py` (31 existing + 11 new)
  • Full `pytest` run: 831 passed, 4 skipped, 1 pre-existing failure (`test_graphviz_compiler` — fails on `main` too due to missing local `dot` binary, unrelated)
  • CI passes

🤖 Generated with Claude Code

Pass through the new schema-aware loading params so that string-ranged
and enum-ranged columns in TSV/CSV files are not coerced to int/float.

Requires linkml >=1.11 (PR linkml/linkml#3289 added schema-awareness to
the underlying _DelimitedFileLoader; released in v1.11.0).

Also imports nothing new from `typing` — the new annotations use PEP 604
`X | Y | None` syntax to match the rest of the file's style.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
Sigfried added a commit to linkml/dm-bip that referenced this pull request May 14, 2026
linkml/linkml#3289 was released in linkml v1.11.0; schema-automator/#188
was released in v0.5.5. Switch both from git URL pins to PyPI version
specifiers.

linkml-map fix is still unreleased (PR linkml/linkml-map#235 open) — its
git pin stays in place until that ships.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@amc-corey-cox amc-corey-cox added this to the v0.5.3 milestone May 22, 2026
Drop the kwargs-only schema_path/target_class from get_file_loader and
let them flow through **kwargs to the loader class. The DataLoader call
sites add them to the same conditional block that already gates
skip_empty_rows, so YAML/JSON paths never see them. Passing them to a
non-tabular loader now raises TypeError instead of being silently
swallowed.
@amc-corey-cox amc-corey-cox self-requested a review May 28, 2026 18:10
Copy link
Copy Markdown
Contributor

@amc-corey-cox amc-corey-cox left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'd like to back out the new schema_path / target_class params on DataLoader and get_file_loader — we already have source_schemaview on Session / Transformer, and the identifier passed to data_loader[...] already names the target class. Inlines mark the spots.

self,
source: str | Path,
skip_empty_rows: bool = True,
schema_path: str | Path | None = None,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The leaf-level exposure is right and should stay. The reshape below is about not duplicating the same parameters higher up the stack — DataLoader and get_file_loader don't need to grow these too.

base_path: str | Path,
default_format: FileFormat | None = None,
skip_empty_rows: bool = True,
schema_path: str | Path | None = None,
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We don't need new infrastructure here — the schema is already in scope at the orchestration layer. Session.source_schemaview and Transformer.source_schemaview carry it, and the spec declares source_schema.

Suggested: replace schema_path / target_class with an optional schemaview: SchemaView | None, and let transform_spec and the CLI pass the existing source_schemaview in. End users then get the fix without changing how they call linkml-map.

if file_format in (FileFormat.TSV, FileFormat.CSV):
loader_kwargs["skip_empty_rows"] = self.skip_empty_rows
loader_kwargs["schema_path"] = self.schema_path
loader_kwargs["target_class"] = self.target_class
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The identifier already names the target class — data_loader["Person"] means we want rows typed as Person. Derive target_class = identifier here instead of storing it on self.

Storing a single target_class on the loader doesn't compose with directory mode anyway: different files map to different classes.

if file_format in (FileFormat.TSV, FileFormat.CSV):
loader_kwargs["skip_empty_rows"] = self.skip_empty_rows
loader_kwargs["schema_path"] = self.schema_path
loader_kwargs["target_class"] = self.target_class
Copy link
Copy Markdown
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Single-file mode: derive target_class = self.base_path.stem to match how identifiers map to files everywhere else in the loader.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants