Forward schema_path/target_class to linkml's delimited file loader#235
Forward schema_path/target_class to linkml's delimited file loader#235Sigfried wants to merge 3 commits into
Conversation
Pass through the new schema-aware loading params so that string-ranged and enum-ranged columns in TSV/CSV files are not coerced to int/float. Requires linkml >=1.11 (PR linkml/linkml#3289 added schema-awareness to the underlying _DelimitedFileLoader; released in v1.11.0). Also imports nothing new from `typing` — the new annotations use PEP 604 `X | Y | None` syntax to match the rest of the file's style. Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
linkml/linkml#3289 was released in linkml v1.11.0; schema-automator/#188 was released in v0.5.5. Switch both from git URL pins to PyPI version specifiers. linkml-map fix is still unreleased (PR linkml/linkml-map#235 open) — its git pin stays in place until that ships. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Drop the kwargs-only schema_path/target_class from get_file_loader and let them flow through **kwargs to the loader class. The DataLoader call sites add them to the same conditional block that already gates skip_empty_rows, so YAML/JSON paths never see them. Passing them to a non-tabular loader now raises TypeError instead of being silently swallowed.
There was a problem hiding this comment.
I'd like to back out the new schema_path / target_class params on DataLoader and get_file_loader — we already have source_schemaview on Session / Transformer, and the identifier passed to data_loader[...] already names the target class. Inlines mark the spots.
| self, | ||
| source: str | Path, | ||
| skip_empty_rows: bool = True, | ||
| schema_path: str | Path | None = None, |
There was a problem hiding this comment.
The leaf-level exposure is right and should stay. The reshape below is about not duplicating the same parameters higher up the stack — DataLoader and get_file_loader don't need to grow these too.
| base_path: str | Path, | ||
| default_format: FileFormat | None = None, | ||
| skip_empty_rows: bool = True, | ||
| schema_path: str | Path | None = None, |
There was a problem hiding this comment.
We don't need new infrastructure here — the schema is already in scope at the orchestration layer. Session.source_schemaview and Transformer.source_schemaview carry it, and the spec declares source_schema.
Suggested: replace schema_path / target_class with an optional schemaview: SchemaView | None, and let transform_spec and the CLI pass the existing source_schemaview in. End users then get the fix without changing how they call linkml-map.
| if file_format in (FileFormat.TSV, FileFormat.CSV): | ||
| loader_kwargs["skip_empty_rows"] = self.skip_empty_rows | ||
| loader_kwargs["schema_path"] = self.schema_path | ||
| loader_kwargs["target_class"] = self.target_class |
There was a problem hiding this comment.
The identifier already names the target class — data_loader["Person"] means we want rows typed as Person. Derive target_class = identifier here instead of storing it on self.
Storing a single target_class on the loader doesn't compose with directory mode anyway: different files map to different classes.
| if file_format in (FileFormat.TSV, FileFormat.CSV): | ||
| loader_kwargs["skip_empty_rows"] = self.skip_empty_rows | ||
| loader_kwargs["schema_path"] = self.schema_path | ||
| loader_kwargs["target_class"] = self.target_class |
There was a problem hiding this comment.
Single-file mode: derive target_class = self.base_path.stem to match how identifiers map to files everywhere else in the loader.
Summary
Forward
schema_pathandtarget_classthroughTsvFileLoader,CsvFileLoader,DataLoader, andget_file_loader()so they reach linkml's underlyingTsvLoader/CsvLoader. This enables schema-aware type coercion for delimited files, preventing string-ranged and enum-ranged columns from being silently coerced to int/float.Without these parameters, a column like
subject_idcontaining numeric-looking strings (e.g.,"00123") gets loaded as123(int) by pandas' default inference, losing the leading zero and breaking downstream lookups.Background
linkml/linkml#3289 ("Make delimited file loader schema-aware to preserve string/enum columns") added schema-awareness to linkml's
_DelimitedFileLoader. It was released in linkml v1.11.0. This PR exposes those parameters through linkml-map's loader API so users of linkml-map can benefit from the same fix.Changes
TsvFileLoader.__init__/CsvFileLoader.__init__: accept optionalschema_pathandtarget_classiter_instances(): forward both params to the underlying linkml loaderget_file_loader(): kwargs-onlyschema_path/target_class, passed only to TSV/CSV loader classes (other formats already do their own type handling)DataLoader.__init__: accept and store both paramsThe change is additive — existing callers that don't pass the new params get current behavior. Annotations use PEP 604 `X | Y | None` syntax to match the rest of the file's style (no new `typing` imports needed).
Test plan
🤖 Generated with Claude Code