[triage] upstream#7380: LdSvmTrainer slow behaviour when loading from a remote database #40

@github-actions

Description

Upstream: dotnet#7380
Status: COMPLETE
Classification: bug-report
Confidence: 0.65
Reproduced: ⏭️ Skipped (requires remote database with large dataset)
Area: Core (Microsoft.ML.StandardTrainers / LdSvm)
Investigated at: 2026-03-07


Triage Summary

Category: Bug Report
Reasoning: The user reports that LdSvmTrainer generates sustained network/SQL traffic and is ~100x slower than LightGBM/FastForest, despite correctly using mlContext.Data.Cache(train) and setting Cache=true in LdSvmTrainer.Options. The trainer's Cache option (which defaults to true) is supposed to load all examples into memory before training, so ongoing SQL reads are unexpected. An ML.NET contributor has actively investigated across multiple comments but has not yet explained or resolved the discrepancy.

Summary: With hundreds of columns and millions of rows loaded from a remote PostgreSQL database, LdSvmTrainer runs ~100x slower and uses significantly less memory than LightGBM/FastForest on the same pipeline. The user correctly re-assigns the cached data (train = mlContext.Data.Cache(train)) and sets Cache=true in trainer options, yet the database connection shows ongoing high network traffic rather than a single initial read followed by in-memory training. This suggests the caching mechanism is not preventing repeated reads from the database loader.

Suggested Labels: bug, needs-info

Source Code Analysis

The LdSvmTrainer has two caching layers:

  1. User-level cache via mlContext.Data.Cache(data) — caches raw loaded data before transforms
  2. Trainer-level cache via the Cache=true option — implemented in the CachedData class in LdSvmTrainer.cs (src/Microsoft.ML.StandardTrainers/LdSvm/LdSvmTrainer.cs, lines 275-276 and 501-556)

The CachedData constructor reads all transformed examples into a LabelFeatures[] array in a single FloatLabelCursor pass. Once loaded, all 1000 iterations sample from the in-memory array, which should prevent repeated DB reads.
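
In outline, the trainer-level caching pattern looks like the following sketch. This is illustrative only: the real CachedData class is internal to Microsoft.ML.StandardTrainers, and the names and shapes below are hypothetical.

```csharp
using System;
using System.Collections.Generic;
using System.Linq;

// Illustrative sketch of the trainer-level caching pattern described above.
// The real CachedData class is internal; these names are hypothetical.
sealed class CachedExamples
{
    private readonly (float Label, float[] Features)[] _examples;

    public CachedExamples(IEnumerable<(float Label, float[] Features)> cursor)
    {
        // One pass over the cursor: every example is materialized in memory
        // before any training iteration runs.
        _examples = cursor.ToArray();
    }

    // Subsequent iterations sample only from the in-memory array, so no
    // further reads against the original loader should occur.
    public (float Label, float[] Features) Sample(Random rng)
        => _examples[rng.Next(_examples.Length)];
}
```

If the trainer really follows this pattern, the sustained SQL traffic the user observes would have to come from somewhere upstream of the cursor pass, which is what the root-cause candidates below explore.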

Possible root causes:

  • The per-column NormalizeMinMax in a loop (one transform appended per column) may cause N separate passes through data to fit the pipeline — but this should still use the cached source data, not the DB
  • A potential bug in how DatabaseLoader interacts with DataCache, where the cache is not being correctly hit for repeated enumerations
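
For context, the per-column normalization pattern referenced above likely resembles this reconstruction (column names and variable names are placeholders, not taken from the issue):

```csharp
using Microsoft.ML;

// Reconstructed sketch of the per-column normalization loop described above.
// Each appended NormalizeMinMax estimator needs its own pass over the data
// to fit min/max statistics, i.e. N fitting passes for N columns.
var mlContext = new MLContext();
string[] numericColumns = { /* hundreds of column names */ };

IEstimator<ITransformer> pipeline = null;
foreach (var column in numericColumns)
{
    var norm = mlContext.Transforms.NormalizeMinMax(column);
    if (pipeline == null)
        pipeline = norm;
    else
        pipeline = pipeline.Append(norm);
}
// Even with N fitting passes, each pass should hit the cached IDataView,
// not the database loader.
```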

Reproduction Notes

Reproduction requires:

  • A PostgreSQL (or similar) database with millions of rows and hundreds of columns
  • Network monitoring of the database connection during pipeline.Fit(); the issue manifests as sustained high SQL traffic rather than a single initial burst

The key reproduction steps from the user's code:

  1. Load via DatabaseLoader + NpgsqlFactory
  2. Cache with train = mlContext.Data.Cache(train)
  3. Build a pipeline with per-column MinMax normalization, type conversion, OneHot encoding, concatenation
  4. Set Cache=true and NumberOfIterations=1000 in LdSvmTrainer.Options
  5. Compare network traffic vs. LightGBM on same pipeline
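
Assembled from the steps above, the reported setup is roughly the following. The connection string, ModelInput type, and column names are placeholders, not details from the issue:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;
using Microsoft.ML.Trainers;
using Npgsql;

// Sketch of the reported repro, assembled from steps 1-5 above.
// ModelInput, connectionString, and featureColumns are placeholders.
var mlContext = new MLContext();

// Step 1: load via DatabaseLoader + NpgsqlFactory.
var loader = mlContext.Data.CreateDatabaseLoader<ModelInput>();
var dbSource = new DatabaseSource(NpgsqlFactory.Instance,
    connectionString, "SELECT * FROM training_data");
IDataView train = loader.Load(dbSource);

// Step 2: user-level cache; should yield one initial read from the DB.
train = mlContext.Data.Cache(train);

// Step 4: trainer-level cache (defaults to true) plus 1000 iterations.
var options = new LdSvmTrainer.Options
{
    NumberOfIterations = 1000,
    Cache = true
};

// Step 3 (abbreviated): transforms, then the trainer.
var pipeline = mlContext.Transforms.Concatenate("Features", featureColumns)
    .Append(mlContext.BinaryClassification.Trainers.LdSvm(options));

// Expected: a single burst of SQL traffic here, then in-memory training.
// Observed (per the issue): sustained high traffic throughout Fit().
var model = pipeline.Fit(train);
```

Step 5 then repeats the same pipeline with a LightGBM trainer in place of LdSvm and compares network traffic during Fit().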

Generated by Triage Single Issue

Metadata

Labels: ai-investigation (AI-investigated issue), bug (Something isn't working), triage (Triage tracking issue)
