Upstream: dotnet#7380
Status: COMPLETE
Classification: bug-report
Confidence: 0.65
Reproduced: ⏭️ Skipped (requires remote database with large dataset)
Area: Core (Microsoft.ML.StandardTrainers / LdSvm)
Investigated at: 2026-03-07
Triage Summary
Category: Bug Report
Reasoning: The user reports that LdSvmTrainer continues to generate excessive network/SQL traffic and is ~100x slower than LightGBM/FastForest despite correctly using mlContext.Data.Cache(train) and setting Cache=true in LdSvmTrainer.Options. The Cache option defaults to true and is supposed to load all examples into memory before training, so ongoing SQL reads are unexpected. An ML.NET contributor has been actively investigating across multiple comments but has not been able to explain or resolve the discrepancy.
Summary: With hundreds of columns and millions of rows loaded from a remote PostgreSQL database, LdSvmTrainer runs ~100x slower and uses significantly less memory than LightGBM/FastForest on the same pipeline. The user correctly re-assigns the cached data (train = mlContext.Data.Cache(train)) and sets Cache=true in trainer options, yet the database connection shows ongoing high network traffic rather than a single initial read followed by in-memory training. This suggests the caching mechanism is not preventing repeated reads from the database loader.
Suggested Labels: bug, needs-info
Source Code Analysis
The LdSvmTrainer has two caching layers:
- User-level cache via mlContext.Data.Cache(data), which caches the raw loaded data before transforms
- Trainer-level cache via the Cache=true option, implemented in the CachedData class in LdSvmTrainer.cs (src/Microsoft.ML.StandardTrainers/LdSvm/LdSvmTrainer.cs, lines 275-276 and 501-556)
The CachedData constructor reads all transformed examples into a LabelFeatures[] array in a single FloatLabelCursor pass. Once loaded, all 1000 iterations sample from the in-memory array, which should prevent repeated DB reads.
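For reference, the trainer-level layer is controlled through LdSvmTrainer.Options. A minimal sketch of how it is wired up (the LdSvm factory call and option names are the public Microsoft.ML API; the surrounding context is illustrative):

```csharp
// Trainer-level cache: Cache = true (the default) makes LdSvmTrainer
// copy every transformed example into an in-memory LabelFeatures[]
// array (the CachedData path) in one FloatLabelCursor pass, before
// running its sampling iterations.
var options = new Microsoft.ML.Trainers.LdSvmTrainer.Options
{
    NumberOfIterations = 1000,
    Cache = true   // default; all iteration sampling should then hit memory, not the loader
};
var trainer = mlContext.BinaryClassification.Trainers.LdSvm(options);
```

If this path is working, the only loader traffic during training should be the single pass that fills the array, which is exactly what the reported sustained traffic contradicts.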
Possible root causes:
- The per-column NormalizeMinMax in a loop (one transform appended per column) may cause N separate passes through the data to fit the pipeline, although those passes should still hit the cached source data, not the DB
- A potential bug in how DatabaseLoader interacts with the data cache, where the cache is not correctly hit on repeated enumerations
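The per-column loop the first point refers to would look roughly like the sketch below (numericColumns is a hypothetical list of column names, not from the user's code). Each appended NormalizeMinMax estimator needs its own fitting pass, so hundreds of columns mean hundreds of passes. NormalizeMinMax also accepts an array of InputOutputColumnPair, which fits all columns in far fewer passes and would be worth testing to isolate the cache behavior:

```csharp
// Per-column loop as described in the issue: each Append adds one
// NormalizeMinMax estimator, each fitted with its own pass over the data.
IEstimator<ITransformer> pipeline =
    mlContext.Transforms.NormalizeMinMax(numericColumns[0]);
foreach (var col in numericColumns.Skip(1))
    pipeline = pipeline.Append(mlContext.Transforms.NormalizeMinMax(col));

// Alternative: one estimator covering all columns at once.
var pairs = numericColumns
    .Select(c => new InputOutputColumnPair(c))
    .ToArray();
var singleNormalizer = mlContext.Transforms.NormalizeMinMax(pairs);
```

If the batched form eliminates the sustained SQL traffic, the problem is the repeated fitting passes bypassing the cache rather than LdSvmTrainer itself.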
Reproduction Notes
Reproduction requires:
- A PostgreSQL (or similar) database with millions of rows and hundreds of columns
- The issue manifests as high sustained SQL network traffic during pipeline.Fit(), visible via network monitoring
The key reproduction steps from the user's code:
- Load via DatabaseLoader + NpgsqlFactory
- Cache with train = mlContext.Data.Cache(train)
- Build a pipeline with per-column MinMax normalization, type conversion, OneHot encoding, and concatenation
- Set Cache=true and NumberOfIterations=1000 in LdSvmTrainer.Options
- Compare network traffic vs. LightGBM on the same pipeline
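The load-and-cache steps above can be sketched as follows. DatabaseLoader, DatabaseSource, and NpgsqlFactory.Instance are the real Microsoft.ML.Experimental/Npgsql APIs; the row class, connection string, and query are placeholders, since the user's actual schema is not in the issue:

```csharp
using Microsoft.ML;
using Microsoft.ML.Data;
using Npgsql;

var mlContext = new MLContext();

// ModelInput is a placeholder POCO mirroring the user's hundreds of columns.
var loader = mlContext.Data.CreateDatabaseLoader<ModelInput>();
var source = new DatabaseSource(
    NpgsqlFactory.Instance,
    "Host=...;Database=...;Username=...",   // placeholder connection string
    "SELECT * FROM training_data");         // placeholder query

IDataView train = loader.Load(source);

// Note the re-assignment: Cache() returns a new IDataView wrapping the
// loader; using the original `train` afterwards would bypass the cache.
train = mlContext.Data.Cache(train);
```

With this setup the expectation is a single burst of SQL traffic on the first enumeration, then silence; sustained traffic through pipeline.Fit() is the reported anomaly.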
Generated by Triage Single Issue